HTML code conversion into Studio

Question

Hi Everyone - not sure if this is a simple question that I'm just not spotting, or something more complex, but I'd be grateful for any suggestions - I'm guessing it's a Regex question but may be wrong. I have .xlsx files that contain HTML. I've been able to deal with basic tags using Paul's article "Handling taggy Excel files in Studio", but it doesn't seem to handle the HTML codes for special characters. 
 
 For example, I have &#39; which should read ', &#34; which should read ", and &#169; which should read © 
 
 amongst others - is there a way to make Studio read those as the correct characters on opening the source file? 
 Thanks in advance for any help - and Happy New Year to you all. 
 Rob

Kyle Budd · Answer

Hi Rob, You are correct that this is a bit RegEx-y.

The "Easy" solution is to add a new 'Tag definition rule' in the File Types settings for your Excel filter. It'll just be a Placeholder tag type: &[#\w\d]+?; (I also recommend changing the 'Advanced' options and setting it to 'always include with text'.) This will convert the entities into tags, but not into actual characters. It may also cause some problems for your linguists, since some languages use apostrophes, quotes, etc. differently than English does. For example, if they remove these placeable tags then the QA will start throwing errors, and if they add additional punctuation then that punctuation will not be converted into the appropriate escaped entity during target file generation. The quality may suffer a bit depending on how frequently these items appear.

The "Hard" solution would require converting the file into a new format that supports the use of an actual HTML file type parser. This would enable true conversion of the escaped entities into actual characters, both going into translation and during target file generation. It would give you better translatable content, better translation memory entries, and better translated files. This can be done a few different ways, though the simplest method will depend on the content and structure of your XLSX files. 
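To make the "true conversion" in the Hard option concrete, here is a minimal Python sketch of the round trip an HTML-aware parser performs: entities become characters on the way in, and selected characters become entities again on the way out. It is only an illustration outside Studio, not Studio's own implementation; the function names `decode_entities` and `encode_numeric` and the default set of re-escaped characters are assumptions made for this example.

```python
import html

def decode_entities(text: str) -> str:
    """Turn HTML character references into the characters they stand for."""
    return html.unescape(text)

def encode_numeric(text: str, chars: str = "'\"©") -> str:
    """Re-escape the chosen characters as decimal numeric entities.

    The default character set is a hypothetical choice for this example,
    covering the three entities mentioned in the question.
    """
    return "".join(f"&#{ord(c)};" if c in chars else c for c in text)

source = "It&#39;s &#34;new&#34; &#169; 2020"
decoded = decode_entities(source)
print(decoded)                            # It's "new" © 2020
print(encode_numeric(decoded) == source)  # True
```

With the placeholder-tag approach, by contrast, the translator still sees a tag where the apostrophe or quote should be, which is why the text in the editor and in the translation memory is less natural.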
 
 * Edited - added ? to the RegEx. Could be dangerous without.
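If you want to check what the placeholder pattern would capture before adding it to the file type settings, a quick test outside Studio can help. The sketch below uses Python's `re` module, on the assumption that Studio's regex engine treats this particular pattern the same way; the sample string and the `find_entities` helper are made up for illustration.

```python
import re

# The placeholder pattern from the answer above, copied verbatim.
ENTITY_PATTERN = re.compile(r"&[#\w\d]+?;")

def find_entities(text: str) -> list[str]:
    """Return every substring the placeholder pattern would turn into a tag."""
    return ENTITY_PATTERN.findall(text)

# Hypothetical cell content mixing numeric entities, a named entity,
# a bare ampersand, and an entity with its semicolon missing.
sample = "It&#39;s &#34;new&#34; &copy; 2020, R&D &#169 pending"
print(find_entities(sample))  # ['&#39;', '&#34;', '&#34;', '&copy;']
```

Note that the pattern picks up named entities such as `&copy;` as well as numeric ones, and that an entity with no closing semicolon (`&#169`) is not matched, so it would stay as plain text in the editor.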

Paul · Answer

Robert Whitaker said: I assume your "Hard" option should resolve that, but I set up a schema.xml file using Evzen's suggestion, and saved the spreadsheet out into XML - the conversion worked fine, but when I opened that in Trados, I was back to the initial stage of no tags being recognised. 
 Hi Robert Whitaker 
 The advantage of doing it that way is that you can use the HTML filetype as an embedded filetype inside the XML. If you do that, all of your tags will be handled automatically, exactly as you want. 
 It's a little long, but this article explains what you need to know: 
 https://multifarious.filkin.com/2014/06/01/custom-xml/ 
 Regards 
 Paul
