HTML code conversion into Studio

Hi everyone - not sure if this is a simple question that I'm just not spotting, or something more complex, but I'd be grateful for any suggestions. I'm guessing it's a Regex question but I may be wrong. I have .xlsx files that contain HTML. I've been able to deal with basic tags using Paul's article "Handling taggy Excel files in Studio", but it doesn't seem to handle HTML codes for special characters.
 
For example, I have
&#39; which should read '
&#34; which should read "
&#169 which should read ©
 
amongst others - is there a way to make Studio read those as the correct characters on opening the source file?
Thanks in advance for any help - and Happy New Year to you all.
Rob
  • Hi Rob,
    You are correct that this is a bit RegEx-y.

    The "Easy" solution is to add a new 'Tag definition rule' in the File Types settings for your Excel filter. It'll just be a Placeholder tag type:
    &[#\w\d]+?;
    (I also recommend opening the 'Advanced' options and setting it to 'always include with text'.)

    This will convert the numeric entities into tags, but not actual characters. It may also cause some problems for your linguists since some languages may use apostrophes, quotes, etc. differently than English does.
    For example, if they remove these placeable tags, QA will start throwing errors, and if they add punctuation, that punctuation will not be converted into the appropriate escaped entity during target file generation. The quality may suffer a bit depending on how frequently these items appear.
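To see what the placeholder rule above would pick up, here is a minimal sketch that runs the same pattern in Python. Studio uses the .NET regex engine, so this is only an approximation of its behaviour; the sample string and the `find_entities` helper are made up for illustration.

```python
import re

# The same pattern as the tag definition rule above.
ENTITY_RULE = re.compile(r"&[#\w\d]+?;")

def find_entities(text):
    """Return every substring the placeholder rule would turn into a tag."""
    return ENTITY_RULE.findall(text)

# Hypothetical cell content, not taken from a real file.
sample = "Rob&#39;s &quot;guide&quot; &#169; 2015, plus a bare &#169 with no semicolon"

print(find_entities(sample))
# ['&#39;', '&quot;', '&quot;', '&#169;']
```

Note that the pattern requires the closing semicolon, so an entity written as `&#169` without one is not matched and would stay in the segment as plain text.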


    The "Hard" solution would actually require converting the file into a new format that would support the use of an actual HTML file type parser. This would enable true conversion of the escaped entities into actual characters both going into translation as well as for target file generation. This would provide better translatable content, translation memory entries, and translated files.
    This can be done a few different ways; which one is simplest depends on the content and structure of your XLSX files.
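As a rough illustration of what "true conversion" means here, the sketch below decodes entities to real characters with Python's standard `html` module. It is not what Studio's HTML parser does internally, just the same idea in miniature; `decode_cell` and the sample cell strings are hypothetical.

```python
import html

def decode_cell(text):
    """Turn HTML character references into the characters they stand for."""
    return html.unescape(text)

# Hypothetical cell values, including one entity without its semicolon.
cells = ["Rob&#39;s file", "&quot;quoted&quot;", "&#169 2015"]

decoded = [decode_cell(c) for c in cells]
print(decoded)
# ["Rob's file", '"quoted"', '© 2015']

# Going back the other way, html.escape only re-escapes &, <, > and quotes.
print(html.escape(decoded[1]))
# &quot;quoted&quot;
```

The round trip is not symmetric: `html.escape` leaves characters such as © as they are, so a real workflow would need to decide which characters must be written back as entities in the target file.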

     

    * Edited - added ? to the RegEx.  Could be dangerous without.

  • Unknown said:
    The "Hard" solution would actually require converting the file into a new format that would support the use of an actual HTML file type parser.

    This is in fact usually not that hard at all - simple conversion to XML (and back to Excel) using the process described at http://www.excel-easy.com/examples/xml.html is usually enough.
