HTML code conversion into Studio

Hi Everyone - not sure if this is a simple question that I'm just not spotting, or something more complex, but I'd be grateful for any suggestions - I'm guessing it's a Regex question but I may be wrong.  I have .xlsx files that contain HTML.  I've been able to deal with basic tags using Paul's article "Handling taggy Excel files in Studio", but it doesn't seem to understand the HTML codes for special characters.
 
For example, I have
&#39; which should read '
&#34; which should read "
&#169 which should read ©
 
amongst others - is there a way to make Studio read those as the correct characters on opening the source file?
Thanks in advance for any help - and Happy New Year to you all.
Rob
  • Hi Rob,
    You are correct that this is a bit RegEx-y.

    The "Easy" solution is to add a new 'Tag definition rule' in the File Types settings for your Excel filter. It'll just be a Placeholder tag type:
    &[#\w\d]+?;
    (I also recommend changing the 'Advanced' options and setting it to 'always include with text'.)

    This will convert the numeric entities into tags, but not actual characters. It may also cause some problems for your linguists since some languages may use apostrophes, quotes, etc. differently than English does.
    For example if they remove these place-able tags then the QA will start throwing errors, and if they add additional punctuation then that punctuation will not be converted into the appropriate escaped entity during target file generation. The quality may suffer a bit depending on how frequently these items appear.
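To see what that pattern picks up, here is a minimal sketch in Python. Python's `re` module only stands in for Studio's own regex engine, so treat it as an approximation, and the sample string is made up for illustration. Note that an entity written without its closing semicolon, such as `&#169`, is not matched and would stay as plain text.

```python
import re

# The pattern from the tag definition rule above.
ENTITY = re.compile(r"&[#\w\d]+?;")

# Hypothetical cell content, invented for this example.
sample = "Height: 5&#39;9&#34; &copy; &#169 2015"

# Each match is what would become a placeholder tag.
matches = ENTITY.findall(sample)
print(matches)

# "&#169" has no trailing ";" so the pattern skips it.
unmatched = ENTITY.sub("", sample)
print(unmatched)
```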


    The "Hard" solution would actually require converting the file into a new format that would support the use of an actual HTML file type parser. This would enable true conversion of the escaped entities into actual characters both going into translation as well as for target file generation. This would provide better translatable content, translation memory entries, and translated files.
    This can be done a few different ways, though the simplest methodology will be determined by the content and structure of your XLSX files.
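As a rough illustration of what "true conversion" of escaped entities means, the sketch below uses Python's standard `html` module. This is not how Studio's HTML parser is implemented, only an example of the same idea, and the sample string is hypothetical.

```python
import html

# Hypothetical cell content, invented for this example.
source = "5&#39;9&#34; &#169 2015"

# Decode character references into the real characters,
# which is what a translator would need to see.
decoded = html.unescape(source)
print(decoded)

# Re-escaping for the target file does not necessarily give
# back the same spelling of each entity as the source used.
re_escaped = html.escape(decoded)
print(re_escaped)
```

The second step shows why a real file type parser matters for target file generation: something has to decide which characters are written back as entities, and in which form.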

     

    * Edited - added ? to the RegEx.  Could be dangerous without.

  • Thanks Kyle - I thought that was the answer, but it could never be that simple! I tried your "easy" solution, and it locked the codes as tags, which looked good. However, the target language is Arabic (source is English) and when I "save as target" the codes come out reversed - e.g. ;93#&. In any case, in Arabic abbreviations are not generally used, so we need to translate the symbol ' as the Arabic word for feet. On that basis the translator needs to see the symbol, rather than the code or tag - e.g. 5'9" or 5 feet nine inches - so they can understand and translate it into complete words, rather than repeating the abbreviated symbols.

    I assume your "Hard" option should resolve that, but I set up a schema.xml file using Evzen's suggestion, and saved the spreadsheet out into XML - the conversion worked fine, but when I opened that in Trados, I was back to the initial stage of no tags being recognised.

    I'm guessing there that I need to set up a parser in the XML Embedded content area - I tried playing with that, but with no success, either for standard HTML tags, or converting the feet symbols.

    As you'll gather, I'm new to this aspect of Studio - very grateful for any further help you can offer.
  • Robert Whitaker said:
    I assume your "Hard" option should resolve that, but I set up a schema.xml file using Evzen's suggestion, and saved the spreadsheet out into XML - the conversion worked fine, but when I opened that in Trados, I was back to the initial stage of no tags being recognised.

    Hi  

    The advantage of that way is that you can use the html filetype as an embedded filetype inside the XML.  If you do that all of your tags will be automatically handled exactly as you desire.

    It's a little long, but this article explains what you need to know:

    https://multifarious.filkin.com/2014/06/01/custom-xml/
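For anyone wondering what "HTML embedded inside XML" looks like in practice, here is a small Python sketch using only the standard library. The element names `rows` and `cell` and the cell text are hypothetical, and this says nothing about how the spreadsheet export or Studio's embedded content processor work internally. It only shows that the HTML ends up escaped once more inside the XML, which is why a plain XML file type sees no tags until an embedded HTML parser is applied to that element.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical cell content containing HTML markup and entities.
cell_text = "<b>Height:</b> 5&#39;9&#34;"

# Hypothetical element names, chosen for illustration only.
root = ET.Element("rows")
cell = ET.SubElement(root, "cell")
cell.text = cell_text

# Serialising escapes "<", ">" and "&", so the HTML is now just text.
xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)

# An XML parser hands back the original HTML string...
recovered = ET.fromstring(xml_string).find("cell").text

# ...and an HTML-aware step is still needed to decode the entities.
print(html.unescape(recovered))
```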

    Regards

    Paul

    Paul Filkin | RWS Group

