HTML code embedded in Excel file

Hello,

I need help with a tricky Excel file. I have activated this in Studio already and it has helped a bit:

Screenshot of Trados Studio project settings showing 'Contenu incorpore' highlighted with regular expression rules for embedded content.

But as you can see, there is still HTML code embedded in the text (for non-breaking spaces, typographic apostrophes, em dashes, etc.):

Screenshot of an Excel file in Trados Studio with HTML code visible, such as ' ' for non-breaking spaces and '—' for em-dash.

Is there a way to deal with that in Studio 2022? Here is a sample of the file in question:

Sample Excel.xlsx

I know you have answered a few questions like this one already, but I can't find anything on embedded HTML code that starts with & and ends with ;

Thank you!



[edited by: RWS Community AI at 4:02 AM (GMT 1) on 24 Oct 2024]
  •  

    The regular expression provided there will treat most tags as individual inline tags. That is not true HTML tagging, but it works in most cases. However, what you have are not tags but so-called entities, and the existing expression does not cover them. You can add &[^;]*?; as an additional rule to capture the entities as well. This must be done BEFORE you add the Excel file.
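
To see what the extra rule would catch, here is a minimal Python sketch. It is only an illustration outside Studio: Studio applies the pattern with its own regex engine, and the sample sentence and the stricter `STRICT_ENTITY` pattern are assumptions of mine, not something taken from the thread.

```python
import re

# Pattern suggested above: "&", then as few non-";" characters as possible, then ";".
LOOSE_ENTITY = re.compile(r"&[^;]*?;")

# Hypothetical stricter variant (an assumption, not from the thread): only named
# entities such as &nbsp; or numeric ones such as &#8212; / &#x2014;.
STRICT_ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")

# Made-up sample text containing three entities and one literal "&".
sample = "We&rsquo;re aware&nbsp;&mdash; Tom & Jerry; more text."

loose_matches = LOOSE_ENTITY.findall(sample)
strict_matches = STRICT_ENTITY.findall(sample)

print(loose_matches)   # ['&rsquo;', '&nbsp;', '&mdash;', '& Jerry;']
print(strict_matches)  # ['&rsquo;', '&nbsp;', '&mdash;']
```

The loose pattern is simpler and is probably fine for most files, but it would also turn a literal "&" followed later by ";" into a tag. If your text contains such cases, a tighter pattern along these lines may be worth testing.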

    Another, possibly better, solution would be to use Excel Multilingual, where you can add an HTML filter directly so that all tags and entities are processed properly. If that is not feasible, I would modify the Excel file type by defining the tags as they are in HTML, with tag pairs and so on. That means more work, but gives much better results.

  •  

    While the solution above is likely the most efficient in terms of time spent dealing with the HTML entities, it means you end up with tags in the segment, for example "Please note that we<TAG>re aware that...". If that tag (here, an apostrophe) is missing from the target, you will get an error message at validation. That is fine if you just want the quickest way to handle the entities.

    But since they are encoded characters, you could convert them into the actual characters with some kind of find-and-replace process. You then have the real characters in the editor, which is much nicer to work with and gives much better TM leverage. It is quite possible that your source contains only a few different entities, such as em dashes, en dashes, non-breaking spaces and apostrophes.
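
Before setting up any replacements, it may help to check how many distinct entities the source really contains. The following is a rough Python sketch, assuming the cell texts have already been pulled out of the workbook as plain strings; `count_entities` and the sample cells are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

# Named (&nbsp;) or numeric (&#8212;, &#x2014;) entities.
ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def count_entities(cells):
    """Count how often each entity occurs across a list of cell texts."""
    counts = Counter()
    for text in cells:
        counts.update(ENTITY.findall(text))
    return counts


# Made-up sample cells standing in for the real Excel content.
cells = [
    "We&rsquo;re aware&nbsp;&mdash; thanks.",
    "Price:&nbsp;10&nbsp;&euro;",
    "It&rsquo;s done.",
]

entity_counts = count_entities(cells)
for entity, n in entity_counts.most_common():
    print(entity, n)
```

If the list turns out to be short, a handful of find-and-replace rules should be enough.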

    I think CleanUp Task would be the best option for this. Note that some characters are reserved in HTML, so if you convert their entities to characters, they must be converted back to entities afterwards:

    < (&lt;), > (&gt;), & (&amp;), " (&quot;), ' (&apos; or &#39;)

    But make sure you do this only to the text, not to the code. I think CleanUp Task only handles the text, which is why I recommend it.
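
The logic can be sketched in Python as follows. This is not how CleanUp Task works internally; it is just an illustration using the standard `html` module, and `decode_safe_entities` is a hypothetical helper. Instead of decoding everything and converting the reserved characters back afterwards, this variant simply never decodes the reserved ones, which should give the same end result.

```python
import html
import re

# Named (&nbsp;) or numeric (&#8212;, &#x2014;) entities.
ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")

# Characters that must stay encoded in HTML.
RESERVED = set("<>&\"'")


def decode_safe_entities(text):
    """Decode entities to characters, except those for reserved characters."""
    def replace(match):
        entity = match.group(0)
        char = html.unescape(entity)  # unknown entities come back unchanged
        return entity if char in RESERVED else char

    return ENTITY.sub(replace, text)


# Made-up sample: typographic entities plus reserved ones that must survive.
sample = "We&rsquo;re aware&nbsp;&mdash; use &lt;b&gt; &amp; &quot;bold&quot;."
decoded = decode_safe_entities(sample)
print(decoded)
```

Here the apostrophe, non-breaking space and em dash become real characters, while `&lt;`, `&gt;`, `&amp;` and `&quot;` are left as they are.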

    Good luck!
