HTML code embedded in Excel file

Hello,

I need help with a tricky Excel file. I have activated this in Studio already and it has helped a bit:

Screenshot of Trados Studio project settings showing 'Contenu incorpore' highlighted with regular expression rules for embedded content.

But as you can see, there is still HTML code embedded in the text (for non-breaking spaces, typographic apostrophes, em dashes, etc.):

Screenshot of an Excel file in Trados Studio with HTML code visible, such as '&nbsp;' for non-breaking spaces and '&mdash;' for em-dash.

Is there a way to deal with that in Studio 2022? Here is a sample of the file in question:

Sample Excel.xlsx

I know you have answered a few questions like this one already, but I can't find anything on embedded HTML code with & and ;

Thank you!



Generated Image Alt-Text
[edited by: RWS Community AI at 4:02 AM (GMT 1) on 24 Oct 2024]

    While the other suggested solution is likely the most efficient in terms of time spent dealing with the HTML entities, it means you end up with tags, so you'll get "Please note that we<TAG>re aware that...". And if that tag (standing in for an apostrophe) is missing from the target, you'll get an error message at validation. That's fine; it's the quickest way to handle the entities.

    But since they are encoded characters, you could convert them into the actual characters with some kind of find-and-replace process. Then you have the real characters in the editor, which is much nicer to work with, and you get much better TM leverage. It's quite possible that there are only a few different entities in your source, such as em dashes, en dashes, non-breaking spaces and apostrophes.
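    To illustrate the find-and-replace idea outside Studio, here is a minimal Python sketch. It assumes the source text has already been extracted from the Excel cells as plain strings; the function names `count_entities` and `decode_entities` and the sample sentence are made up for the example. It first lists which entities actually occur (to check whether there really are only a few) and then converts them into characters with the standard library's `html.unescape`.

```python
import html
import re
from collections import Counter

# Matches named (&nbsp;), decimal (&#8217;) and hexadecimal (&#x2014;) entities.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def count_entities(text):
    """Return a Counter of every entity found in the text."""
    return Counter(ENTITY_RE.findall(text))


def decode_entities(text):
    """Replace each matched entity with the character it stands for."""
    return ENTITY_RE.sub(lambda m: html.unescape(m.group(0)), text)


if __name__ == "__main__":
    # Hypothetical sample segment, not taken from the real file.
    sample = "Please note that we&rsquo;re aware&nbsp;&mdash; thank you."
    print(count_entities(sample))
    print(repr(decode_entities(sample)))
```

    Running the inventory first shows how many distinct find-and-replace pairs you would actually need to set up in whatever tool does the conversion.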

    I think CleanUp Task would be the best option for this. There are also reserved characters in HTML; these must be converted back:

    < (&lt;), > (&gt;), & (&amp;), " (&quot;), ' (&apos; or &#39;)

    But make sure you do this only to the text, not to the code. I think CleanUp Task only handles the text, which is why I recommended it.
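    As a rough illustration of converting the reserved characters listed above, the following Python sketch uses an explicit replacement table. It is a standalone example, not a description of how CleanUp Task works internally, and the name `unescape_reserved` is hypothetical. The replacement runs in a single pass, so text that was escaped twice (for instance `&amp;lt;`) is only unescaped one level and does not accidentally turn into markup.

```python
import re

# The reserved characters from the list above and their entity forms.
RESERVED = {
    "&lt;": "<",
    "&gt;": ">",
    "&amp;": "&",
    "&quot;": '"',
    "&apos;": "'",
    "&#39;": "'",
}

# One alternation covering every entity in the table.
_RESERVED_RE = re.compile("|".join(re.escape(entity) for entity in RESERVED))


def unescape_reserved(text):
    """Convert the reserved-character entities back, one level only."""
    return _RESERVED_RE.sub(lambda m: RESERVED[m.group(0)], text)


if __name__ == "__main__":
    # Apply this to translatable text only, never to real markup.
    print(unescape_reserved("Terms &amp; conditions: &quot;5 &lt; 10&quot;"))
```

    Chaining separate find-and-replace steps instead would make the order matter: replacing `&amp;` before `&lt;` would turn `&amp;lt;` into `<`, which is why the sketch uses one pass.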

    Good luck!
