XML files containing HTML encodings

Hi,

We have run into a problem when analysing a French XML file in SDL Studio 2017.
We cannot get the French accented characters to display correctly. The source file encodes these characters as HTML entities, and in Studio these entities appear as tags.
Do you know a solution for this problem?

Kind regards and thanks in advance,
Margo

  • Check the entity processing settings of your HTML parser. They can be the reason for the wrong characters. Alternatively, please post an extract of your XML here, including the text with those French letters.

    _________________________________________________________

    When asking for help here, please be as accurate as possible. Please always remember to give the exact version of product used and all possible error messages received. The better you describe your problem, the better help you will get.

    Want to learn more about Trados Studio? Visit the Community Hub. Have a good idea to make Trados Studio better? Publish it here.

  • Hi Jerzy,

    Thanks for your reaction.

    I already tried several HTML entity settings, but all in vain up to now.

    Please find an extract below.

    Kind regards,

    Margo

     

     

    <sitecore>
     
      <phrase path="/sitecore/content/Sites/France/Home/Self storage in France/Avignon/Avignon" key="Avignon" itemid="{B48AB076-9DF9-4DA3-A130-50E3DD8B3E2D}" fieldid="General information" updated="20170626T083947Z">
        <fr-FR>&lt;h2&gt;&amp;Agrave; propos de Shurgard Avignon&lt;/h2&gt;
    &lt;p&gt;Notre centre de stockage &amp;agrave; Avignon a r&amp;eacute;cemment &amp;eacute;t&amp;eacute; renouvel&amp;eacute;. Le site offre toutes les fonctionnalit&amp;eacute;s n&amp;eacute;cessaires pour r&amp;eacute;pondre &amp;agrave; vos besoins.&amp;nbsp;&lt;/p&gt;
    &lt;ul&gt;
        &lt;li&gt;672 espaces de stockage au rez-de-chauss&amp;eacute; ou &amp;agrave; l'&amp;eacute;tage.&lt;/li&gt;
        &lt;li&gt;Box de stockage avec acc&amp;egrave;s direct permettant de vous garer devant la porte.&lt;/li&gt;
        &lt;li&gt;Ascenseur et chariots pour d&amp;eacute;placer facilement vos affaires.&lt;/li&gt;
        &lt;li&gt;Un acc&amp;egrave;s direct int&amp;eacute;rieur qui peut accueillir tout type de v&amp;eacute;hicule.&lt;/li&gt;
        &lt;li&gt;Des places de parking disponibles juste devant l'accueil.&lt;/li&gt;
    &lt;/ul&gt;
    &lt;p&gt;Informations pour acc&amp;eacute;der au centre de stockage&amp;nbsp;:&lt;/p&gt;
    &lt;ul&gt;
        &lt;li&gt;Shurgard est situ&amp;eacute; proche du centre commercial Cap Sud.&lt;/li&gt;
        &lt;li&gt;Par la route de Marseille, prenez le rond-point du Lac de Saint Chamand.&lt;/li&gt;
        &lt;li&gt;2 arr&amp;ecirc;ts de bus en direction du centre-ville d'Avignon ou de Montfavet.&lt;/li&gt;
    &lt;/ul&gt;</fr-FR>
      </phrase>
      
     
    </sitecore>

  • This is the source of your XML?
    So in that case you would need to process the entities before you open the file for translation. I see it is Sitecore - I have dealt with that format quite a long time ago. If you like, please send me a complete file to jerzy at czopik dot com to create a file type for it.
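
    A minimal sketch of such a pre-processing step, assuming the files look like the extract above: the HTML inside `fr-FR` is escaped once for XML, and its named entities (`&Agrave;`, `&eacute;`, `&nbsp;`) are escaped a second time. The function names are hypothetical, and the markup entities (`&amp;`, `&lt;`, `&gt;`, `&quot;`) are deliberately left alone so the embedded HTML stays intact. Whether Studio then needs any further file type settings is not covered here.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that carry markup meaning inside the embedded HTML; never decode these.
KEEP = {"amp", "lt", "gt", "quot", "apos"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_text_entities(html_text):
    """Replace named HTML entities such as &eacute; with the real character."""
    def repl(match):
        name = match.group(1)
        if name in KEEP or name not in name2codepoint:
            return match.group(0)  # leave markup and unknown entities untouched
        return chr(name2codepoint[name])
    return ENTITY.sub(repl, html_text)


def preprocess(xml_text, tag="fr-FR"):
    """Decode accent entities in every <tag> element and re-serialise the XML."""
    root = ET.fromstring(xml_text)  # the XML parser removes the outer escaping layer
    for elem in root.iter(tag):
        if elem.text:
            elem.text = decode_text_entities(elem.text)
    # Serialising re-escapes <, > and & so the embedded HTML is still escaped once.
    return ET.tostring(root, encoding="unicode")


SAMPLE = (
    '<sitecore><phrase key="Avignon">'
    "<fr-FR>&lt;h2&gt;&amp;Agrave; propos de Shurgard Avignon&lt;/h2&gt;</fr-FR>"
    "</phrase></sitecore>"
)

if __name__ == "__main__":
    print(preprocess(SAMPLE))
```

    Note that `ET.tostring` with `encoding="unicode"` returns a string without an XML declaration, so the file should be written out as UTF-8 explicitly.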


  • Hi  

    If all your files look like this, all you need to do is process the fr-FR element with an embedded content rule, and the HTML filter will handle these entities nicely. Try the attached:

    Margo Van Thienen.zip
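
    To illustrate the idea outside Studio (this is not the attached file type, only a rough Python analogy): the XML parser extracts the content of `fr-FR`, and that content is then handed to an HTML parser, which resolves the entities while separating text from tags. The class and function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class SegmentCollector(HTMLParser):
    """Collects the text between HTML tags, with entities already converted."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append(text)


def translatable_segments(xml_text, tag="fr-FR"):
    """First layer: XML. Second layer: the HTML embedded in each <tag> element."""
    root = ET.fromstring(xml_text)
    collector = SegmentCollector()
    for elem in root.iter(tag):
        collector.feed(elem.text or "")
    collector.close()
    return collector.segments


SAMPLE = (
    "<sitecore><phrase>"
    "<fr-FR>&lt;p&gt;Notre centre de stockage &amp;agrave; Avignon&lt;/p&gt;"
    "&lt;ul&gt;&lt;li&gt;Box avec acc&amp;egrave;s direct&lt;/li&gt;&lt;/ul&gt;</fr-FR>"
    "</phrase></sitecore>"
)

if __name__ == "__main__":
    for segment in translatable_segments(SAMPLE):
        print(segment)
```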

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • BTW, I really wonder what the person with the "bright" idea to name the element after the language code was actually smoking :(
    I hope that "smart guy" will burn in hell forever...

    EDIT:
    Just in case you don't get what I'm talking about...
    Imagine you get source files in English that are supposed to be localized into a dozen languages... so after translation you get a bunch of files in each target language, but all of them still contain the translations in <en-US> elements...
    So one needs to include an extra post-processing step in the process... just because someone back at Sitecore didn't bother to use their brain :-(

  • I guess they're not using a TMS connector to handle the export/import process automatically, so they are down to searching and replacing the language codes... preferably before sending the files out.
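
    A minimal sketch of that search-and-replace step, assuming the language element is a plain element name such as `en-US` with no namespace. The function name is hypothetical; renaming through an XML parser rather than a text replace avoids touching the same string inside attribute values or translated text.

```python
import xml.etree.ElementTree as ET


def rename_language_elements(xml_text, source_tag, target_tag):
    """Rename every <source_tag> element to <target_tag>, keeping its content."""
    root = ET.fromstring(xml_text)
    # Materialise the matches first so renaming cannot affect the iteration.
    for elem in list(root.iter(source_tag)):
        elem.tag = target_tag
    return ET.tostring(root, encoding="unicode")


SAMPLE = '<sitecore><phrase key="Avignon"><en-US>About en-US content</en-US></phrase></sitecore>'

if __name__ == "__main__":
    print(rename_language_elements(SAMPLE, "en-US", "fr-FR"))
```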

    Paul Filkin | RWS Group


  • We've sorted that out off forum.
    The problem was that the HTML parser did not convert the entities to text while reading the file. That was really all. After activating the entity conversion for Added Latin 1 and HTML Special, the problem disappeared.
    From my personal experience, wrongly displayed letters are caused either by a file saved with the wrong encoding or by missing or superfluous entity conversions. In the first case it is enough to re-encode to UTF-8 with BOM (BTW, Studio does not recognize UTF-8 without BOM, why?), and in the second case you need to experiment with the entity settings in your file type.
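
    For the first case, a small sketch of the re-encoding step, assuming the file really is UTF-8 and only lacks the BOM. The function name is hypothetical; the strict decode is there so that a file in some other encoding is rejected instead of being silently mislabelled.

```python
import codecs


def ensure_utf8_bom(data: bytes) -> bytes:
    """Return the bytes with a UTF-8 BOM prepended, unless one is already there."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    # Raises UnicodeDecodeError if the content is not valid UTF-8.
    data.decode("utf-8")
    return codecs.BOM_UTF8 + data


def resave_with_bom(path: str) -> None:
    """Rewrite a file in place so that it starts with a UTF-8 BOM."""
    with open(path, "rb") as handle:
        original = handle.read()
    with open(path, "wb") as handle:
        handle.write(ensure_utf8_bom(original))
```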


  • Hi Jerzy,

    Do you have an example file where Studio doesn't recognise UTF-8 without a BOM?

    Paul

    Paul Filkin | RWS Group


  • Bei kürzerem täglichen Betrieb sind die Wartungszeiten entsprechend anzupassen (relevant sind die angegebenen Abstände in Betriebsstunden).
    Eine Verkürzung der Intervalle kann notwendig sein, wenn die Betriebserfahrungen im konkreten Prozess zeigen, dass die nachfolgend genannten Intervalle zu lang sind bzw. die Reinigungswirkung der Anlage vorzeitig zu stark nachlässt.
    Für die Überwachung der Intervalle ist der Betriebsstundenzähler im Schaltschrank zu beachten.
    Attached is a plain text file in UTF-8 that will not be recognized properly. Of course this is a very simple example of a text with no embedded content. But we get such texts with "tags", so I create a proper file type for them, and then have to resave the file as UTF-8 with BOM, otherwise all non-ASCII characters are corrupted.


  • How would Studio know this is UTF-8, Jerzy? There is no declaration and no BOM. The fact that there are German characters in there means it could be one of many different charsets, and even a best guess could be wrong. I suppose we could guess at UTF-8 as that's a reasonable catch-all, or you may be able to use the setting in Studio to add the BOM if it's not present.
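
    A rough sketch of what such a best guess could look like, written in Python purely as an illustration and not as a description of what Studio does. The function name and the `cp1252` fallback are assumptions. It relies on the fact that text in a legacy single-byte encoding containing accented characters is very rarely valid UTF-8, so a strict decode that succeeds is a strong hint.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess a text encoding: BOM first, then strict UTF-8, then a fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        data.decode("utf-8")  # strict by default
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


if __name__ == "__main__":
    utf8_bytes = "kürzerem".encode("utf-8")
    print(guess_encoding(utf8_bytes))
    # The same bytes also decode without error as cp1252, only wrongly,
    # which is why a reader that assumes ANSI shows corrupted characters.
    print(utf8_bytes.decode("cp1252"))
```

    The guess can still be wrong in the other direction: pure ASCII content is valid in nearly every charset, so nothing can be inferred from it.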

    Paul Filkin | RWS Group


  • But how shall we then deal with such files?

    In my case I resave as UTF-8 with BOM, but never know if that is fine for the customer. Because no one complained yet, I think that is ok...

    But just in case: why does Studio not simply assume it is UTF, instead of using ASCII or ANSI? I think UTF is the most common encoding now, so it would be much easier to assume this and change it if necessary than the other way round. Just my 2 cents.

