XML files containing HTML encodings

Hi,

We have run into a problem when analysing a French XML file in SDL Studio 2017.
We cannot get the accented French characters right. The source file encodes these characters as entities, and in Studio these entities appear as tags.
Do you know a solution for this problem?

Kind regards and thanks in advance,
Margo

  • Check the entity processing in your HTML parser. This can be the reason for wrong characters. Or please post here an extract of your XML with the corresponding text with those French letters.

    _________________________________________________________

    When asking for help here, please be as accurate as possible. Please always remember to give the exact version of product used and all possible error messages received. The better you describe your problem, the better help you will get.

    Want to learn more about Trados Studio? Visit the Community Hub. Have a good idea to make Trados Studio better? Publish it here.

  • Hi Jerzy,

    Thanks for your reaction.

    I already tried several HTML entity settings, but all in vain up to now.

    Please find an extract below.

    Kind regards,

    Margo

     

     

    <sitecore>
     
      <phrase path="/sitecore/content/Sites/France/Home/Self storage in France/Avignon/Avignon" key="Avignon" itemid="{B48AB076-9DF9-4DA3-A130-50E3DD8B3E2D}" fieldid="General information" updated="20170626T083947Z">
        <fr-FR>&lt;h2&gt;&amp;Agrave; propos de Shurgard Avignon&lt;/h2&gt;
    &lt;p&gt;Notre centre de stockage &amp;agrave; Avignon a r&amp;eacute;cemment &amp;eacute;t&amp;eacute; renouvel&amp;eacute;. Le site offre toutes les fonctionnalit&amp;eacute;s n&amp;eacute;cessaires pour r&amp;eacute;pondre &amp;agrave; vos besoins.&amp;nbsp;&lt;/p&gt;
    &lt;ul&gt;
        &lt;li&gt;672 espaces de stockage au rez-de-chauss&amp;eacute; ou &amp;agrave; l'&amp;eacute;tage.&lt;/li&gt;
        &lt;li&gt;Box de stockage avec acc&amp;egrave;s direct permettant de vous garer devant la porte.&lt;/li&gt;
        &lt;li&gt;Ascenseur et chariots pour d&amp;eacute;placer facilement vos affaires.&lt;/li&gt;
        &lt;li&gt;Un acc&amp;egrave;s direct int&amp;eacute;rieur qui peut accueillir tout type de v&amp;eacute;hicule.&lt;/li&gt;
        &lt;li&gt;Des places de parking disponibles juste devant l'accueil.&lt;/li&gt;
    &lt;/ul&gt;
    &lt;p&gt;Informations pour acc&amp;eacute;der au centre de stockage&amp;nbsp;:&lt;/p&gt;
    &lt;ul&gt;
        &lt;li&gt;Shurgard est situ&amp;eacute; proche du centre commercial Cap Sud.&lt;/li&gt;
        &lt;li&gt;Par la route de Marseille, prenez le rond-point du Lac de Saint Chamand.&lt;/li&gt;
        &lt;li&gt;2 arr&amp;ecirc;ts de bus en direction du centre-ville d'Avignon ou de Montfavet.&lt;/li&gt;
    &lt;/ul&gt;</fr-FR>
      </phrase>
      
     
    </sitecore>

  • Hi  

    If all your files look like this, all you need to do is process the fr-FR element with an embedded content rule, and the HTML filter will handle these entities nicely. Try the attached:

    Margo Van Thienen.zip
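As a rough illustration of why the embedded content approach works (this is a plain Python sketch, not what Studio does internally): the extract is encoded twice. The XML layer turns `&lt;` and `&amp;` back into `<` and `&`, which leaves an HTML fragment that still contains references such as `&Agrave;`; a second, HTML-level pass resolves those. The function names and the shortened sample are assumptions made for the example.

```python
import html
import xml.etree.ElementTree as ET

# Shortened version of the extract posted above.
SAMPLE = """<sitecore>
  <phrase key="Avignon">
    <fr-FR>&lt;h2&gt;&amp;Agrave; propos de Shurgard Avignon&lt;/h2&gt;</fr-FR>
  </phrase>
</sitecore>"""


def embedded_fragments(xml_text, element_name="fr-FR"):
    """Return the text of every matching element after XML-level decoding."""
    root = ET.fromstring(xml_text)
    return [el.text or "" for el in root.iter(element_name)]


def decode_html_entities(fragment):
    """Resolve HTML character references such as &Agrave; or &nbsp;."""
    return html.unescape(fragment)


fragments = embedded_fragments(SAMPLE)          # first pass: XML layer
decoded = [decode_html_entities(f) for f in fragments]  # second pass: HTML layer

print(fragments[0])
print(decoded[0])
```

The first print still shows `&Agrave;` inside the `<h2>` markup; the second shows the accented letter.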

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • BTW, I really wonder what was the person with the "bright" idea to name the element by the language abbreviation actually smoking :(
    I hope that "smart guy" will burn in hell forever...

    EDIT:
    Just in case you don't get what I'm talking about...
    Imagine you get source files in English that are to be localized into a dozen languages... After translation you get a bunch of files in each target language, but all of them still contain the translations in <en-US> elements...
    So one needs to include an extra post-processing step in the process... just because someone back at Sitecore didn't bother to use their brain :-(

  • I guess they're not using a TMS connector to handle the export/import process automatically, so are down to search & replacing the language codes... preferably before sending them out.
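For what it is worth, the renaming step can be scripted instead of done by hand with search & replace. The following is only a sketch under the assumption that the language code appears as an element name exactly as in the extract above; `rename_language_elements` is a made-up helper, not part of any Sitecore or Studio tooling.

```python
import xml.etree.ElementTree as ET


def rename_language_elements(xml_text, source_code, target_code):
    """Rename every <source_code> element to <target_code>, keeping its content."""
    root = ET.fromstring(xml_text)
    # Collect first, then rename, so the tag filter is not affected mid-iteration.
    for element in list(root.iter(source_code)):
        element.tag = target_code
    return ET.tostring(root, encoding="unicode")


example = '<sitecore><phrase key="a"><en-US>Hello</en-US></phrase></sitecore>'
renamed = rename_language_elements(example, "en-US", "fr-FR")
print(renamed)
```

Parsing the XML rather than doing a textual replace avoids touching a string like "en-US" that happens to occur inside the translatable text.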

    Paul Filkin | RWS Group


  • We've sorted that out off forum.
    The problem was that the HTML parser did not convert the entities to text while reading the file. That was really all. Once entity conversion was activated for Added Latin 1 and HTML Special, the problem disappeared.
    In my experience, letters showing up wrong have two usual causes: the file was saved with the wrong encoding, or some entity conversions are missing or superfluous. In the first case it is enough to re-save the file as UTF-8 with BOM (BTW, Studio does not recognize UTF-8 without BOM, why?); in the second case one needs to play with the entity settings in the file type.
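The re-save as UTF-8 with BOM can also be done outside an editor. A minimal sketch, assuming the file really is UTF-8 already; `ensure_utf8_bom` is a hypothetical helper name. It refuses to add a BOM to bytes that are not valid UTF-8, so a file in a legacy encoding does not get mislabelled.

```python
import codecs


def ensure_utf8_bom(data: bytes) -> bytes:
    """Return data with a UTF-8 BOM prepended, unless one is already there.

    Raises UnicodeDecodeError if data is not valid UTF-8.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data
    data.decode("utf-8")  # validation only; the result is discarded
    return codecs.BOM_UTF8 + data


without_bom = "für".encode("utf-8")
with_bom = ensure_utf8_bom(without_bom)
print(with_bom[:3] == codecs.BOM_UTF8)
```

Calling the function a second time on its own output changes nothing, so it is safe to run over files that may or may not have a BOM.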


  • Hi Jerzy,

    Do you have an example file where Studio doesn't recognise UTF-8 without a BOM?

    Paul

    Paul Filkin | RWS Group


  • Bei kürzerem täglichen Betrieb sind die Wartungszeiten entsprechend anzupassen (relevant sind die angegebenen Abstände in Betriebsstunden).
    Eine Verkürzung der Intervalle kann notwendig sein, wenn die Betriebserfahrungen im konkreten Prozess zeigen, dass die nachfolgend genannten Intervalle zu lang sind bzw. die Reinigungswirkung der Anlage vorzeitig zu stark nachlässt.
    Für die Überwachung der Intervalle ist der Betriebsstundenzähler im Schaltschrank zu beachten.
    It is attached: a plain-text file in UTF-8 will not be recognized properly. Of course this is a very simple example of a text with no embedded content. But we get such texts with "tags", so I create a proper file type for them, but I then have to re-save the file as UTF-8 with BOM, otherwise all non-ASCII characters are corrupted.
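The corruption described here is what one would expect if UTF-8 bytes are read with a single-byte "ANSI" code page. The sketch below assumes Windows-1252 as that code page, which is a guess for illustration and not a statement about what Studio actually falls back to.

```python
sample = "Für die Überwachung der Intervalle"

raw = sample.encode("utf-8")       # bytes on disk, no BOM
misread = raw.decode("cp1252")     # each UTF-8 byte shown as its own character

print(misread)

# The damage is reversible as long as nothing was saved in between.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == sample)
```

Every umlaut takes two bytes in UTF-8, so each one shows up as two unrelated characters, for example "ü" as "Ã¼".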


  • How would Studio know this is UTF-8, Jerzy? There is no declaration and no BOM. The fact that there are German chars in there means it could be one of many different charsets, and even a best guess could be wrong. I suppose we could guess at UTF-8, as that's a reasonable catch-all, or you may be able to use the setting in Studio to add the BOM if it's not present.
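To make the ambiguity concrete, here is a small sketch with a single-byte legacy file. The three code pages are picked only for illustration; all of them decode the same bytes without any error, so nothing in the bytes themselves says which reading was intended.

```python
# "für" saved in a single-byte Windows code page: b"f\xfcr"
raw = "für".encode("cp1252")

candidates = ("cp1252", "cp1251", "iso8859_7")
readings = {encoding: raw.decode(encoding) for encoding in candidates}

for encoding, text in readings.items():
    print(encoding, text)
```

Western European, Cyrillic and Greek each give a different middle letter, and only knowledge of the language (or a statistical guess) can pick the right one.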

    Paul Filkin | RWS Group


  • But how shall we then deal with such files?

    In my case I re-save as UTF-8 with BOM, but I never know whether that is fine for the customer. Because no one has complained yet, I think that is OK...

    But just in case: why does Studio not simply assume the file is UTF-8, instead of using ASCII or ANSI? I think UTF-8 is the most common encoding now, so it would be much easier to assume it and change if necessary than the other way round. Just my 2 cents.


  • Unknown said:
    How would Studio know this is UTF-8 Jerzy? There is no declaration and no BOM.

    How? Analyze the content, of course. That's how processing of such files was designed and intended... and that's what other tools able to handle such files actually do.

    Yes, it's a pain in the butt and the inventor of this idea should burn in hell forever, but that's how it is.

    EDIT:
    See https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8:
    If there is no BOM or other indication of the encoding, heuristic analysis is often able to reliably determine whether UTF-8 is in use due to the large number of byte sequences that are invalid in UTF-8. (When text is known to not be UTF-8, determining which legacy encoding can be difficult and uncertain. Several free libraries are available to ease the task, such as Mozilla Universal Charset Detector and International Components for Unicode.)

    And I can say that I've been using the auto-detection in Netscape Suite (now SeaMonkey) since the late '90s and it has always detected the charset reliably.
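The point about invalid byte sequences can be checked with a strict decode. This is only a sketch of the simplest possible test, not the algorithm used by any of the tools mentioned; `looks_like_utf8` is a made-up name.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data decodes as UTF-8 without a single invalid sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "Für die Überwachung der Intervalle"

print(looks_like_utf8(text.encode("utf-8")))    # genuine UTF-8
print(looks_like_utf8(text.encode("cp1252")))   # legacy single-byte file
```

Text with accented letters saved in a legacy code page almost never passes this check by accident, because a byte above 0x7F must be followed by a correctly formed continuation byte. Pure ASCII passes too, which is harmless since ASCII is a subset of UTF-8.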

  • Unknown said:
    How? Analyze the content, of course. That's how processing of such files was designed and intended... and that's what other tools able to handle such files actually do.

    Show me an application that always gets this right and I'll agree with you Evzen.  Note that I'm trying to be polite here and would like to find a reasonable conclusion without being aggressive.  UTF-8 is not always the right answer to a file that has no declaration and no BOM.  We could try to guess and we could assume UTF-8 (as I already said), so these are interesting things to discuss with the development team.

    The problem is always that if the original file was not intended to be UTF-8 but some other encoding then it may be impossible to guess correctly even if we did analyse the file.  I know most of the time UTF-8 would be the right answer but not always... so maybe just adding the BOM if not present (using this option in the filetype settings) would do the trick anyway?

    Paul Filkin | RWS Group


  • I don't know if there is an application which always gets it right.
    But there are many with a very high success rate. And if they fail, they provide the user with options to set the encoding manually... And the change is applied on the fly, of course.
    Notepad++, PSPad, this Mozilla framework... all these do a very good job. And they are all free, BTW... so if a free tool can do it...

    But please, if you start working on it, do it properly. Please. No half-baked "solutions" like the one you suggested (checkbox "assume UTF-8", or something).
    Yes, checking for possible UTF-8 without BOM would be a good first check in the detection routine nowadays, and I guess it would suffice for the majority of files these days. But the routine must continue the heuristic analysis further if this check fails.
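A layered routine of the kind described might be shaped like the sketch below. Everything in it is an assumption for illustration: the function name, the `override` parameter standing in for a manual user setting, and above all the last stage, which simply tries a list of candidate code pages in order. A real detector would replace that stage with statistical analysis of the content.

```python
import codecs

# UTF-32 is listed before UTF-16 because the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_encoding(data, fallbacks=("cp1252",), override=None):
    """Return a codec name for data, or None if no stage accepts it."""
    if override:                      # manual choice always wins
        return override
    for bom, name in BOMS:            # stage 1: explicit BOM
        if data.startswith(bom):
            return name
    try:                              # stage 2: strict UTF-8 check
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    for name in fallbacks:            # stage 3: placeholder for real heuristics
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


print(detect_encoding("für".encode("utf-8")))
print(detect_encoding("für".encode("cp1252")))
```

The order matters: the cheap and reliable checks come first, and the uncertain guess only runs when they fail.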

  • Unknown said:
    But please, if you start working on it, do it properly. Please. No half-baked "solutions" like the one you suggested (checkbox "assume UTF-8", or something).
    Yes, checking for possible UTF-8 without BOM would be a good first check in the detection routine nowadays, and I guess it would suffice for the majority of files these days. But the routine must continue the heuristic analysis further if this check fails.

    Thanks for the tip on SeaMonkey. I didn't try that, but it did make me look for a tool that uses the same universal charset detection library. I found this one and must admit it's pretty impressive:

    https://encodingchecker.codeplex.com/

    On the whole, Studio didn't have a problem with any of the contrived files I created unless I deliberately did something really silly. I guess we already do some kind of analysis on the file, but I don't really know the details... I will find out though. Ultimately the decision on what's needed will rest with development, not me.

    On this topic I know we do need the ability to apply encoding on files you select in a project, and also on the saved target files.  You can do this for single files but not for a whole project.  This I believe will get worked on.  Until then this is also a handy tool to have in your armoury:
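Until something like that exists, a whole folder can be re-encoded with a few lines of script. This is a sketch only, with made-up names and defaults; it is not how the converter linked below works. The default source codec `utf-8-sig` accepts UTF-8 both with and without a BOM, so running the function twice does not stack BOMs.

```python
from pathlib import Path


def reencode_folder(folder, pattern="*.txt", source="utf-8-sig", target="utf-8-sig"):
    """Re-encode every matching file in folder in place; return the file names."""
    converted = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_bytes().decode(source)   # fails loudly on a wrong source
        path.write_bytes(text.encode(target))
        converted.append(path.name)
    return converted
```

For example, `reencode_folder("target_files", source="utf-8-sig", target="utf-8")` would strip the BOM again from every `.txt` file in a (hypothetical) `target_files` folder before delivery to a customer who does not want it.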

    http://appstore.sdl.com/language/app/fec-file-encoding-converter/788/

    Paul Filkin | RWS Group
