XML files containing HTML encodings

Hi,

We encounter a problem when analysing a French XML file in SDL Studio 2017.
We cannot get the French characters with accents right. In the source file they use an encoding for these characters, and in Studio these encoded characters appear as tags.
Do you know a solution for this problem?

Kind regards and thanks in advance,
Margo

  • But how shall we then deal with such files?

    In my case I resave as UTF-8 with BOM, but I never know if that is fine for the customer. Since no one has complained yet, I think that is OK... (a small sketch of that re-save step is below).

    But just in case: why does Studio not simply assume it is UTF-8, but instead use ASCII or ANSI? I think UTF-8 is the most common encoding now, so it would be much easier to assume this and change it if necessary, than the other way round. Just my 2 cents.
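
    A minimal sketch of that re-save step in Python, assuming the source encoding is known (the file names and the windows-1252 source encoding are only examples; "utf-8-sig" writes the UTF-8 BOM):

        # Re-save a file as UTF-8 with BOM (sketch; paths and source encoding are assumptions)
        src = "source.xml"
        dst = "source_utf8.xml"

        # Read with the encoding the file is assumed to have...
        with open(src, "r", encoding="windows-1252") as f:
            text = f.read()

        # ...and write it back out; "utf-8-sig" prepends the BOM (EF BB BF).
        with open(dst, "w", encoding="utf-8-sig") as f:
            f.write(text)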


  • Unknown said:
    How would Studio know this is UTF-8, Jerzy? There is no declaration and no BOM.

    How? Analyze the content, of course. That's how processing of such files was designed and intended... and that's what other tools able to handle such files actually do.

    Yes, it's a pain in the butt and the inventor of this idea should burn in hell forever, but that's how it is.

    EDIT:
    See https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8:
    If there is no BOM or other indication of the encoding, heuristic analysis is often able to reliably determine whether UTF-8 is in use, due to the large number of byte sequences that are invalid in UTF-8. (When text is known not to be UTF-8, determining which legacy encoding was used can be difficult and uncertain. Several free libraries are available to ease the task, such as Mozilla Universal Charset Detector and International Components for Unicode.)

    And I can say that I've been using the auto-detection in the Netscape Suite (now SeaMonkey) since the late '90s, and it has always detected the charset reliably.
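
    In code terms that first check is trivial; a minimal Python sketch of the "is this valid UTF-8?" test (the file name is just an example):

        def looks_like_utf8(raw: bytes) -> bool:
            # Bytes that decode cleanly as UTF-8 are almost certainly UTF-8,
            # because legacy-encoded text rarely happens to form only valid UTF-8 sequences.
            try:
                raw.decode("utf-8")
                return True
            except UnicodeDecodeError:
                return False

        with open("source.xml", "rb") as f:   # hypothetical file
            print(looks_like_utf8(f.read()))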

  • Unknown said:
    How? Analyze the content, of course. That's how processing of such files was designed and intended... and that's what other tools able to handle such files actually do.

    Show me an application that always gets this right and I'll agree with you, Evzen. Note that I'm trying to be polite here and would like to reach a reasonable conclusion without being aggressive. UTF-8 is not always the right answer for a file that has no declaration and no BOM. We could try to guess, and we could assume UTF-8 (as I already said), so these are interesting things to discuss with the development team.

    The problem is always that if the original file was not intended to be UTF-8 but some other encoding, then it may be impossible to guess correctly even if we did analyse the file. I know most of the time UTF-8 would be the right answer, but not always... so maybe just adding the BOM if it is not present (using this option in the filetype settings) would do the trick anyway?
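
    Just to illustrate the idea, a minimal sketch of "add the UTF-8 BOM if it is missing" (the path is hypothetical, and it assumes the bytes really are UTF-8 already):

        BOM = b"\xef\xbb\xbf"
        path = "source.xml"   # hypothetical path; content assumed to be UTF-8 without a BOM

        with open(path, "rb") as f:
            data = f.read()

        if not data.startswith(BOM):
            # Prepend the BOM so tools that rely on it recognise the file as UTF-8.
            with open(path, "wb") as f:
                f.write(BOM + data)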

    Paul Filkin | RWS Group


  • I don't know if there is an application which always gets it right.
    But there are many with a very high success rate. And if they fail, they give the user an option to set the encoding manually... And the change is applied on the fly, of course.
    Notepad++, PSPad, this Mozilla framework... all of these do a very good job. And they are all free, BTW... so if a free tool can do it...

    But please, if you start working on it, do it properly. Please. No half-baked "solutions" like the one you suggested (a checkbox "assume UTF-8", or something).
    Yes, checking for possible UTF-8 without a BOM would be a good first check in the detection routine nowadays, and I guess it would suffice for the majority of files these days. But the routine must continue the heuristic analysis further if this check fails.
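
    Roughly what such a routine could look like, sketched in Python (chardet is the Python port of the Mozilla detector quoted above; the 0.7 confidence threshold and the file name are only assumptions):

        import chardet  # Python port of the Mozilla Universal Charset Detector

        BOM = b"\xef\xbb\xbf"

        def guess_encoding(raw: bytes) -> str | None:
            # 1. A BOM settles it.
            if raw.startswith(BOM):
                return "utf-8-sig"
            # 2. Decodes cleanly as UTF-8 -> almost certainly UTF-8.
            try:
                raw.decode("utf-8")
                return "utf-8"
            except UnicodeDecodeError:
                pass
            # 3. Otherwise continue with heuristic analysis.
            result = chardet.detect(raw)
            if result["encoding"] and result["confidence"] >= 0.7:
                return result["encoding"]
            # 4. Detection failed -> let the user pick the encoding manually.
            return None

        with open("source.xml", "rb") as f:   # hypothetical file
            print(guess_encoding(f.read()))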

  • Unknown said:
    But please, if you start working on it, do it properly. Please. No half-baked "solutions" like the one you suggested (a checkbox "assume UTF-8", or something).
    Yes, checking for possible UTF-8 without a BOM would be a good first check in the detection routine nowadays, and I guess it would suffice for the majority of files these days. But the routine must continue the heuristic analysis further if this check fails.

    Thanks for the tip on SeaMonkey. I didn't try that, but it did make me look for a tool that uses the same universal charset detection library. I found this one and must admit it's pretty impressive:

    https://encodingchecker.codeplex.com/

    On the whole, Studio didn't have a problem with any of the contrived files I created unless I deliberately did something really silly. I guess we already do some kind of analysis on the file, but I don't really know the details... I will find out though. Ultimately the decision on what's needed will be made by development and not by me.

    On this topic, I know we do need the ability to apply an encoding to the files you select in a project, and also to the saved target files. You can do this for single files but not for a whole project. This, I believe, will get worked on. Until then, this is also a handy tool to have in your armoury:

    http://appstore.sdl.com/language/app/fec-file-encoding-converter/788/

    Paul Filkin | RWS Group
