Unicode character in UTF-8 (with BOM) encoded TXT disappears in Trados 2014

Hi everyone, 

I am new to the forum, and I hope you are able to help me with a technical issue.

I just loaded a .txt file in Trados like the one in the screenshot below (as visualized with Notepad++):

 

 

Unfortunately, Trados doesn't read the black US character, and the text appears as follows:

~GANTT_COLOR_DELIMITER_DAYdescription\~
~GANTT_COLOR_DELIMITER_DAYdisplayname\~
~GANTT_COLOR_DELIMITER_HOURdescription\~
~GANTT_COLOR_DELIMITER_HOURdisplayname\~

This creates problems when I save the target file (as .txt with UTF-8 encoding): the US character is gone there too, including when I open the file with Notepad++.

Is there something I should do to display the character correctly in Trados and avoid losing it in the target file?

I would very much appreciate any hint or help.

Best,

Annalisa

  • Hi Annalisa,

    Can we see an example file to test? You can attach it here or email it to pfilkin@sdl.com, if it's acceptable to share the file.

    Regards

    Paul

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • Great! Yes, I just sent you the file.

    Best regards,

    Annalisa
  • Hi Annalisa,

    I have looked at this file with the development team as I had the same problem as you and could not make this work with Studio 2017 either. It's an interesting problem which was explained to me as follows.

    The FileTypeSupport.Framework removes any character that matches the following regex, that is, any character NOT in the listed ranges: "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u10000-\u10FFFF]"

    The reason for this is that any character outside the above ranges is INVALID in XML. Such characters are filtered out by the Framework because we use XML for the SDLXLIFF and the Translation Memory content. Investigations we have carried out in the past have shown that the Translation Memory breaks in Studio when certain UTF-16 Unicode surrogate pairs are allowed. The US character in your file (Unit Separator, U+001F) is a control character below \x20, so it falls outside the permitted ranges and is removed as well.
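    To see the effect of such a filter outside Studio, here is a minimal Python sketch. It is a translation of the pattern quoted above, not Studio's actual code, and the function name and the position of the US character in the sample string are assumptions made for illustration.

```python
import re

# Python translation of the quoted pattern (an assumption, not Studio's source).
# Python's re module needs \UXXXXXXXX for code points above U+FFFF.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove every character that falls outside the allowed ranges."""
    return INVALID_XML_CHARS.sub("", text)


# Hypothetical sample: a US (U+001F) is assumed to sit between the two words.
sample = "~GANTT_COLOR_DELIMITER_DAY\x1fdescription\\~"
print(strip_invalid_xml_chars(sample))
```

    Tab, line feed and carriage return survive the filter, while every other control character below \x20 is dropped without any warning.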

    So for the time being my suggestion would be to search and replace these characters with something recognisable that you can replace later. So you could use a tab character for example, or maybe invent a unique tag like this perhaps:

    <US>

    If they are brought into the editor and not converted to structure anyway (as your example does not look like translatable text to me), then you could convert the <US> into a Studio tag to make them easier to handle, and they'll be easy to find in the target file later on.

    Maybe someone has a better idea to resolve this but for now the important thing is to note that you can only deal with this using a workaround.
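    The search-and-replace workaround could be scripted along these lines. This is only a sketch under stated assumptions: the function names are hypothetical, the <US> placeholder is the one suggested above, and the "utf-8-sig" codec is used on the assumption that the file should keep its BOM.

```python
US = "\x1f"           # Unit Separator, the character Studio drops
PLACEHOLDER = "<US>"  # visible stand-in suggested above


def protect(text: str) -> str:
    """Swap each US character for the placeholder before importing into Studio."""
    if PLACEHOLDER in text:
        # Refuse to continue if the placeholder would be ambiguous on the way back.
        raise ValueError("placeholder already occurs in the text")
    return text.replace(US, PLACEHOLDER)


def restore(text: str) -> str:
    """Put the US characters back into the exported target text."""
    return text.replace(PLACEHOLDER, US)


def convert_file(src: str, dst: str, transform) -> None:
    """Apply `transform` to a UTF-8 file, keeping the BOM and the line endings."""
    # "utf-8-sig" strips a BOM when reading and writes one when saving;
    # newline="" stops Python from translating the line endings.
    with open(src, encoding="utf-8-sig", newline="") as handle:
        text = handle.read()
    with open(dst, "w", encoding="utf-8-sig", newline="") as handle:
        handle.write(transform(text))
```

    You would run convert_file(source, prepared, protect) before creating the project and convert_file(target, final, restore) on the saved target file. The file names here are placeholders for your own paths.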

    Regards

    Paul

    Paul Filkin | RWS Group


  • Hello,

    I would like to join this discussion, as I have a similar question about processing "unidentified" characters in SDL Trados.

    I am experimenting with opening a file that contains Javanese characters (font name: Hanacaraka), but Trados couldn't read the text properly. I want to know if there is a way to open the file so that the Javanese characters are shown.

    In the Options menu, under Font Adaptation, I have set the Custom Language Font as: Language = Javanese (Indonesian), Font = Hanacaraka. But the Javanese characters still appear as Latin characters.

    Is it because the character/font is not supported by Trados yet? Is there any way to process these characters? Thank you.

    tes-jawa.docx

    Fonts.zip

  • Hi,

    If that Word file is supposed to be the source then I think that would be the problem, as it's also in Latin characters. If not, can we see the source?

    Thanks

    Paul Filkin | RWS Group


  • It displays fine here in my Word 2010... after installing the fonts...

  • You are absolutely right Evzen.  I opened the file, saw the name of the font in Word, and foolishly assumed it must have been already available.

    I think perhaps the Hanacaraka font might not be fully Unicode-compliant... at least I'm assuming this because Studio can't see it. I will raise it with support and have it investigated. I also installed Tuladha Jejeg and wrote a few Javanese letters (I think) into your Word file with this font. When I open this in Studio I see this:

    If I did this correctly then perhaps Tuladha Jejeg is a better option?
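    One way to check whether a text really contains Javanese characters, rather than Latin code points that a font merely draws as Javanese glyphs, is to test the code points against the Unicode Javanese block (U+A980 to U+A9DF). The sketch below is only a diagnostic idea and the function name is hypothetical. It says nothing about how Studio itself handles fonts.

```python
# Unicode Javanese block: U+A980 to U+A9DF inclusive.
JAVANESE_BLOCK = range(0xA980, 0xA9E0)


def javanese_ratio(text: str) -> float:
    """Return the share of non-whitespace characters in the Javanese block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    in_block = sum(ord(ch) in JAVANESE_BLOCK for ch in chars)
    return in_block / len(chars)


# Two code points taken from the Javanese block, written as escapes.
print(javanese_ratio("\uA9B2\uA9A4"))
# Plain Latin letters, as a font that remaps Latin code points would store them.
print(javanese_ratio("hanacaraka"))
```

    A ratio near zero for text that looks Javanese in Word would suggest that the font is doing the work and the underlying characters are Latin, which would fit what Studio displays.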

    Paul Filkin | RWS Group


  • The very last character looks suspicious... it looks very similar to an (incorrectly displayed) character discussed a while ago by someone else... (can't remember the details and the forum search terribly sucks).
    I wonder if someone who is actually able to read it can confirm whether it displays correctly or not ;-)
  • Dear all,

    Awesome! It works. Tuladha Jejeg is not only a better option, but it is the best option :)

    The only problem (for me, personally) is that the character shapes are different from basic Hanacaraka. I think I need to 're-adjust' my writing style :)

    Thank you so much.