Unicode character in UTF-8 (with BOM) encoded TXT disappears in Trados 2014

Question

Hi everyone, 
 I am new to the forum, and I hope you are able to help me with a technical issue. 
 I just loaded a .txt file in Trados like the one in the screenshot below (as visualized with Notepad++):

Unfortunately, Trados doesn't read the black US character, and the text appears as follows: 
 ~GANTT_COLOR_DELIMITER_DAYdescription\~ ~GANTT_COLOR_DELIMITER_DAYdisplayname\~ ~GANTT_COLOR_DELIMITER_HOURdescription\~ ~GANTT_COLOR_DELIMITER_HOURdisplayname\~ 
 This creates problems when I save the target file (as .txt with UTF-8 encoding): the US character is gone, including when I open the file in Notepad++. 
 Is there something I should do to display the character correctly in Trados and not lose it in the target file? 
 I would appreciate very much any hint or help. 
 Best, 
 Annalisa

Paul · Accepted Answer

You are absolutely right Evzen. I opened the file, saw the name of the font in Word, and foolishly assumed it must have been already available. 
 I think perhaps the Hanacaraka font might not be fully Unicode-compliant... at least I'm assuming this because Studio can't see it. I will raise it with support and have it investigated. I also installed Tuladha Jejeg and wrote a few Javanese letters (I think) into your Word file with this font. When I open this in Studio I see this: 
 
 If I did this correctly then perhaps Tuladha Jejeg is a better option?

Paul · Answer

Hi Annalisa Murara,

I have looked at this file with the development team as I had the same problem as you and could not make this work with Studio 2017 either. It's an interesting problem which was explained to me as follows.

The FileTypeSupport.Framework removes any character which is NOT one of the following: (Regex) "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u10000-\u10FFFF]"

The reason for this is that any character which is not described in the above ranges is INVALID in XML. Thus, they are filtered out by the Framework, as we use XML for the SDLXLIFF and the Translation Memory content. Investigations we have carried out in the past have shown that the Translation Memory breaks in Studio when certain UTF-16 Unicode surrogate pairs are allowed (such as the ones you have in your file).
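To make the effect of that filter concrete, here is a minimal Python sketch of the same idea. It is not Studio's actual code: the pattern simply mirrors the XML 1.0 character ranges quoted above, the function name and sample string are made up for illustration, and it assumes the black "US" shown by Notepad++ is the Unit Separator control character (U+001F).

```python
import re

# Matches anything outside the character ranges that XML 1.0 allows.
# Assumption: this mirrors the ranges quoted above, not Studio's real implementation.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Drop every character that is not valid in XML 1.0 (hypothetical helper)."""
    return INVALID_XML_CHARS.sub("", text)


# Hypothetical sample: a key and a value separated by U+001F (Unit Separator).
sample = "KEY\x1fvalue"
print(repr(strip_invalid_xml_chars(sample)))
```

U+001F lies below `\x20` and is not a tab, line feed, or carriage return, so a filter like this removes it silently, which would match what you see in the editor and in the saved target file.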

So for the time being my suggestion would be to search and replace these characters with something recognisable that you can swap back later. You could use a tab character, for example, or invent a unique tag like this:

<US>

If they are brought into the editor and not converted to structure anyway (as your example does not look like translatable text to me), then you could convert the <US> into a Studio tag to make them easier to handle, and they'll be easy to find in the target file later on.
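If you would rather script the search and replace than do it by hand, something like the following Python sketch might work. It assumes the character is U+001F and uses the `<US>` tag suggested above as the placeholder; the function names are only illustrative, and `utf-8-sig` is used so the BOM is read and written back.

```python
from pathlib import Path
from typing import Callable

US = "\x1f"           # assumption: the black "US" in Notepad++ is Unit Separator
PLACEHOLDER = "<US>"  # the stand-in tag suggested above


def protect(text: str) -> str:
    """Swap each US character for the placeholder before opening the file in Studio."""
    if PLACEHOLDER in text:
        # Avoid an ambiguous round trip if the tag already occurs in the text.
        raise ValueError("placeholder already present in the source text")
    return text.replace(US, PLACEHOLDER)


def restore(text: str) -> str:
    """Put the US characters back into the target file saved from Studio."""
    return text.replace(PLACEHOLDER, US)


def convert_file(src: Path, dst: Path, transform: Callable[[str], str]) -> None:
    """Apply protect or restore to a UTF-8 (with BOM) file, keeping line endings."""
    with open(src, encoding="utf-8-sig", newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8-sig", newline="") as f:
        f.write(transform(text))
```

You would run `convert_file(source, prepared, protect)` before creating the project and `convert_file(target, final, restore)` on the file you save from Studio; the paths here are whatever you choose.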

Maybe someone has a better idea to resolve this but for now the important thing is to note that you can only deal with this using a workaround.

Regards

Paul

Paul Filkin | RWS Group

