Entity handling in SDL Trados Studio: you cannot read all entities into the translation interface as actual characters and also write them out as actual characters.

I have tried to get guidance from SDL Enterprise support, but so far they have mostly offered workarounds, none of which really makes sense.


1. The default HTML filter reads HTML entities such as &reg; into the Editor (translation interface) as tags. I do not understand why this is the default behavior: ® is just another normal character, most TMs store the actual ® character, and so the tag causes TM mismatches.

2. Regardless of whether I turn off Enable entity conversion in the HTML 5 filter or uncheck Numeric and Special Graphic > apos, Studio always writes ' as &apos; in the translated target file. &apos; is not officially supported in older browsers or some email clients and thus may not display correctly, yet I cannot get Studio to just write it out as '.

3. Considering most HTML is now Unicode (UTF-8), I don't understand the need for entities at all, except for characters that are specifically not allowed. Yet according to Studio help (http://producthelp.sdl.com/SDL_Trados_Studio_2015/client_en/HTML_Entities.htm):

If entity conversion is enabled, Studio converts all the character entity references (or numeric entity reference) listed under Entity Mappings that it finds in HTML documents to their character representations. Before writing the target file, Studio converts these characters back to their character entity form.
For example, if entity conversion is enabled, the character entity reference &amp; in the source file will be displayed as the character & in the Editor. Any occurrence of & will be written as &amp;.
If a character entity is not selected for conversion, the character entity, rather than the character, is used in the Editor.

Ideally, we would ALWAYS want to see the actual characters in the Editor (translation interface) and store them that way in the TM, so I am tempted to turn on entity conversion for ALL supported entities, because I never know which entities the source files will contain. However, I would like to write them out as the actual characters in all cases, except where a character is actually not supported. That does not seem to be possible. I am stuck with a dilemma: either have tags representing normal characters like ® in the Editor, or have them written out as &reg; in the target file. These should be two separate settings: one for how entities are read in and one for how they are written out in the translated file.
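To make the behavior I am asking for concrete, here is a minimal Python sketch. It is not Studio's API or how the HTML filter is implemented; `read_segment` and `write_segment` are hypothetical names, and it assumes the input is a plain text segment with the markup already stripped. It uses only the standard `html` module: every entity is decoded on the way in, and only &, < and > are escaped on the way out.

```python
import html


def read_segment(source_text: str) -> str:
    """Decode every character reference into the actual character (editor/TM view)."""
    return html.unescape(source_text)


def write_segment(target_text: str) -> str:
    """Escape only &, < and >; every other character is written out literally."""
    return html.escape(target_text, quote=False)


# Hypothetical source segment mixing several named entities.
source = "Acme&reg; Tools &amp; Parts &copy; 2015, it&apos;s &lt;new&gt;"

in_editor = read_segment(source)    # what the translator and the TM would see
written = write_segment(in_editor)  # what would land in the target file

print(in_editor)
print(written)
```

With this split, ® and ' appear as normal characters in the editor and stay literal in the output, while & and < are still escaped because HTML text content cannot hold them as-is.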

  • After a conversation with support, it does seem that you can choose how to read and write entities individually via Options > File Types > HTML 5 > Entities > Advanced..., but only for lt, gt, quot, apos and amp. This also explains why I was previously unable to turn off entity conversion for apos in the normal table under the Numeric and Special Graphic section: apos is listed there, but it is actually controlled by the Advanced... setting. Why include it in both places if only one setting is actually read?

    The option to read the entity in as the actual character into the Editor (translation interface) and write it out as I choose should be available for all entities, don't you think?
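As a rough illustration of what "separate read and write settings for every entity" could look like, here is a Python sketch. Everything in it is an assumption for illustration: `EntityRule`, `RULES`, `read_in` and `write_out` are hypothetical names, the rule values are just examples, and none of this reflects how Studio stores its Advanced... settings. Entities without a rule, or with reading disabled, are kept as "tag" tokens, which is roughly what the Editor shows today.

```python
import html
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityRule:
    read_as_char: bool     # show the character in the editor instead of a tag
    write_as_entity: bool  # emit the named entity for this character on output


# Hypothetical per-entity settings; any entity not listed stays a tag.
RULES = {
    "reg": EntityRule(read_as_char=True, write_as_entity=False),
    "apos": EntityRule(read_as_char=True, write_as_entity=False),
    "amp": EntityRule(read_as_char=True, write_as_entity=True),
    "lt": EntityRule(read_as_char=True, write_as_entity=True),
    "gt": EntityRule(read_as_char=True, write_as_entity=True),
    "nbsp": EntityRule(read_as_char=False, write_as_entity=True),
}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Characters that must be written back as a named entity.
WRITE_MAP = {
    html.unescape(f"&{name};"): f"&{name};"
    for name, rule in RULES.items()
    if rule.write_as_entity
}


def read_in(text):
    """Split text into ("text", str) and ("tag", str) tokens using the read rules."""
    tokens, pos = [], 0
    for match in ENTITY.finditer(text):
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        rule = RULES.get(match.group(1))
        if rule is not None and rule.read_as_char:
            tokens.append(("text", html.unescape(match.group(0))))
        else:
            tokens.append(("tag", match.group(0)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


def write_out(tokens):
    """Serialize tokens: tags pass through unchanged, text follows the write rules."""
    parts = []
    for kind, value in tokens:
        if kind == "tag":
            parts.append(value)
        else:
            parts.append("".join(WRITE_MAP.get(ch, ch) for ch in value))
    return "".join(parts)


source = "Acme&reg; &amp; Co&nbsp;&copy; it&apos;s"
result = write_out(read_in(source))
print(result)
```

Here &reg; and &apos; are read as characters and written literally, &amp; is read as a character but written back as an entity, and &nbsp; and &copy; pass through untouched as tags.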

  • Unknown said:
    The option to read the entity in as the actual character into the Editor (translation interface) and write it out as I choose should be available for all entities, don't you think?

    Hi Octavio,

    Yes I do think so... I'll raise it with the filetype developers.

    Thanks

    Paul

    Paul Filkin | RWS Group


  • Thanks Paul. I appreciate it. If I only got HTML source from a single location that consistently gave me either entities or actual characters, this would not be such a major issue, but since content is created and consumed in various tools and browsers, we need some flexibility in how things are read and written.
  • Hi Paul, I wonder if anything happened with this issue. It seems this is now possible in the XML2 file type: you can define how to read and write all entities individually. Unfortunately, this goodness is not part of the HTML and, more importantly, the Embedded HTML file type. Would the only option here be to create my very own HTML parser using the XML2 file type, or is this something we should expect to come to HTML sometime soon?