Character entities in CDATA elements

Hey everyone,

I'm struggling to convert character entities in the CDATA section of an XML file. The content is in German, so there are umlauts all over the place...

The source file is inconsistent in how it writes the entities: sometimes it uses numeric character references (&#228;), sometimes named entities (&auml;).

I know character entities are not needed in CDATA sections, but they're there nonetheless...

Here's a sample of the XML file:

<ARTICLE>
   <MAIN>
      <TI>Ursachen erkennen und beheben: Tr&#228;nkenh&ouml;he nicht angepasst</TI>
      <TE>
         <![CDATA[<p><strong>T&#228;gliche Anpassung der Tr&auml;nkenh&ouml;he ist n&ouml;tig!</strong></p>&#10;]]>
      </TE>
   </MAIN>
</ARTICLE>

The entities in the TI element are converted correctly. The ones in the CDATA element are not.

Screenshot of Trados Studio showing German text with character entities in the CDATA section not converted correctly.

Is there any way Studio can convert them?



  • The file was created in a custom CMS. I'm guessing the system is a bit flawed...

    Converting them into tags is not going to work, because the entities represent characters (mostly ä, ü, and ö). I could do that with regular expressions in the tag definition rules of the embedded content processor, but the entities are part of words that need to be translated. If I convert them into tags, I'll end up with error messages about missing tags.

    I'll get in touch with the client and try to convince him to clean it up at the source...

    Thanks for the info! Knowing that it's not possible is already a great help! (c:

  • There is a plugin named "HTMLTag" for Notepad++, which can convert HTML entities back and forth. You can use this plugin to clean the sources yourself. If it's not thousands of files, it should be a pretty easy task.

  • Hi Evzen,

    Thanks for the tip! I tried it, but unfortunately the plugin only encodes/decodes the numeric values; the named entities (&ouml;) are ignored.
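
For cleaning the source outside of Studio, a small script can decode both kinds of entity inside the CDATA sections only. The following is a minimal sketch in Python, not something posted in the thread: the function names are made up for illustration, and it assumes the cleaned file is saved as UTF-8. Entities that decode to markup characters or whitespace (such as &lt;, &amp; or &#10;) are deliberately left encoded so the embedded HTML stays intact.

```python
import html
import re

# Matches one CDATA section; group 2 is its content.
CDATA = re.compile(r"(<!\[CDATA\[)(.*?)(\]\]>)", re.DOTALL)
# Matches decimal, hexadecimal, and named character entities.
ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
# Characters that must stay encoded inside the embedded HTML.
MARKUP_CHARS = set("<>&\"'")


def decode_entity(match):
    """Decode one entity unless it stands for markup or whitespace."""
    entity = match.group(0)
    decoded = html.unescape(entity)  # unknown names are returned unchanged
    if decoded in MARKUP_CHARS or decoded.isspace():
        return entity
    return decoded


def decode_cdata_entities(xml_text):
    """Decode entities inside CDATA sections; leave everything else as-is."""
    def fix_section(match):
        content = ENTITY.sub(decode_entity, match.group(2))
        return match.group(1) + content + match.group(3)
    return CDATA.sub(fix_section, xml_text)


sample = "<TE><![CDATA[<p>T&#228;gliche Tr&auml;nkenh&ouml;he</p>&#10;]]></TE>"
print(decode_cdata_entities(sample))
# prints: <TE><![CDATA[<p>Tägliche Tränkenhöhe</p>&#10;]]></TE>
```

Text outside the CDATA sections, such as the TI element, is not touched, since Studio already handles the entities there.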

  • Hi Ken, 

    It works fine for me with both XML v1 and XML v2: Studio recognizes both the named HTML entity (&auml;) and the entity in decimal format (&#228;).

    Try using an HTML 5.2 embedded content processor for that element (or elements), or for the CDATA sections.

    Trados Studio project file type settings showing CDATA section rules highlighted in red.

    Preview of embedded content processing in Trados Studio with German text showing no visible errors.

    And in the HTML content processor, set the necessary entities:

    Entity conversion settings in Trados Studio with HTML 5 selected and entity mapping list visible.

    If you need the entities to be written back in decimal format (&#228;), you may have a problem. I don't think Studio can do it. I need that format for some of my projects, and I convert the entities with a Python script after exporting the files from Studio.

    I hope it helps, 

    Felipe
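
Felipe's script is not shown in the thread. As a rough idea of what such a post-export step could look like, here is a hypothetical sketch (the function name is invented for illustration) that rewrites every non-ASCII character as a decimal entity. It assumes the exported file has already been read into a string and that all non-ASCII characters, not only umlauts, should be encoded.

```python
import re

# Any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def to_decimal_entities(text):
    """Replace each non-ASCII character with its decimal entity (ä -> &#228;)."""
    return NON_ASCII.sub(lambda m: f"&#{ord(m.group(0))};", text)


print(to_decimal_entities("<p>Tägliche Anpassung der Tränkenhöhe</p>"))
# prints: <p>T&#228;gliche Anpassung der Tr&#228;nkenh&#246;he</p>
```

Because only non-ASCII characters are touched, tags and existing entities such as &amp; pass through unchanged.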
