Problem with XLIFF 2.0 filetype

I have a problem when importing a file with the XLIFF 2.0 filetype in Studio 2021.

The file itself is fine. Importing it into memoQ, for example, works absolutely fine.

In Studio, there is a problem with the segmentation. All sentences end up in one segment:

Screenshot of Trados Studio showing incorrect segmentation with all sentences combined into one segment.

I guess this is a bug. Any comments here?



[edited by: Trados AI at 5:45 AM (GMT 0) on 29 Feb 2024]
  • I think the only way to answer this question is to see the file.  Can you share it, or at least cut it down in a text editor and share the file with only a few segments that behave badly for you?  This way you could also anonymise the text if necessary.

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • <?xml version="1.0" encoding="UTF-8"?>
    <xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
           version="2.1"
           srcLang="de-DE"
           trgLang="en">
       <file id="file">
          <unit id="dialog">
             <segment id="dialog.name">
                <source>Ich bin der Titel des Formulars.</source>
             </segment>
          </unit>
          <unit id="dialog.demo-assistent-sandstein">
             <originalData>
                <data id="d5e2-start"><![CDATA[<i>]]></data>
                <data id="d5e2-end"><![CDATA[</i>]]></data>
                <data id="d5e5-start"><![CDATA[<a title="[Neues Fenster]" target="_blank" href="http://www.google.de">]]></data>
                <data id="d5e5-end"><![CDATA[</a>]]></data>
                <data id="d5e8-start"><![CDATA[<p>]]></data>
                <data id="d5e8-end"><![CDATA[</p>]]></data>
                <data id="d8e2-start"><![CDATA[<p>]]></data>
                <data id="d8e2-end"><![CDATA[</p>]]></data>
                <data id="d17e1-start"><![CDATA[<p>]]></data>
                <data id="d17e1-end"><![CDATA[</p>]]></data>
                <data id="d17e3-start"><![CDATA[<i>]]></data>
                <data id="d17e3-end"><![CDATA[</i>]]></data>
                <data id="d17e5-start"><![CDATA[<a title="[Neues Fenster]" target="_blank" href="http://www.google.de">]]></data>
                <data id="d17e5-end"><![CDATA[</a>]]></data>
                <data id="d17e8-start"><![CDATA[<p>]]></data>
                <data id="d17e8-end"><![CDATA[</p>]]></data>
             </originalData>
             <segment id="d5e5-href">
                <source>http://www.google.de</source>
             </segment>
             <segment id="d5e5-title">
                <source>[Neues Fenster]</source>
             </segment>
             <segment id="d17e5-href">
                <source>http://www.google.de</source>
             </segment>
             <segment id="d17e5-title">
                <source>[Neues Fenster]</source>
             </segment>
             <segment id="page.title">
                <source>Ich bin die Seitenüberschrift.</source>
             </segment>
             <segment id="page.intro">
                <source>Ich bin der Intro-Text einer Seite. Ich bin optional. Ich kann <pc id="d5e2" dataRefStart="d5e2-start" dataRefEnd="d5e2-end">HTML-Markup</pc> und <pc id="d5e5" dataRefStart="d5e5-start" dataRefEnd="d5e5-end">Links</pc> enthalten und aus mehreren Absätzen bestehen.<pc id="d5e8" dataRefStart="d5e8-start" dataRefEnd="d5e8-end"> Auf dieser Seite befinden sich nur einfache Textfelder. Die verschiedenen Intro-/Outro-Texte und die verschiedenen Beschriftungspositionen gibt es bei allen Feldtypen.</pc>
                </source>
             </segment>
          </unit>
       </file>
    </xliff>

    Hi Paul, sure. Here is the file. The problem appears after the CDATA section.

  • Hi

    Thanks for the file.  I think this is a bug so I have logged a support case (00654964) for it.

    I also tried to work around this using the Multilingual XML filetype, and this does resolve the segmentation problem you have shown here.  But it then introduces another segmentation problem, because we cannot segment the CDATA sections using the embedded content filter; this is a current limitation of the API.

    So unfortunately I think we're stuck until these problems are resolved by the core development team.  I'll come back to you if I learn that I'm mistaken and the problem can be resolved, or once I know the bug number so we can track progress.

    One possible solution, although a bit trickier, would be this:

    1. use regex to add a target element that contains the content of your source element
    2. create a new custom XML filetype that handles the target element only

    When you translate and save the target you'll have a fully translated bilingual XLIFF 2.1 file.
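As a rough illustration of step 1, here is a minimal Python sketch that copies the content of every <source> element into a new <target> element placed directly after it. The function name add_targets is made up for this example, and the pattern assumes <source> elements without attributes and a file that has no <target> elements yet (running it twice would add a second copy).

```python
import re

# Matches a <source> element without attributes, including content that
# spans several lines, and remembers the indentation in front of it.
SOURCE_RE = re.compile(
    r"(?P<indent>[ \t]*)<source>(?P<body>.*?)</source>", re.DOTALL
)


def add_targets(xliff_text):
    """Insert a <target> copy of the content after every <source> element."""

    def copy_to_target(match):
        indent = match.group("indent")
        body = match.group("body")
        return (
            f"{indent}<source>{body}</source>\n"
            f"{indent}<target>{body}</target>"
        )

    return SOURCE_RE.sub(copy_to_target, xliff_text)


# Small demo with one segment in the style of the sample file.
sample = (
    '<segment id="page.title">\n'
    "   <source>Ich bin die Seitenüberschrift.</source>\n"
    "</segment>"
)
print(add_targets(sample))
```

Inline tags such as <pc> are copied along with the text, so the target starts out as an exact duplicate of the source.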

    Paul Filkin | RWS Group

  • It seems that the behaviour is actually considered to be correct.  Technical support explained this quite well, and I have worked through the steps again below using your file.

    After some discussion about the expected behavior for segmentation on the XLIFF 2.0 files, we agreed that what the standard says is that agents should resegment, unless the canResegment attribute says otherwise:

    http://docs.oasis-open.org/xliff/xliff-core/v2.0/xliff-core-v2.0.html#canResegment

    Since the "canResegment" attribute is treated hierarchically, all the segments in a <unit> must have it set to "yes" for resegmentation to happen. By default, it is set to "yes". As soon as one segment has it set to "no", the segmentation process doesn't take place.

    Because of this merging, it is recommended that each segment end with whitespace in order to avoid concatenation of neighboring segments.

    What this means is that Trados will "resegment" all of the content inside the <unit> elements (so all <segment> elements) if the "canResegment" attribute is set to "yes" or is missing (the default is set to "yes").  In your file it's missing.
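To make the inheritance a bit more concrete, here is a small Python sketch that works out the value each <segment> ends up with when "canResegment" is inherited from the enclosing elements and falls back to "yes". The function name effective_can_resegment is invented for this illustration; it only models the rule as described here and says nothing about how Trados implements it internally.

```python
import xml.etree.ElementTree as ET

NS = "{urn:oasis:names:tc:xliff:document:2.0}"
STRUCTURAL = {NS + "file", NS + "group", NS + "unit", NS + "segment"}


def effective_can_resegment(xliff_text):
    """Return {segment id: effective canResegment value} for an XLIFF 2.0 string."""
    root = ET.fromstring(xliff_text)
    result = {}

    def walk(elem, inherited):
        # An explicit attribute wins; otherwise the parent's value is inherited.
        value = elem.get("canResegment", inherited)
        if elem.tag == NS + "segment":
            result[elem.get("id")] = value
            return
        for child in elem:
            if child.tag in STRUCTURAL:
                walk(child, value)

    walk(root, "yes")  # "yes" is the default when nothing is specified
    return result


sample = (
    '<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" '
    'srcLang="de-DE" trgLang="en"><file id="file">'
    '<unit id="u1"><segment id="s1"><source>A.</source></segment></unit>'
    '<unit id="u2" canResegment="no">'
    '<segment id="s2"><source>B.</source></segment></unit>'
    "</file></xliff>"
)
print(effective_can_resegment(sample))
```

In the file from this thread no element carries the attribute, so every segment would come out as "yes".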

    (STEP1) In this specific example, Trados merges all of the content into two Translation units:

    no.1 -> Ich bin der Titel des Formulars.
    no.2 -> http://www.google.de[Neues Fenster]http://www.google.de[Neues Fenster]Ich bin die Seitenüberschrift.Ich bin der Intro-Text einer Seite. Ich bin optional. Ich kann HTML-Markup und Links enthalten und aus mehreren Absätzen bestehen. Auf dieser Seite befinden sich nur einfache Textfelder. Die verschiedenen Intro-/Outro-Texte und die verschiedenen Beschriftungspositionen gibt es bei allen Feldtypen.

    (STEP2) Then the default language processing rule (or a TM if you have one), and the segmentation rules defined there (i.e. sentence-based segmentation) kick in. So the 2nd "merged" TU is split into smaller segments according to the TM segmentation, and you end up with this in the Editor:

    no.1 -> Ich bin der Titel des Formulars.
    no.2 -> http://www.google.de[Neues Fenster]http://www.google.de[Neues Fenster]Ich bin die Seitenüberschrift.Ich bin der Intro-Text einer Seite. (because the whitespace is missing after the word "Seitenüberschrift.", segmentation doesn't occur here; this is also specified in the documentation linked above)
    no.3 -> Ich bin optional.
    no.4 -> Ich kann HTML-Markup und Links enthalten und aus mehreren Absätzen bestehen.
    no.5 -> Auf dieser Seite befinden sich nur einfache Textfelder.
    no.6 -> Die verschiedenen Intro-/Outro-Texte und die verschiedenen Beschriftungspositionen gibt es bei allen Feldtypen.

    Like this:

    Screenshot of Trados Studio showing merged content into two translation units with highlighted HTML markup tags and links.
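The two steps can be imitated with a few lines of Python. This is only a simplified model: the "segmentation rule" here is a single regular expression that splits after ".", "!" or "?" when whitespace follows, which is an assumption standing in for the real language processing rules, and the inline tags have been stripped from the text.

```python
import re


def merge_unit(segments):
    """STEP1: concatenate all segments of a unit without adding anything."""
    return "".join(segments)


def split_sentences(text):
    """STEP2: split after sentence-final punctuation followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text) if part]


# Segments of the second <unit> in the sample file, tags removed.
unit_segments = [
    "http://www.google.de",
    "[Neues Fenster]",
    "http://www.google.de",
    "[Neues Fenster]",
    "Ich bin die Seitenüberschrift.",
    "Ich bin der Intro-Text einer Seite. Ich bin optional. "
    "Ich kann HTML-Markup und Links enthalten und aus mehreren "
    "Absätzen bestehen. Auf dieser Seite befinden sich nur einfache "
    "Textfelder. Die verschiedenen Intro-/Outro-Texte und die "
    "verschiedenen Beschriftungspositionen gibt es bei allen Feldtypen.",
]

for number, sentence in enumerate(split_sentences(merge_unit(unit_segments)), 2):
    print(f"no.{number} -> {sentence}")
```

Because nothing separates "Seitenüberschrift." from "Ich bin der Intro-Text", the first split only happens after "Seite.", which mirrors segment no.2 in the list above.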

    If the "canResegment" attribute is added to the file and its value is set to "no", then the filter won't merge the TU "segment" elements.  So I edited the <file> element like this:

    <file id="file" canResegment="no" >

    STEP1 will no longer occur, so you'll see:

    no.1 -> Ich bin der Titel des Formulars.
    no.2 -> http://www.google.de
    no.3 -> [Neues Fenster]
    no.4 -> http://www.google.de
    no.5 -> [Neues Fenster]
    no.6 -> Ich bin die Seitenüberschrift.
    no.7 -> Ich bin der Intro-Text einer Seite. Ich bin optional. Ich kann HTML-Markup und Links enthalten und aus mehreren Absätzen bestehen. Auf dieser Seite befinden sich nur einfache Textfelder. Die verschiedenen Intro-/Outro-Texte und die verschiedenen Beschriftungspositionen gibt es bei allen Feldtypen.

    Like this:

    Screenshot of Trados Studio displaying content segmented into multiple translation units with 'canResegment' attribute set to 'no'.
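If you have many files, the edit to the <file> element could be scripted instead of made by hand. Here is a minimal Python sketch, assuming a plain text substitution is acceptable for your files. The function name disable_resegmentation is made up for this example, and a <file> element that already carries a "canResegment" attribute (with any value) is deliberately left alone.

```python
import re

# Matches the start of a <file ...> tag that has no canResegment attribute yet.
FILE_TAG_RE = re.compile(r"<file\b(?![^>]*\bcanResegment\s*=)")


def disable_resegmentation(xliff_text):
    """Add canResegment="no" to every <file> element that lacks the attribute."""
    return FILE_TAG_RE.sub('<file canResegment="no"', xliff_text)


print(disable_resegmentation('<file id="file">'))
```

A text substitution is used here rather than an XML parser so that the rest of the file, including the CDATA sections, stays exactly as it was.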

    But now you've lost the sentence segmentation that would otherwise be handled by the TM, and you have a large chunk of text in one segment: number 7.  Each <segment> in the XLIFF file is treated as a complete segment and is not segmented any further.

    A potential solution here is to make sure that the translatable content in each <segment> element ends in whitespace to avoid concatenation.  This will not fix the "segment merging" of content that doesn't end with a full stop (or another punctuation mark that enforces segmentation), but it would still look better in the editor because the content will be separated by a space rather than "glued" together.
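For what it's worth, the trailing whitespace could be added with a short script before import. This is a minimal Python sketch using an invented helper name, pad_sources; whether the added space survives exactly as written may depend on how the tool reading the file treats whitespace, so treat it as something to test on a copy first.

```python
import re


def pad_sources(xliff_text):
    """Add one space before </source> unless the content already ends in whitespace."""
    return re.sub(r"(?<=\S)</source>", " </source>", xliff_text)


print(pad_sources("<source>Ich bin die Seitenüberschrift.</source>"))
```

Sources whose closing tag already follows a space or a line break, like the long "page.intro" one in the sample, are left untouched, so the function can be run more than once without piling up spaces.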

    So, I then decided to take a look at how memoQ would handle these two files seeing as you mentioned it.  I created a view for both my test files (your original and the one I added canResegment="no" to) and see this:

    Screenshot of Trados Studio editor with content listed in numbered segments, including URLs and HTML markup, without visible errors or warnings.

    Both examples are handled exactly the same way, whether I use canResegment or not.  It appears that memoQ treats the XLIFF as if canResegment="no" were set, as in the second example I showed above in Trados Studio.  I believe this is actually incorrect because it ignores the specification defaults for segmentation.

    I hope this helps clarify this situation.  There is no bug.

    Paul Filkin | RWS Group

  • In the meantime!!

    We have been working on improving the ability of the Multilingual XML filetype to segment some of these tricky areas and now I can do this:

    Screenshot of Trados Studio software showing a side-by-side comparison of source and target text segments with Multilingual XML FileType version 1.0.0.0. No visible errors or warnings.

    I can do this without caring about the rules for XLIFF 2.0 at all and I now get the perfect solution for you, better than any of the workarounds and better than memoQ ;-)  We are still wrapping up some testing before we release to the appstore but this is a really neat solution I think.

    Paul Filkin | RWS Group

  • In fact I'm wrong... good job we're testing still!  We're duplicating the content in the CDATA section!  We'll fix that too!

    I was too excited too soon!

    Paul Filkin | RWS Group

  • ok - the updated version is live on the appstore as of yesterday, and handles this scenario properly:

    Trados Studio interface showing a translation project with Multilingual XML File Type version 1.0.0.0. Text segments are displayed with source and target columns, including URLs and text fields.

    My initial comments on the use of XLIFF 2.0 are still valid, but using the Multilingual XML filetype will allow you to ignore the specification and just handle the file as you'd like.

    Paul Filkin | RWS Group

  • Great, thanks a lot! I just downloaded the Multilingual XML filetype from the AppStore, but I could not open my file. Are there any settings I need to configure?

  • Are there any settings I need to configure?

    Of course.  This is a very flexible filetype that allows all kinds of multilingual filetypes to be handled.  Here's what I used for your file:

    Trados Studio settings window showing Multilingual XML file type with a warning to make sure *.xliff is included in the file dialog wildcard expression.

    Trados Studio language mapping settings with a note to add the path to the languages root, indicating English (United Kingdom) as target and German (Germany) as source.

    I think that will do it for the sample you provided.

    Paul Filkin | RWS Group
