Please weigh in on the XLIFF standard and how it affects the Translation Memory workflow

I would like to ask the community to weigh in on the XLIFF standard. It occurs to me that the bilingual XLIFF files we often see in WordPress exports are not at all optimized for Translation Memory workflows. A couple of issues (let's assume all examples define English as the source language):

  1. XLIFF files are, by their bilingual nature, pre-segmented by each instance of <source> and <target> content. I understand that Studio's full stop segmentation rules cannot be applied because of the bilingual nature of the file. I've read elsewhere on SDL's community page that users report very long paragraphs being extracted as single segments. This is our experience as well. This makes working with segmented Translation Memories impossible. The only resolution seems to be to split segments manually, which would be a cumbersome process.
  2. Although the XLIFF standard doesn't seem to have provisions for embedded content or CDATA support, the reality is that WordPress exports do contain HTML content and WordPress shortcodes. Studio's XLIFF filter does not allow processing embedded content. The typical workaround is to use a custom XML filter. This works great, but it requires you to process the <target> elements, because an XML filter replaces the text it extracts. It does not map the source and target in Studio the way the XLIFF filter does. Using the XML filter also takes care of segmentation.
    • I understand that there are some standards around tag element conversion in the native XLIFF file that may help (I have not tested this yet), but that doesn't resolve the segmentation issue.
  3. The XLIFF standard defines that <target> should always contain the latest translation. When the source content gets updated, we receive an updated English <source> and a <target> pre-populated with the old translation. This wreaks total havoc with our XML filter because we lose the ability to process <target> to get the latest source. Processing the source is an option, but it replaces <source>, so a manual workaround would be needed. I can foresee several creative ways. One option is to strip the <target> elements from the source XLIFF file, process the <source> segments using our XML filter, translate with the help of our TM and then export to XLIFF. What we have now is an XLIFF file with the target translations in <source>. The manual workaround would be to replace <source> with <target> and </source> with </target>. Then the only thing left to do is to somehow put the English <source> back for each item. Maybe a compare of the files could work, where you reject the deletion of the original source and accept the addition of the target. However, how cumbersome is that when you deal with many files?
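
The last step of the workaround in item 3 (putting the English <source> back next to the new translation) could in principle be scripted rather than done through a file compare. Below is a minimal sketch in Python, not a tested production tool. It assumes the files are XLIFF 1.2, that every trans-unit carries a stable `id` attribute in both files, and that the translated file has the translation sitting in <source> as described above. The function name `merge_translations` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the export uses the XLIFF 1.2 namespace. Adjust for other versions.
NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", NS)  # serialize without an "ns0:" prefix


def merge_translations(updated_xliff: str, translated_xliff: str) -> str:
    """Keep <source> from updated_xliff and fill each <target> with the
    translation that the XML-filter round trip left in <source> of
    translated_xliff. Units are matched on their id attribute."""
    updated = ET.fromstring(updated_xliff)
    translated = ET.fromstring(translated_xliff)

    # Map trans-unit id -> <source> element holding the new translation.
    translation_by_id = {}
    for unit in translated.iter(f"{{{NS}}}trans-unit"):
        src = unit.find(f"{{{NS}}}source")
        if src is not None:
            translation_by_id[unit.get("id")] = src

    for unit in updated.iter(f"{{{NS}}}trans-unit"):
        new_translation = translation_by_id.get(unit.get("id"))
        if new_translation is None:
            continue  # no translation for this unit: leave it untouched
        target = unit.find(f"{{{NS}}}target")
        if target is None:
            target = ET.SubElement(unit, f"{{{NS}}}target")
        # Drop the old translation (text and inline elements) ...
        for child in list(target):
            target.remove(child)
        # ... and copy in the new one.
        target.text = new_translation.text
        target.extend(list(new_translation))

    return ET.tostring(updated, encoding="unicode")
```

Matching on `id` only works if the exporting system keeps the ids stable between exports, which would need to be checked for the WordPress plugin in question.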

In conclusion: Right now the only productive process I can figure out is the XML route for original content as long as <target> is populated with English source content. However, updated files become a cumbersome process.

I've tried to research the history behind XLIFF a little, but I can't determine whether the Translation Memory workflow was ever considered in developing this standard. The bilingual nature of XLIFF may be very helpful to developers and certainly extends translation to a wider audience, but I feel that this standard could have a negative impact on the quality of translations delivered by LSPs. XLIFF seems to take us away from having the Translation Memory (TM, MT, etc.) as the central database for all translation management. Nowhere else in translation management do we typically deal with bilingual files, because we manage translations and translation updates from the English source at all times.

I hope SDL can weigh in on best practices for working with XLIFF, and that translation professionals and LSPs can weigh in on the impact it has on their workflow. I'm looking particularly for ideas that either remedy the problems I indicated with XLIFF or describe best practices for workarounds, including perhaps custom filters. I found one post that mentioned a filter that handles embedded content in XLIFF, but the link is down. Even if that is resolved, I don't foresee that the segmentation issue can realistically be resolved in bilingual files.

Many thanks! 

Jeroen Tetteroo

Language Solutions

St. Louis, MO

Using Studio 2017, 2014 Professional 

  • Unknown said:
    In conclusion: Right now the only productive process I can figure out is the XML route for original content as long as <target> is populated with English source content. However, updated files become a cumbersome process.

    Hi Jeroen,

    Perhaps a way forward, for now (messy I know), is this:

    1. Open the XLIFF files and update into a TM all the segments that contain a target translation
    2. Copy the source content to the target elements using regex in a text editor
    3. Translate the target elements as XML, and update from the TM you created in 1.
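
Step 2 above is described as a regex search and replace in a text editor. The same idea can be sketched as a small Python script, which may be easier to repeat over many files. This is only an illustration under some assumptions: each <target> directly follows its <source> with only whitespace in between, the elements are not self-closing, and step 1 has already been done, because the old translations are overwritten. The name `copy_source_to_target` is made up for this example.

```python
import re

# One <source>...</source> immediately followed by <target>...</target>.
# Units that have no <target> at all are not matched and stay unchanged.
SOURCE_THEN_TARGET = re.compile(
    r"(<source(?:\s[^>]*)?>)(.*?)(</source>)(\s*)<target(\s[^>]*)?>.*?</target>",
    re.DOTALL,
)


def copy_source_to_target(xliff_text: str) -> str:
    """Overwrite the content of each <target> with the content of the
    <source> that precedes it. Attributes on <target> are kept as they are."""

    def replace(match: re.Match) -> str:
        open_tag, content, close_tag, gap, target_attrs = match.groups()
        return (
            f"{open_tag}{content}{close_tag}{gap}"
            f"<target{target_attrs or ''}>{content}</target>"
        )

    return SOURCE_THEN_TARGET.sub(replace, xliff_text)


sample = (
    '<trans-unit id="1">\n'
    "  <source>Hello <b>world</b>.</source>\n"
    "  <target>Hallo wereld.</target>\n"
    "</trans-unit>"
)
print(copy_source_to_target(sample))
```

A regex is used here only because that is what the step describes. For files with unusual layout an XML parser would be the more robust choice.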

    Ideally the solution is for us to provide better support for embedded content. We did start work on this quite a while ago, but got so little feedback on the Betas that we dropped it temporarily. In all honesty it still only rears its head now and again, so the interest only seems to come from a few people.

    However, we do need to do this.  I looked at the old Beta in Studio 2017 last month and it doesn't work at all, so we have two options:

    1. Schedule the work to fix it and complete it
    2. Core Studio Development addresses this properly with the built-in XLIFF filetype

    Given the workload our appstore team has at the moment I know we won't get to 1. for quite a while, and I know that core development have been looking at 2. for a few months. So I'm hopeful we might have something from them sooner, and then we'd get around to 1.

    On the XLIFF specification itself: I'm not an expert, but I do believe using CDATA is perfectly allowable. Some companies producing XLIFF do create awful XLIFF and would probably do us all a favour if they used multilingual XML instead. But that's probably a problem of the industry telling everyone to use XLIFF as the exchange format and then not locking down how it's used.
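
To illustrate why CDATA content ends up needing embedded content processing, here is a small Python sketch with a made-up XLIFF 1.2 fragment (the sample content and the helper name `source_texts` are invented for this example). A CDATA section is an XML-level construct, so a standard XML parser hands its content over as one plain text string, HTML tags and shortcodes included.

```python
import xml.etree.ElementTree as ET

# Assumption: XLIFF 1.2 namespace, as in the other examples in this thread.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical export fragment with HTML and a shortcode wrapped in CDATA.
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
  <file source-language="en" target-language="nl" datatype="html" original="post-1">
    <body>
      <trans-unit id="1">
        <source><![CDATA[<p>Read the <a href="/guide">guide</a>. [gallery ids="1,2"]</p>]]></source>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def source_texts(xliff_text: str) -> list:
    """Return the text of every <source> element as the parser sees it."""
    root = ET.fromstring(xliff_text)
    return [source.text for source in root.findall(".//x:source", NS)]


# The markup arrives as text, not as child elements of <source>:
print(source_texts(SAMPLE)[0])
# prints: <p>Read the <a href="/guide">guide</a>. [gallery ids="1,2"]</p>
```

Any filter that reads the file this way sees the tags and shortcodes as translatable text unless a second pass recognizes them as embedded content.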

    Regards

    Paul

    Paul Filkin | RWS Group

