Please weigh in on the XLIFF standard and how it affects the Translation Memory workflow

I would like to ask the community to weigh in on the XLIFF standard. It occurs to me that the bilingual XLIFF files we often see in WordPress exports are not at all optimized for Translation Memory workflows. A couple of issues (let's assume all examples define English as the source language):

XLIFF files are, by their bilingual nature, pre-segmented by each instance of <source> and <target> content. I understand that the bilingual nature of the file places a practical limitation on Studio's full-stop segmentation rules. I've read elsewhere on SDL's community page that users report very long paragraphs being extracted. This is our experience as well. This makes working with segmented Translation Memories impossible. The only resolution seems to be to split segments manually, which would be a cumbersome process.
Although the XLIFF standard doesn't seem to have provisions for embedded content or CDATA support, the reality is that WordPress exports do generate HTML content and WordPress shortcodes. Studio's XLIFF filter does not allow processing embedded content. The typical workaround is simply to use a custom XML filter. This works great, but it does require you to process the <target> segments, because XML filters replace text. They do not map the source and target in Studio the way the XLIFF filter does. Using the XML filter also takes care of segmentation.
- I understand that there are some standards around tag element conversion in the native XLIFF file that may help (I have not tested this yet), but that doesn't resolve the segmentation issue.
The XLIFF standard defines that <target> should always contain the latest translation. When the source content gets updated, we receive an updated English <source> and a pre-populated old <target> translation. This wreaks total havoc with our XML filter because we lose the ability to process <target> to get the latest source. Processing the source is an option, but it replaces <source>. A manual workaround would be needed. There are several creative ways I could foresee. One option is to strip the <target> elements from the source XLIFF file, process the <source> segments using our XML filter, translate with the help of our TM and then export out to XLIFF. Now what we have is an XLIFF file with target translations in <source>. The manual workaround would be to replace <source> with <target> and </source> with </target>. Now the only thing you need to do is somehow put back the English <source> for each item. Maybe a compare of the files could work, where you reject the deletion of the original source and accept the addition of the target. However, how cumbersome is that when you deal with many files?
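
The merge step described above (put the English <source> back and move the translation into <target>) could probably be scripted instead of done with a file compare. A minimal sketch in Python, assuming XLIFF 1.2 files, plain-text or CDATA content without inline elements or <seg-source>, and trans-unit ids that are identical in both files. The function name `merge_translations` and the helper `q` are made up for illustration:

```python
import xml.etree.ElementTree as ET

NS = "urn:oasis:names:tc:xliff:document:1.2"  # assumption: XLIFF 1.2
ET.register_namespace("", NS)


def q(name: str) -> str:
    """Qualify an element name with the XLIFF namespace."""
    return f"{{{NS}}}{name}"


def merge_translations(original_xml: str, translated_xml: str) -> str:
    """Keep the English <source> of the original file and write the
    translations (which sit in <source> of the round-tripped file)
    into <target>. Units are matched on their trans-unit id."""
    original = ET.fromstring(original_xml)
    translated = ET.fromstring(translated_xml)

    # id -> translated text taken from <source> of the round-tripped file
    translations = {}
    for unit in translated.iter(q("trans-unit")):
        src = unit.find(q("source"))
        if src is not None:
            translations[unit.get("id")] = src.text or ""

    for unit in original.iter(q("trans-unit")):
        source = unit.find(q("source"))
        unit_id = unit.get("id")
        if source is None or unit_id not in translations:
            continue  # nothing to merge for this unit
        target = unit.find(q("target"))
        if target is None:
            # Create <target> directly after <source>
            target = ET.Element(q("target"))
            unit.insert(list(unit).index(source) + 1, target)
        target.text = translations[unit_id]

    return ET.tostring(original, encoding="unicode")
```

Looped over a folder, this would replace the per-file compare. One caveat: ElementTree does not preserve CDATA sections. It writes the same content back as escaped text, which is equivalent XML but may matter to the system that imports the file.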

In conclusion: Right now the only productive process I can figure out is the XML route for original content as long as <target> is populated with English source content. However, updated files become a cumbersome process.

I've tried to research the history behind XLIFF a little, but I can't determine whether Translation Memory workflows were ever considered in developing this standard. The bilingual nature of XLIFF may be very helpful to developers and certainly extends translation to a wider audience, but I feel that this standard could have a negative impact on translation quality by LSPs. XLIFF seems to take us away from having the Translation Memory (MT, TM, etc) as the central database for all translation management. Nowhere else in translation management do we typically deal with bilingual files because we manage translation and translation updates from the English source at all times.

I hope SDL can weigh in on best practices for working with XLIFF, and that any translation professional or LSP can weigh in on the impact it has on their workflow. I'm looking particularly for ideas that either remedy the problems I indicated with XLIFF or best practices on workarounds, including perhaps custom filters. I found one post that mentioned a filter that handles embedded content in XLIFF but the link is down. Even if that is resolved, I don't foresee that the segmentation issue can realistically be resolved in bilingual files.

Many thanks!

Jeroen Tetteroo

Language Solutions

St. Louis, MO

Using Studio 2017, 2014 Professional

Jerzy Czopik over 8 years ago

Thank you for raising this.

From how I understand the XLIFF standard (http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html), and I don't think it is only me, using anything other than what is defined there makes the file non-conformant with the standard. So using any CDATA elements, which is obviously done often, is against the standard. Such a file is not a proper XLF. When we encounter such files, we do the same as you: process them with an XML file type.

So I don't think this is a problem of Studio, but a problem with the source file.

That said, I also think it would be great if Studio allowed cascading of filters, running an HTML parser after the XLF parser.

However, maybe some more enlightened colleagues will have better explanations and/or solutions.

_________________________________________________________

When asking for help here, please be as accurate as possible. Please always remember to give the exact version of product used and all possible error messages received. The better you describe your problem, the better help you will get.

Want to learn more about Trados Studio? Visit the Community Hub. Have a good idea to make Trados Studio better? Publish it here.

Paul over 8 years ago in reply to Jerzy Czopik

Jerzy Czopik said:
So using any CDATA elements, which is obviously done often, is against the standard.

This is incorrect Jerzy. It is acceptable to use CDATA in XLIFF:

http://docs.oasis-open.org/xliff/v1.2/xliff-profile-html/xliff-profile-html-1.2-cd02.html#General_CDATA

Paul Filkin | RWS Group

________________________
Design your own training!
You've done the courses and still need to go a little further, or still not clear?
Tell us what you need in our Community Solutions Hub

Jerzy Czopik over 8 years ago in reply to Paul

This is incorrect Jerzy. It is acceptable to use CDATA in XLIFF:

So then why is it not possible to have a parser applied for that in the XLIFF file type? I was assuming this was because CDATA shouldn't be there...

Paul over 8 years ago
Jeroen Tetteroo said:
In conclusion: Right now the only productive process I can figure out is the XML route for original content as long as <target> is populated with English source content. However, updated files become a cumbersome process.

Hi Jeroen,

Perhaps a way forward, for now (messy I know), is this:

1. Open the XLIFF files and update into a TM all the segments that contain target translation

2. Copy the source content to the target elements using regex in a text editor

3. Translate the target elements as XML, update from the TM you created in 1.
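
The first two steps might also be done with a short script rather than a text editor. A rough sketch in Python, assuming XLIFF 1.2 and units without inline elements; `harvest_and_reset` and `q` are hypothetical names, and how the harvested pairs get into a TM (for example via a tab-delimited or TMX import) is left open:

```python
import xml.etree.ElementTree as ET

NS = "urn:oasis:names:tc:xliff:document:1.2"  # assumption: XLIFF 1.2
ET.register_namespace("", NS)


def q(name: str) -> str:
    """Qualify an element name with the XLIFF namespace."""
    return f"{{{NS}}}{name}"


def harvest_and_reset(xliff_xml: str):
    """Return (pairs, reset_xml).

    pairs:     (source text, existing target text) for every unit whose
               <target> is not empty; candidate input for a TM.
    reset_xml: the same document with each <target> overwritten by the
               text of its <source>, ready for an XML file type.
    """
    root = ET.fromstring(xliff_xml)
    pairs = []
    for unit in root.iter(q("trans-unit")):
        source = unit.find(q("source"))
        if source is None:
            continue
        source_text = source.text or ""
        target = unit.find(q("target"))

        # Step 1: remember the existing translation, if there is one
        if target is not None and (target.text or "").strip():
            pairs.append((source_text, target.text))

        # Step 2: copy source into target (create <target> if missing)
        if target is None:
            target = ET.Element(q("target"))
            unit.insert(list(unit).index(source) + 1, target)
        target.text = source_text

    return pairs, ET.tostring(root, encoding="unicode")
```

As noted earlier in the thread, an updated export can pair new source text with an old translation, so the harvested pairs may need a review before they go into a TM.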

Ideally the solution is for us to provide better support for embedded content. We did start work on this quite a while ago but got such little feedback on the Betas that we dropped it temporarily and in all honesty it still only rears its head now and again so the interest only seems to come from a few people.

However, we do need to do this. I looked at the old Beta in Studio 2017 last month and it doesn't work at all, so we have two options:

1. Schedule the work to fix it and complete it

2. Core Studio Development address this properly with the built in XLIFF filetype

Given the workload our appstore team has at the moment I know we won't get to 1. for quite a while, and I know that core development have been looking at 2. for a few months. So I'm hopeful we might have something from them sooner and then we'd get around to 1.

On the XLIFF specification itself: I'm no expert, but I do believe using CDATA is perfectly allowable. Some companies producing XLIFF do create awful XLIFF and would probably do us all a favour if they could use multilingual XML instead. But that's probably a problem of the industry telling everyone to use XLIFF as the exchange file and then not locking down how it's used.

Regards

Paul

Paul Filkin | RWS Group

Paul over 8 years ago in reply to Jerzy Czopik

Jerzy Czopik said:
So then why it is not possible to have a parser applied for that in the XLIFF file type?

Technical difficulties Jerzy. It has nothing to do with anything other than technical challenges. We have asked the filetype team for this for years and if it was easy we would certainly have done it by now! I believe it's the bilingual nature of the filetype that causes the difficulties for us, but in reality where the problem comes from is irrelevant to me. I know it's hard because when my team tried to do this by extending the XLIFF it did present some very complex problems and this is also why we did not finish this. We probably would have got there but the lack of interest from anyone made it easier to do other things that people actually wanted.

Paul Filkin | RWS Group

Melissa Wurst over 8 years ago in reply to Paul

Thanks to both for the great conversation. I appreciate the workaround Paul and that may work in a pinch. However, I would still like to highlight that XLIFF also has a major problem with segmentation. I found out that the standard does allow segmentation using <mrk mtype="seg"> in the code. However, I take it that this would have to be supported by the platform that provides the XLIFF file. I have not seen this used in WordPress exports yet and I don't know how easy it is to provide files automatically with this type of segmentation. See: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_Segmentation
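
For reference, XLIFF 1.2 carries this segmentation in a <seg-source> element that repeats the source text wrapped in <mrk mtype="seg"> markers. A minimal sketch in Python of how an exporting platform, or a pre-processing step, might add it. The full-stop rule here is deliberately naive, units with inline elements are skipped, and `add_seg_source` and `q` are names invented for this example. Whether a given tool then honours the markers is not something the sketch can confirm:

```python
import re
import xml.etree.ElementTree as ET

NS = "urn:oasis:names:tc:xliff:document:1.2"  # assumption: XLIFF 1.2
ET.register_namespace("", NS)

# Naive rule: break after . ! or ? when followed by whitespace and a capital
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def q(name: str) -> str:
    """Qualify an element name with the XLIFF namespace."""
    return f"{{{NS}}}{name}"


def add_seg_source(xliff_xml: str) -> str:
    """Add <seg-source> with one <mrk mtype="seg"> per sentence."""
    root = ET.fromstring(xliff_xml)
    for unit in root.iter(q("trans-unit")):
        source = unit.find(q("source"))
        # Skip empty sources, sources with inline elements, and units
        # that are already segmented.
        if source is None or len(source) > 0 or not source.text:
            continue
        if unit.find(q("seg-source")) is not None:
            continue

        sentences = SENTENCE_BREAK.split(source.text)
        seg_source = ET.Element(q("seg-source"))
        for mid, sentence in enumerate(sentences, start=1):
            mrk = ET.SubElement(
                seg_source, q("mrk"), {"mtype": "seg", "mid": str(mid)}
            )
            mrk.text = sentence
            if mid < len(sentences):
                mrk.tail = " "  # original whitespace is normalised to one space

        # <seg-source> goes directly after <source>
        unit.insert(list(unit).index(source) + 1, seg_source)
    return ET.tostring(root, encoding="unicode")
```

A real implementation would want the same segmentation rules as the Translation Memory (abbreviations, other scripts and so on) instead of this single regular expression.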

I can speculate about other possible workarounds that do not require the XLIFF file to be provided with the segmentation coding, but I'm not sure if SDL would consider even looking at that. I think you'd have to do something in the background where you process source and target, then perhaps clear the target and use a second processing step to segment the source text using the full-stop rule (or whatever segmentation is set in the Translation Memory settings). The bilingual nature of XLIFF just really complicates this process.

Regarding CDATA: I agree with both that 1. CDATA support is part of the standard, but that 2. there do not seem to be any provisions for working with HTML embedded content. As far as I understand, CDATA typically is used because of the need to preserve coding in text. So it can be expected that if you find CDATA, you will find code. I did find some support for inline elements but the application seems limited: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Specs_Elem_Inline
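
To illustrate what "CDATA means code" looks like to a filter: an XML parser hands the CDATA content over as one plain string, and a second pass with an HTML parser is needed to separate text from markup. A small sketch using only the Python standard library, assuming XLIFF 1.2; `TextRuns`, `translatable_runs` and `q` are illustrative names, and WordPress shortcodes such as [gallery] would still come through as text here because they are not HTML:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

NS = "urn:oasis:names:tc:xliff:document:1.2"  # assumption: XLIFF 1.2


def q(name: str) -> str:
    """Qualify an element name with the XLIFF namespace."""
    return f"{{{NS}}}{name}"


class TextRuns(HTMLParser):
    """Collect the text between HTML tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.runs = []

    def handle_data(self, data):
        if data.strip():
            self.runs.append(data.strip())


def translatable_runs(xliff_xml: str) -> dict:
    """Map each trans-unit id to the text runs found in its <source>."""
    root = ET.fromstring(xliff_xml)
    result = {}
    for unit in root.iter(q("trans-unit")):
        source = unit.find(q("source"))
        if source is None:
            continue
        parser = TextRuns()
        parser.feed(source.text or "")  # CDATA arrives here as plain text
        parser.close()
        result[unit.get("id")] = parser.runs
    return result
```

This is the two-stage idea raised earlier in the thread (an HTML parser run after the XLF parser) in its simplest form. It only reads text out; writing translations back while keeping the tags in place is the harder part.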

Both solutions provided by the OASIS standard seem cumbersome. I'm not at all convinced XLIFF was ever intended for the professional translation industry, or these sorts of considerations would have been taken into account. And given that many platforms already fail to adhere to the standard by providing messy XLIFF files, I wonder if we would ever get there.

I like the idea of SDL considering working on a test filter again and I'm happy to be a beta tester for this. I'm surprised that there is so little feedback from other users because the problem should be so obvious to CAT tool users. However, maybe a motivating factor to move forward with this is that we are working together with the WPML team to get their WordPress plugin to work with independent LSPs like us. It's because of our work with them in testing their XLIFFs that the issue of the standard came up. I assume more LSPs will join WPML once they offer this to independent LSPs, and that may increase the need for proper Studio support.

Thanks!

Jeroen (Language Solutions team)

Paul over 8 years ago in reply to Melissa Wurst

Melissa Wurst said:
However, I would still like to highlight that XLIFF also has a major problem with segmentation. I found out that the standard does allow segmentation using <mrk mtype="seg"> in the code. However, I take it that this would have to be supported by the platform that provides the XLIFF file.

Hi Jeroen,

I forgot about that bit. I think checking this option prior to processing the file would help?

Paul Filkin | RWS Group

Melissa Wurst over 8 years ago in reply to Paul

Thanks! Is that when the XLIFF has the segmentation information? I don't think this setting affected my file, from what I remember from testing. The file we tested does not have any of that <mrk mtype="seg"> data. I would have to check with the WPML team to see if they can set that up automatically. I "guess" we could do a find/replace action in the XLIFF file if we have to (. [A-Z] or something like that for .<mrk mtype="seg">.)

I'll do some filter preview in Studio 2017 (love that feature!)

Uta Moncur over 7 years ago in reply to Melissa Wurst

I have now read all the discussions around XLIFF from WPML and while I agree that it appears to be an issue of what WPML is outputting, it would be great if Studio had a filter that would handle it better. I see that I can enable embedded content processing, but the large segments are a big issue. We would really like to be able to process WPML files without a ton of tech workarounds. Would be happy to beta test any new filters. I have a WP sandbox and WPML Translation Hub stage set up currently. Thanks!

Evzen Polenka over 7 years ago in reply to Uta Moncur

You really should "bark up the right tree" in the first place, i.e. complain to the WPML authors.
Without enough people complaining to THEM it's highly unlikely that anything will be changed.
There are plenty of more important things to be done than working around someone else's screwups.

Not to mention that the entire WPML workflow is pretty illogical: as mentioned in the original post, when the source text gets updated, WPML exports the NEW source text with the OLD target text... which is totally weird and just wrong (it's probably meant as some kind of "TM fuzzy-match suggestion" for the translator... i.e. it shows that the authors really had no idea about localization processes when designing it all).