Please weigh in on the XLIFF standard and how it affects the Translation Memory workflow

I would like to ask the community to weigh in on the XLIFF standard. It occurs to me that the bilingual XLIFF files we often see in WordPress exports are not at all optimized for Translation Memory workflows. A couple of issues (let's assume all examples define English as the source language):

  1. XLIFF files, by their bilingual nature, are pre-segmented by each instance of <source> and <target> content. I understand the practical limitation this places on Studio's full-stop segmentation rules. I've read elsewhere on SDL's community pages that users report very long paragraphs being extracted, and this is our experience as well. This makes working with segmented Translation Memories impossible. The only resolution here seems to be to split segments manually, which would be a cumbersome process.
  2. Although the XLIFF standard doesn't seem to have provisions for embedded content or CDATA support, the reality is that WordPress exports do generate HTML content and WordPress shortcodes. Studio's XLIFF filter does not allow processing embedded content. The typical workaround is simply to use a custom XML filter. This works great, but it does require you to process the <target> segments, because an XML filter replaces text; it does not map source and target in Studio the way the XLIFF filter does. Using the XML filter also takes care of segmentation.
    • I understand that there are some provisions in the standard around tag element conversion in the native XLIFF file that may help (I have not tested this yet), but it doesn't resolve the segmentation issue.
  3. The XLIFF standard defines that <target> should always contain the latest translation. When the source content gets updated, we receive an updated English <source> and a pre-populated old <target> translation. This wreaks total havoc with our XML filter, because we lose the ability to process <target> to pick up the latest source content. Processing the source is an option, but then the translation replaces <source>, so a manual workaround would be needed. There are several creative ways I could foresee. One option is to strip the <target> elements out of the source XLIFF file, process the <source> segments with our XML filter, translate with the help of our TM, and then export back out to XLIFF. What we then have is an XLIFF file with the target translations sitting in <source>. The manual workaround would be to replace <source> with <target> and </source> with </target> (a rough sketch of these steps follows this list). The only thing left to do is somehow restore the English <source> for each item. Maybe a compare of the files could work, where you reject the deletion of the original source and accept the addition of the target. However, how cumbersome would that be when you are dealing with many files?

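For what it's worth, below is a minimal sketch of the two mechanical steps from item 3 (stripping <target>, then renaming the translated <source> back to <target>). It assumes an XLIFF 1.2 file with the usual urn:oasis:names:tc:xliff:document:1.2 namespace, the file names are only placeholders, and ElementTree does not preserve CDATA sections, so treat it as an illustration of the idea rather than a ready-made tool.

```python
# Rough sketch of the manual workaround from item 3. Assumes an XLIFF 1.2
# namespace; file names are placeholders. Caveat: ElementTree will not
# preserve CDATA sections, so this only illustrates the idea.
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)  # keep the default namespace on output


def localname(tag):
    """Element name without its namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def strip_targets(in_path, out_path):
    """Step 1: drop every <target> so only <source> is left for the XML filter."""
    tree = ET.parse(in_path)
    for unit in tree.iter():
        if localname(unit.tag) != "trans-unit":
            continue
        for child in list(unit):
            if localname(child.tag) == "target":
                unit.remove(child)
    tree.write(out_path, encoding="utf-8", xml_declaration=True)


def source_to_target(in_path, out_path):
    """Step 2, after translation: rename the (now translated) <source> elements
    to <target>. Restoring the original English <source> still has to happen
    separately, e.g. by comparing against the untranslated export."""
    tree = ET.parse(in_path)
    for elem in tree.iter():
        if localname(elem.tag) == "source":
            elem.tag = "{%s}target" % XLIFF_NS
    tree.write(out_path, encoding="utf-8", xml_declaration=True)


strip_targets("wpml-export.xliff", "source-only.xliff")
# ... translate source-only.xliff with the custom XML filter and the TM ...
source_to_target("translated.xliff", "roundtripped.xliff")
```
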
In conclusion: Right now the only productive process I can figure out is the XML route for original content as long as <target> is populated with English source content. However, updated files become a cumbersome process.

I've tried to research the history behind XLIFF a little, but I can't really determine whether the Translation Memory workflow was ever considered in developing this standard. The bilingual nature of XLIFF may be very helpful to developers and certainly extends translation to a wider audience, but I feel that this standard could have a negative impact on the translation quality delivered by LSPs. XLIFF seems to take us away from having the Translation Memory (MT, TM, etc.) as the central database for all translation management. Nowhere else in translation management do we typically deal with bilingual files, because we manage translation and translation updates from the English source at all times.

I hope SDL can weigh in with best practices for working with XLIFF, and that translation professionals and LSPs will weigh in on the impact it has on your workflows. I'm looking particularly for ideas that either remedy the problems I indicated with XLIFF, or best practices on workarounds, including perhaps custom filters. I found one post that mentioned a filter that handles embedded content in XLIFF, but the link is down. Even if that is resolved, I don't foresee that the segmentation issue can realistically be resolved in bilingual files.

Many thanks! 

Jeroen Tetteroo

Language Solutions

St. Louis, MO

Using Studio 2017, 2014 Professional 

  • This is incorrect Jerzy.  It is acceptable to use CDATA in XLIFF:

    So then why is it not possible to have a parser applied for that in the XLIFF file type? I was assuming this was because CDATA shouldn't be there...


  • Unknown said:
    So then why is it not possible to have a parser applied for that in the XLIFF file type?

    Technical difficulties Jerzy.  It has nothing to do with anything other than technical challenges.  We have asked the filetype team for this for years and if it was easy we would certainly have done it by now!  I believe it's the bilingual nature of the filetype that causes the difficulties for us, but in reality where the problem comes from is irrelevant to me.  I know it's hard because when my team tried to do this by extending the XLIFF it did present some very complex problems and this is also why we did not finish this.  We probably would have got there but the lack of interest from anyone made it easier to do other things that people actually wanted.

    Paul Filkin | RWS Group


  • Thanks to both for the great conversation. I appreciate the workaround Paul and that may work in a pinch. However, I would still like to highlight that XLIFF also has a major problem with segmentation. I found out that the standard does allow segmentation using <mrk mtype="seg"> in the code. However, I take it that this would have to be supported by the platform that provides the XLIFF file. I have not seen this used in WordPress exports yet and I don't know how easy it is to provide files automatically with this type of segmentation. See: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_Segmentation


    I can speculate about other possible workarounds that do not involve having to provide an XLIFF file with the segmentation coding, but I'm not sure if SDL would even consider looking at that. I think you'd have to do something in the background where you process source and target, then perhaps clear the target and use a second processing step to segment the source text using the full-stop rule (or whatever segmentation is set in the Translation Memory settings). The bilingual nature of XLIFF just really complicates this process.
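
    Purely to illustrate what such a background step might look like, here is a minimal Python sketch that copies each <source> into a <seg-source> element with one <mrk mtype="seg"> per sentence, which, as far as I read the spec, is how the OASIS examples represent segmentation. The XLIFF 1.2 namespace and file names are assumptions, the splitter is just a naive full-stop rule, and it ignores inline tags and CDATA entirely, so it is nothing more than a proof of the concept.

    ```python
    # Naive pre-segmentation sketch: add a <seg-source> containing one
    # <mrk mtype="seg"> per sentence of <source>. Assumes XLIFF 1.2,
    # plain-text <source> content (no inline tags or CDATA), and a crude
    # full-stop rule instead of real segmentation rules.
    import re
    import xml.etree.ElementTree as ET

    XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
    ET.register_namespace("", XLIFF_NS)

    BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")  # naive full-stop rule


    def add_seg_source(trans_unit):
        source = trans_unit.find("{%s}source" % XLIFF_NS)
        if source is None or not (source.text or "").strip():
            return
        seg_source = ET.Element("{%s}seg-source" % XLIFF_NS)
        for i, sentence in enumerate(BOUNDARY.split(source.text.strip()), start=1):
            mrk = ET.SubElement(seg_source, "{%s}mrk" % XLIFF_NS,
                                {"mtype": "seg", "mid": str(i)})
            mrk.text = sentence
        # the spec expects <seg-source> to sit right after <source>
        trans_unit.insert(list(trans_unit).index(source) + 1, seg_source)


    tree = ET.parse("wpml-export.xliff")  # placeholder file name
    for unit in tree.iter("{%s}trans-unit" % XLIFF_NS):
        add_seg_source(unit)
    tree.write("pre-segmented.xliff", encoding="utf-8", xml_declaration=True)
    ```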


    Regarding CDATA: I agree with both that 1. CDATA support is part of the standard, but that 2. there do not seem to be any provisions for working with HTML embedded content. As far as I understand, CDATA typically is used because of the need to preserve coding in text. So it can be expected that if you find CDATA, you will find code. I did find some support for inline elements but the application seems limited: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Specs_Elem_Inline
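
    To make that concrete, here is a small made-up example of the kind of content we get from WordPress exports and what a plain XML parse hands back. The trans-unit below is invented, but it shows why an embedded-content (HTML) pass is still needed on top of the CDATA handling.

    ```python
    # Made-up WPML-style trans-unit: HTML and a shortcode wrapped in CDATA.
    # A plain XML parse returns that content as one flat string, which is
    # exactly why a second, HTML-aware pass (embedded content processing)
    # is still needed for tag protection and segmentation.
    import xml.etree.ElementTree as ET

    sample = """<trans-unit id="1" xmlns="urn:oasis:names:tc:xliff:document:1.2">
      <source><![CDATA[<p>Read our <strong>latest</strong> post.</p> [gallery id="5"]]]></source>
    </trans-unit>"""

    unit = ET.fromstring(sample)
    source = unit.find("{urn:oasis:names:tc:xliff:document:1.2}source")
    print(source.text)
    # -> <p>Read our <strong>latest</strong> post.</p> [gallery id="5"]
    ```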


    Both solutions provided by the OASIS standard seem cumbersome. I'm not at all convinced XLIFF was ever intended for the professional translation industry, or these sorts of considerations would have been taken into account. And given that many platforms already fail to adhere to the standard by producing messy XLIFF files, I wonder if we would ever get there.


    I like the idea of SDL considering work on a test filter again, and I'm happy to be a beta tester for it. I'm surprised that there is so little feedback from other users, because the problem should be so obvious to CAT tool users. However, maybe a motivating factor for moving forward with this is that we are working with the WPML team to get their WordPress plugin to work with independent LSPs like us. It's because of our work with them in testing their XLIFFs that the issue of the standard came up. I assume more LSPs will join WPML once they offer this to independent LSPs, and that may increase the need for proper Studio support.

    Thanks!

    Jeroen (Language Solutions team)

  • Unknown said:
    However, I would still like to highlight that XLIFF also has a major problem with segmentation. I found out that the standard does allow segmentation using <mrk mtype="seg"> in the code. However, I take it that this would have to be supported by the platform that provides the XLIFF file.

    Hi Jeroen,

    I forgot about that bit.  I think checking this option prior to processing the file would help?

    Paul Filkin | RWS Group


  • Thanks! Is that when the XLIFF has the segmentation information? I don't think this setting affected my file, from what I remember from testing. The file we tested does not have any of that <mrk mtype="seg"> data. I would have to check with the WPML team to see if they can set that up automatically. I "guess" we could do a find/replace action in the XLIFF file if we have to (searching for ". [A-Z]" or something like that and replacing with ".<mrk mtype="seg">").
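
    Just to sketch that find/replace idea a bit further (purely hypothetical): a replace that only inserts opening markers would leave the file invalid, so even a quick-and-dirty script would need to emit matching closing tags and mid attributes, roughly as below. Abbreviations, inline tags and CDATA would all still trip it up, and the OASIS examples actually put the segmented copy in <seg-source> rather than rewriting <source>, so this is only the crude version of the idea.

    ```python
    # Crude version of the ". [A-Z]" find/replace idea: wrap each sentence of
    # every <source> in matched <mrk mtype="seg" mid="N">...</mrk> pairs by
    # editing the raw file text. Assumes plain-text <source> content.
    import re
    from pathlib import Path

    BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")  # the ". [A-Z]" rule


    def mark_segments(match):
        sentences = BOUNDARY.split(match.group(1))
        return "<source>%s</source>" % "".join(
            '<mrk mtype="seg" mid="%d">%s</mrk>' % (i, s)
            for i, s in enumerate(sentences, start=1))


    raw = Path("wpml-export.xliff").read_text(encoding="utf-8")  # placeholder
    raw = re.sub(r"<source>(.*?)</source>", mark_segments, raw, flags=re.S)
    Path("marked.xliff").write_text(raw, encoding="utf-8")
    ```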

    I'll do some filter preview in Studio 2017 (love that feature!)
  • I have now read all the discussions around XLIFF from WPML and while I agree that it appears to be an issue of what WPML is outputting, it would be great if Studio had a filter that would handle it better. I see that I can enable embedded content processing, but the large segments are a big issue. We would really like to be able to process WPML files without a ton of tech workarounds. Would be happy to beta test any new filters. I have a WP sandbox and WPML Translation Hub stage set up currently. Thanks!
  • You really should "bark up the right tree" in the first place, i.e. complain to the WPML authors.
    Without enough people complaining to THEM, it's highly unlikely that anything will be changed.
    There are plenty of more important things to be done than working around someone else's screwups.

    Not to mention that the entire WPML workflow is pretty illogical - as mentioned in the original post, when the source text gets updated, WPML exports the NEW source text with the OLD target text... which is totally weird and just wrong (it's probably meant as some kind of "TM fuzzy-match suggestion" for the translator... i.e. it shows that the authors really had no idea about localization processes when designing it all).

  • As I said, I have read all the discussions on this topic in this forum, so I am well aware of your position, Evzen. It seems to me that you are the one barking up the wrong tree by venting your frustration on this forum whenever someone posts on this topic. I do not think this is helpful or productive, and it might keep folks from continuing to engage in the conversation. Paul has mentioned working with WPML, and I am raising the issue with them as well. That doesn't mean I will sit back and wait for them to fix it if there is also a possibility of working around it on the Studio side. As Paul mentioned, they have worked on a filter to address some of it, so why not continue that work at the same time? I will also try the XML route and may need some guidance on it from other forum users. Clearly, several have dealt with this and could work together to help each other out.

    Thanks,

  • Just a follow-up: I heard back very quickly from my contact at WPML. They are working on a solution and I will test it in the next few days. I will post more info at that time. Thanks to everyone before me who has pushed for a better solution, and who advised and tested with them.
  • My point was that without enough "barking up the WPML tree", and with people just doing guerrilla workarounds forever (not to mention even establishing some of these workarounds as standard in Studio!), there wouldn't be much (if any) movement on WPML's side at all.
    I dare say that the "barking" here has actually played a role in the latest developments at WPML...
  • Maybe. A little bark can go a long way, but too much bark can be counter-productive. Behind these usernames are real humans who might be doing their very best. Chances are, the one rep of theirs who commented earlier was scared away from our forum, which in turn may have kept them from notifying us proactively of any beta tests, which means I (and maybe others) wasted a bunch of time looking for other solutions. I was quite apprehensive about asking here myself, having seen previous interactions.

    Either way... As it turns out, they currently have an easy-to-use web-based tool in beta with some of their LSPs, which converts their XLIFFs to beautifully-segmented monolingual XLIFFs and then back to WPML-suitable XLIFFs. Once it's through beta, it will go into production as part of the process of downloading XLIFFs from Translation Hub. I just tested it successfully. It went so well that I'm wondering if I missed something ;)