How to fix very-large-segments in Studio in xliff projects from WordPress/WPML

Hi Team, 

I was wondering whether someone could help with a recent issue we're having. 

Essentially, we get website translation projects in the form of xliff files, generated from WordPress with WPML. There seems to be a consistent issue with such xliffs in Studio in that the translation projects are not being broken down into segments properly. The whole file becomes just a few segments, with lots of line spaces in between whole blocks of text.

The consequences of this are not nice: you can't use MT, you can't use your TMs, you can't even translate anything manually either, for fear of damaging the file so it won't fit back into the website. 

I don't want to make any unfair comparisons, but in the interest of research the following might be useful to pre-empt: I initially thought the segmentation error could be a WPML issue. But then I put the same xliff into one of the popular alternative CATs and hey - what was 3 segments in Studio 2017 (one big blob plus 2 one-liners at the bottom) turned out to be 94 proper segments in the other CAT! I suppose this makes WordPress / WPML innocent enough.

I admit I noticed some relevant information on this Forum regarding legacy Studio 2014, but that piece of advice is thankfully very old now - because it seemed to be a multi-stage workaround of incredible technical complexity. To be honest, I'd rather have a real-life fix which simply works, just like the other CAT does it. Surely, there must be some setting deep inside the newest Studio that I'm not noticing?

Please would anyone be able to advise the quickest way to correct that segmentation issue?

Many thanks indeed, 

Adam

  • WordPress is quite a bad format. What I do, if I need to work with it, is to adapt the file type, including the tag types, so that they rule the segmentation. From what I remember, WPML is an XLIFF full of CDATA... Last time I had that, I developed a customized file type to handle this as a "normal" XML file. This might be helpful.
    Or you just let the segmentation be done in the other popular CAT tool, then take the file from there and translate in Studio.

    _________________________________________________________

    When asking for help here, please be as accurate as possible. Please always remember to give the exact version of product used and all possible error messages received. The better you describe your problem, the better help you will get.

    Want to learn more about Trados Studio? Visit the Community Hub. Have a good idea to make Trados Studio better? Publish it here.

  • Hi Jerzy,

    Thank you for your very quick reply.

    1. What do you mean by saying that WPML makes bad xliffs? Yes, there are CDATA tags in there, but they are just marked out as any other tag and they don't seem to interfere with anything. Is there something WPML should work on?

    2. What do you mean by "handle this as a normal XML file" exactly? I apologise, but that might not be obvious enough for our level of expertise to solve my problem. Is that a functionality in Studio?

    3. I tried the final option you suggest and failed: open in the "other CAT" (nice segments) export out and open in Studio (no segments).

    So essentially, how do you think I could easily fix this problem in Studio?

    Many thanks indeed,

    Adam
  • Unknown said:
    There seems to be a consistent issue with such xliffs in Studio in that the translation projects are not being broken down into segments properly.

    This is NOT a Trados Studio problem! This is caused PURELY by TOTAL IGNORANCE of the WPML developer (or the entire WP squad)!

    What they call an XLIFF is in fact TOTAL CRAP which has - basically apart from the file extension - nothing in common with XLIFF. They just take the complete HTML content as-is and put it as a CDATA section into a SINGLE LAAAARGE translation unit.
    The IGNORANTS apparently did not get the fundamentals of XLIFF at all :(
    wpml.org/.../

    There is absolutely NOTHING Studio can do about that!
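
To check whether a given file has the structure described above, here is a minimal Python sketch. It assumes a simplified XLIFF 1.2 file; the sample below is made up for illustration and is not actual WPML output, and the function name inspect_units is hypothetical. It lists each trans-unit with its length and the number of HTML-like tags sitting inside the CDATA text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; a real export has more attributes and units.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="post-1" source-language="en" target-language="de" datatype="plaintext">
    <body>
      <trans-unit id="title">
        <source><![CDATA[Hello world]]></source>
        <target><![CDATA[Hello world]]></target>
      </trans-unit>
      <trans-unit id="body">
        <source><![CDATA[<h2>Welcome</h2>
<p>First sentence. Second sentence.</p>
<p>Another <strong>paragraph</strong>.</p>]]></source>
        <target><![CDATA[]]></target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}
MARKUP = re.compile(r"</?[A-Za-z][^<>]*>")  # rough pattern for HTML-like tags


def inspect_units(xliff_text):
    """Return (unit id, character count, HTML-like tag count) per trans-unit."""
    root = ET.fromstring(xliff_text)
    report = []
    for unit in root.iterfind(".//x:trans-unit", NS):
        source = unit.find("x:source", NS)
        # The parser hands back CDATA content as plain text, markup included.
        text = source.text or ""
        report.append((unit.get("id"), len(text), len(MARKUP.findall(text))))
    return report


for unit_id, chars, tags in inspect_units(SAMPLE):
    print(f"{unit_id}: {chars} characters, {tags} HTML-like tags in the CDATA text")
```

A unit with a large character count and many tags is the "one big blob" from the original question: as far as the XLIFF is concerned, it is a single piece of plain text.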

  • Hello Evzen,

    I very much appreciate your contribution. Indeed, I do observe exactly what you're describing: a single large translation unit which you can't do anything with.

    However, if you imagine yourself in my situation, I don't own WPML. I can't change how they produce xliffs.

    I only own an LSP that tries to serve some very decent clients, none of whom wish to entertain the various detailed IT problems. They just want their website professionally localised into their desired languages, preferably very quickly, and you really can't blame them.

    On the other hand, I very much wish to stay entirely loyal to the SDL Studio / GroupShare suite, as a business principle.

    Therefore, I'm a bit stuck here: it is confusing to me how my CAT platform of choice seems to be unable to support a straightforward website translation project (to quote my clients) which was taken from the world's most popular website management system (that's undisputed).

    While a competitive CAT tool seems to be able to simply crack on with it on default settings.

    It is perhaps the tragedy of my situation that I can't afford to appreciate the wonderful technicalities and the different flavours of this or other tag inside some computer file. Put bluntly, the idea behind Studio must be to perform translation projects that real-world clients wish to pay for. And what I'm observing is a situation that is... commercially unhelpful.

    Therefore, let me cry for help again: is everyone here absolutely sure nothing can be done to carry out WordPress / WPML projects with Studio 2017 (in terms of fixing the very-large-segment issue)?

    Kind regards,

    Adam
  • 1. and 2. From what I remember, WPML delivers tags in <> and in []. Studio can recognize only tags written in <>. For all content in [], one needs to define embedded content rules.
    So in the case of XLF, I told Studio to open it like XML and translate only the content of the "target" tags. Then I defined the corresponding embedded content rules.
    However, from the last CU on, Studio 2017 will allow you to process embedded content in XLF files, so this will then replace my process.

    3. If you preprocess the file in memoQ, you must use the option "Export bilingual" - "Plain XLF for other CAT tools". Then you must make sure your Studio has the file type for memoQ (not all versions had it out of the box).
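
The idea behind the embedded content rules in points 1 and 2 can be illustrated outside Studio. This is only a sketch in Python under assumptions: the two regular expressions and the shortcode names in the sample (vc_row, vc_column) are illustrative, and real rules are configured in the file type settings rather than written as code. It separates angle-bracket tags, square-bracket shortcodes, and translatable text.

```python
import re

# Assumed patterns, for illustration only.
TOKEN = re.compile(
    r"(?P<html></?[A-Za-z][^<>]*>)"              # tags such as <p> or </strong>
    r"|(?P<shortcode>\[/?[A-Za-z_][^\[\]]*\])"   # shortcodes such as [vc_row]
)


def classify(content):
    """Split content into ('html' | 'shortcode' | 'text', value) pieces."""
    pieces = []
    pos = 0
    for match in TOKEN.finditer(content):
        if match.start() > pos:
            pieces.append(("text", content[pos:match.start()]))
        pieces.append((match.lastgroup, match.group()))
        pos = match.end()
    if pos < len(content):
        pieces.append(("text", content[pos:]))
    return pieces


sample = (
    '[vc_row][vc_column width="1/2"]'
    "<p>Contact <strong>us</strong> today.</p>"
    "[/vc_column][/vc_row]"
)
for kind, value in classify(sample):
    print(f"{kind:9} {value}")
```

Only the pieces marked as text would be offered for translation; everything else would be protected as tags.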

  • Unknown said:
    I only own an LSP that tries to serve some very decent clients, none of whom wish to entertain the various detailed IT problems. They just want their website professionally localised into their desired languages, preferably very quickly, and you really can't blame them.

    Of course I can... because the ONLY way out of this is that major WP(ML) users PUSH VERY STRONG on the lame WP(ML) developers to fix the BAD functionality, or simply stop using their crappy products. Period.
    Ignorant product users are as bad as ignorant product developers, there is absolutely NO difference.

    Unknown said:
    it is confusing to me how my CAT platform of choice seems to be unable to support a straightforward website translation project (to quote my clients) which was taken from the world's most popular website management system (that's undisputed).

    The ONLY undisputed fact is that this "world's most popular system" is BADLY DESIGNED and produces CRAP, not proper XLIFFs. Period. And ignorant clients' claims WON'T change anything on that.

    Unknown said:
    While a competitive CAT tool seems to be able to simply crack on with it on default settings.

    I simply DON'T believe that. Crippled XLIFF is crippled XLIFF and CANNOT be correctly processed as XLIFF.
    Extracting the complete HTML content embedded inside the wannabe-XLIFF and re-processing, re-parsing and segmenting it from scratch is a completely different story. It is nothing other than a WORKAROUND and does NOT mean anything like "that tool can process WPML XLIFFs while the other tool can't" (no matter if ignorant users see it like this).

    Unknown said:
    It is perhaps the tragedy of my situation that I can't afford to appreciate the wonderful technicalities and the different flavours of this or other tag inside some computer file. Put bluntly, the idea behind Studio must be to perform translation projects that real-world clients wish to pay for. And what I'm observing is a situation that is... commercially unhelpful.

    This is just completely WRONG.
    If someone starts producing traffic lights which work in exactly the opposite way to the rest of the world (green for STOP, red for GO), would you ask drivers to adapt to this, or would you push the producer to fix their product (and not use the crippled traffic lights until they fix the problem)?!?!

    Unknown said:
    Therefore, let me cry for help again: is everyone here absolutely sure nothing can be done to carry out WordPress / WPML projects with Studio 2017 (in terms of fixing the very-large-segment issue)?

    As Jerzy wrote, you're on your own - create a customized file type which parses the HTML embedded inside the wannabe-XLIFF... just as I described above.
    If you don't like this, bark up the right tree, i.e. blame the WPML producer.

  • Hi,

    Always an emotive topic this one because on one hand it's very poor practice to put markup inside CDATA sections for XLIFF files as they are intended to be interpreted as plain text, and on the other many providers of XLIFF do exactly this. The cynical side of me would say this is because it's easier to just dump the content into a CDATA section than it is to properly consider how it should be handled for localization. When localizing of course you need to be able to interpret the markup so it can be protected, and this is very tricky. An additional problem we have is that XLIFF is a bilingual file and if the content is partially or fully translated then segmenting the content outside of the trans-unit has at least one complication and that is how does the computer know where to segment without understanding the language? Sometimes it might be fairly straightforward, but other times it is not. So segmentation is usually by trans-unit.

    I think Evzen is correct in terms of how other tools handle XLIFF. I spent this evening, coincidentally, setting up a multilingual WordPress blog and installing the WPML plugin just so I could do some testing myself. I then ran the files through other tools as well. memoQ, which you are referring to here, actually handles the XLIFF exactly the same as Studio by default. So I had to do three things to make it clear:

    1. apply a segmentation adjustment
    2. force a break on newline in the file to segment the content further
    3. run the regex tagger on all the markup afterwards

    I actually think I was lucky with this one because the CDATA might not have had any newlines in there and then this would not have been possible at all.
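
Steps 2 and 3 above can be sketched in Python to show what they do to the content. This is not how memoQ or Studio implement them; the function names and the {n} placeholder format are assumptions made for the illustration, and step 1 (the segmentation adjustment) is not covered.

```python
import re

TAG = re.compile(r"</?[A-Za-z][^<>]*>")  # rough pattern for HTML-like markup


def segment_on_newlines(cdata_text):
    """Step 2: force a break on every newline, dropping empty lines."""
    return [line.strip() for line in cdata_text.splitlines() if line.strip()]


def tag_markup(segment):
    """Step 3: swap each tag for a numbered placeholder; return text and tags."""
    tags = []

    def protect(match):
        tags.append(match.group())
        return f"{{{len(tags)}}}"

    return TAG.sub(protect, segment), tags


blob = "<h2>Welcome</h2>\n\n<p>We build <em>websites</em>.</p>\n<p>Contact us.</p>"
for seg in segment_on_newlines(blob):
    text, tags = tag_markup(seg)
    print(text, tags)
```

If the CDATA content contains no newlines at all, segment_on_newlines returns the whole blob as one segment, which is exactly the unlucky case mentioned above.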

    In Studio you can tag all the content afterwards using the Clean Up tasks app which is sort of like regex tagger, but you can't get around the segmentation. The problem here is that we work on an XLIFF file directly (sdlxliff) and not in a database, so resegmentation is tricky.

    In the end I have duly contacted some people at WPML just to see whether they have any plans to do anything to make this easier for translation tool vendors. I was amazed at how many translation-related plugins there are for WordPress, so it surely makes sense for them to try and help improve the process a little. Maybe they can address this in XLIFF 2.0 if they intend to support this format.

    Regards

    Paul

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • Unknown said:
    ...and on the other many providers of XLIFF do exactly this.

    The thing is that I have not seen a SINGLE one doing that for a particular reason... they ALL do it simply because they don't have the faintest idea what XLIFF is and how content should be placed in XLIFF properly.

    Unknown said:
    In the end I have duly contacted some people at WPML just to see whether they have any plans to do anything to make this easier for translation tool vendors. I was amazed at how many translation related plugins there are for Wordpress so it surely makes sense for them to try and help improve the process a little. Maybe they can address this in XLIFF 2.0 if they intend to support this format.

    Haha, that was pretty naive actually.
    They really have NO IDEA about XLIFF, they even admitted it in some forum answer to a localization-knowledgeable person asking them for a fix of that ridiculousity... So XLIFF 2.0 is pure sci-fi... And I hope that they will NEVER EVER even give it a try... because it's guaranteed to be a total fu*kup.

  • Unknown said:
    Haha, that was pretty naive actually.
    They really have NO IDEA about XLIFF, they even admitted it in some forum answer to a localization-knowledgeable person asking them for a fix of that ridiculousity... So XLIFF 2.0 is pure sci-fi...

    I live in hope ;-)

    Paul Filkin | RWS Group

  • Hi,

    I'm one of the developers at OnTheGoSystems, the creators of WPML.

    We decided to deliver this basic solution (dumping the content into CDATA blocks without any processing) to help our clients extract content from their websites and avoid a lot of copy/paste work. We are aware that the XLIFF files we generate are not perfect (but they are still valid according to the XLIFF 1.2 specification). We have been looking into ways to improve this for some time, but still do not have anything that would guarantee that we can recreate the translated page (based on the XLIFF with previously extracted and segmented content) without any issues. We do not want to release anything that would generate even more problems for our clients and translators.

    We are really interested in making our clients' and translators' lives easier. As mentioned by Paul, the problem is that it is a very tricky task to extract only the text from the content, generate sentences (segmentation), generate trans-units and put it all back together when the translation is ready. This is the result of how WordPress itself is designed and how data is stored in the WordPress database (content mixed with HTML tags). We cannot forget that this should work not only for English but also for all other languages (different segmentation rules).
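
To give a feeling for the round trip described above (extract the text, translate it, put the page back together), here is a deliberately naive Python sketch. It is not WPML code; the class and function names are hypothetical, and it ignores HTML comments, does not preserve character entities or tag-name casing, and does no sentence segmentation.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text runs as units and keep a skeleton of everything else."""

    def __init__(self):
        super().__init__()
        self.skeleton = []  # markup strings, or int indexes into self.units
        self.units = []     # translatable text runs

    def handle_starttag(self, tag, attrs):
        self.skeleton.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept exactly as written.
        self.skeleton.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.skeleton.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.units.append(data)
            self.skeleton.append(len(self.units) - 1)
        else:
            self.skeleton.append(data)  # whitespace only: nothing to translate


def extract(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.skeleton, parser.units


def rebuild(skeleton, translations):
    """Put translated units back into the skeleton."""
    return "".join(translations[p] if isinstance(p, int) else p for p in skeleton)


page = "<p>Hello <strong>world</strong>!</p>\n<p>Bye.</p>"
skeleton, units = extract(page)
print(units)
print(rebuild(skeleton, [u.upper() for u in units]))
```

Even in this tiny example the first sentence is cut into three units because of the inline <strong> tag. Keeping such a sentence together, in every language, while still being able to rebuild the page is the hard part.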

    The WPML 3.6 release brought a lot of improvements, including better support for content generated by Page Builders. Thanks to that, you won't see a lot of shortcodes ([] tags) in your XLIFFs anymore.

    I'm pretty sure that we will eventually also find a way to generate XLIFF files that will make translators happy.

    I will keep you posted :)

    Pawel