How to fix very-large-segments in Studio in xliff projects from WordPress/WPML

Question

Hi Team, 
 I was wondering whether someone could help with a recent issue we're having. 
 Essentially, we get website translation projects in the form of xliff files, generated from WordPress with WPML. There seems to be a consistent issue with such xliffs in Studio: the files are not being broken down into segments properly. The whole file becomes just a few segments, with lots of blank lines between whole blocks of text. 
 The consequences of this are not nice: you can't use MT, you can't use your TMs, you can't even translate anything manually either, for fear of damaging the file so it won't fit back into the website. 
 I don't want to make any unfair comparisons, but in the interest of research the following might also be useful to pre-empt: I initially thought the segmentation error could be a WPML issue. But then I put the same xliff into one of the popular alternative CATs and hey - what was 3 segments in Studio 2017 (one big blob plus 2 one-liners at the bottom) turned out to be 94 proper segments in the other CAT! I suppose this makes WordPress / WPML innocent enough. 
 I admit I noticed some relevant information on this Forum regarding legacy Studio 2014, but that piece of advice is thankfully now very old - because it seemed to be a multi-stage workaround of incredible technical complexity. To be honest I'd rather have a real-life fix which simply works, just like the other CAT does it. Surely, there must be some setting deep inside the newest Studio that I'm not noticing? 
 Please would anyone be able to advise the quickest way to correct that segmentation issue? 
 Many thanks indeed, 
 Adam

Paul · Answer

Hi,
 
Always an emotive topic, this one, because on the one hand it's very poor practice to put markup inside CDATA sections in XLIFF files, as they are intended to be interpreted as plain text, and on the other hand many providers of XLIFF do exactly this. The cynical side of me would say this is because it's easier to just dump the content into a CDATA section than it is to properly consider how it should be handled for localization. When localizing, of course, you need to be able to interpret the markup so it can be protected, and this is very tricky. An additional problem we have is that XLIFF is a bilingual file, and if the content is partially or fully translated then segmenting the content inside the trans-unit has at least one complication: how does the computer know where to segment without understanding the language? Sometimes it might be fairly straightforward, but other times it is not. So segmentation is usually by trans-unit. 
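
To illustrate the point about CDATA being read as plain text, here is a minimal Python sketch. The sample XLIFF is invented for illustration (it is not taken from a real WPML export), and `source_units` is a hypothetical helper name. It only shows that a standard XML parser hands back the markup as one literal string per trans-unit, with no child elements a tool could protect as tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical trans-unit, invented for illustration only.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="post-1" source-language="en" target-language="de" datatype="plaintext">
    <body>
      <trans-unit id="body">
        <source><![CDATA[<h2>Welcome</h2>
<p>First sentence. Second sentence.</p>]]></source>
      </trans-unit>
    </body>
  </file>
</xliff>"""

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def source_units(xliff_text):
    """Return (id, source text, number of child elements) per trans-unit."""
    root = ET.fromstring(xliff_text)
    units = []
    for tu in root.iterfind(".//x:trans-unit", NS):
        src = tu.find("x:source", NS)
        # The CDATA content arrives as plain text; the parser sees no inline tags.
        units.append((tu.get("id"), src.text or "", len(list(src))))
    return units


units = source_units(SAMPLE)
for unit_id, text, child_count in units:
    print(unit_id, child_count, repr(text))
```

The whole post body comes back as a single string inside a single trans-unit, which is why a tool that segments by trans-unit shows it as one large segment.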
 
I think Evzen is correct in terms of how other tools handle XLIFF. I spent this evening, coincidentally, setting up a multilingual WordPress blog and installing the WPML plugin just so I could do some testing myself. I then ran the files through other tools as well. memoQ, which you are referring to here, actually handles the XLIFF exactly the same as Studio by default. So I had to do three things to make it clear: 
 
1. apply a segmentation adjustment 
2. force a break on newline in the file to segment the content further 
3. run the regex tagger on all the markup afterwards 
 
I actually think I was lucky with this one because the CDATA might not have had any newlines in there and then this would not have been possible at all. 
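
Steps 2 and 3 above can be sketched roughly in Python. This is not how memoQ or Studio implement them; the function names, the regular expression and the sample text are all assumptions made for illustration. It splits a blob on newlines and then marks anything that looks like an HTML tag so it could be protected.

```python
import re

# Assumed pattern: anything that looks like an opening or closing HTML tag.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def split_on_newlines(text):
    """Step 2: force a break on every newline, dropping empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def tag_markup(segment):
    """Step 3: split a segment into ("tag", ...) and ("text", ...) parts."""
    parts = []
    pos = 0
    for match in TAG_RE.finditer(segment):
        if match.start() > pos:
            parts.append(("text", segment[pos:match.start()]))
        parts.append(("tag", match.group()))
        pos = match.end()
    if pos < len(segment):
        parts.append(("text", segment[pos:]))
    return parts


# Hypothetical CDATA content, invented for illustration.
blob = "<h2>Welcome</h2>\n\n<p>First sentence.</p>\n<p>Second <b>bold</b> sentence.</p>"
segments = split_on_newlines(blob)
tagged = [tag_markup(segment) for segment in segments]

for parts in tagged:
    print(parts)
```

If the same content arrives without any newlines, `split_on_newlines` returns it as a single segment, which is the case where this approach would not have helped.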
 
In Studio you can tag all the content afterwards using the Clean Up tasks app which is sort of like regex tagger, but you can't get around the segmentation. The problem here is that we work on an XLIFF file directly (sdlxliff) and not in a database, so resegmentation is tricky. 
 
In the end I have duly contacted some people at WPML just to see whether they have any plans to do anything to make this easier for translation tool vendors. I was amazed at how many translation-related plugins there are for WordPress, so it surely makes sense for them to try and help improve the process a little. Maybe they can address this in XLIFF 2.0 if they intend to support this format. 
 
Regards 
 
Paul

Uta Moncur · Answer

Good news! WPML currently has an easy-to-use web-based tool in beta with some of their LSPs, which converts their XLIFFs to beautifully-segmented monolingual XLIFFs and then back to WPML-suitable XLIFFs. Once it's through beta, it will go into production as part of the process of downloading XLIFFs from Translation Hub. I just tested it successfully. It went so well that I'm wondering if I missed something ;) So, if you were struggling with the WPML segmenting, contact your WPML rep and ask for the segmentation/merge app.
