How to fix very large segments in Studio in XLIFF projects from WordPress/WPML

Hi Team, 

I was wondering whether someone could help with a recent issue we're having. 

Essentially, we receive website translation projects as xliff files generated from WordPress with WPML. There seems to be a consistent issue with such xliffs in Studio: the files are not broken down into segments properly. The whole file becomes just a few segments, with lots of blank lines between whole blocks of text.

The consequences of this are not nice: you can't use MT, you can't use your TMs, and you can't even translate anything manually, for fear of damaging the file so that it won't fit back into the website.

I don't want to make any unfair comparisons, but in the interest of research the following might be useful to pre-empt: I initially thought the segmentation error could be a WPML issue. But then I put the same xliff into one of the popular alternative CATs and hey, what was 3 segments in Studio 2017 (one big blob plus 2 one-liners at the bottom) turned out to be 94 proper segments in the other CAT! I suppose this makes WordPress / WPML innocent enough.

I admit I noticed some relevant information on this Forum regarding legacy Studio 2014, but that piece of advice is thankfully now very old, because it seemed to be a multi-stage workaround of incredible technical complexity. To be honest, I'd rather have a real-life fix that simply works, just like the other CAT does. Surely there must be some setting deep inside the newest Studio that I'm not noticing?

Would anyone be able to advise on the quickest way to correct this segmentation issue?

Many thanks indeed, 

Adam

  • Hi,

    I'm one of the developers at OnTheGoSystems, the creators of WPML.

    We decided to deliver this basic solution (dumping the content into CDATA blocks without any processing) to help our clients extract content from their websites and avoid a lot of copy/paste work. We are aware that the XLIFF files we generate are not perfect (though they are still valid according to the XLIFF 1.2 specification). We have been looking into ways to improve this for some time, but we still do not have anything that would guarantee we can recreate the translated page (from an XLIFF with previously extracted and segmented content) without any issues. We do not want to release anything that would create even more problems for our clients and translators.

    We are really interested in making our clients' and translators' lives easier. As Paul mentioned, the problem is that it is very tricky to extract only the text from the content, split it into sentences (segmentation), generate trans-units, and put everything back together once the translation is ready. This is a result of how WordPress itself is designed and how data is stored in the WordPress database (content mixed with HTML tags). We cannot forget that this has to work not only for English but for all other languages too (with different segmentation rules).

    The WPML 3.6 release brought a lot of improvements, including better support for content generated by Page Builders. Thanks to that, you won't see a lot of shortcodes ([] tags) in your XLIFFs anymore.

    I'm pretty sure that we will eventually also find a way to generate XLIFF files that will make translators happy.

    I will keep you posted :)

    Pawel
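
For readers hitting the same problem, the trade-off Pawel describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WPML's or Studio's actual processing: the sample XLIFF is made up to match the general shape described above (HTML dumped into a CDATA block inside one trans-unit), the helper names `BlockTextExtractor` and `segments_from_xliff` are hypothetical, and the sentence splitter is a naive English-only regex. It shows how one large trans-unit can be broken into smaller segments, and also why this is lossy: inline tags such as `<b>` are discarded, so the page cannot be rebuilt from these segments alone.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample, NOT real WPML output: one trans-unit holding raw HTML in CDATA.
SAMPLE_XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="post-1" source-language="en" target-language="de" datatype="plaintext">
    <body>
      <trans-unit id="body">
        <source><![CDATA[<h2>Welcome</h2>

<p>We sell tea. Orders ship <b>daily</b>.</p>

<p>Contact us today!</p>]]></source>
      </trans-unit>
    </body>
  </file>
</xliff>"""

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}
# Assumption: these tags mark block boundaries; real content may need more.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockTextExtractor(HTMLParser):
    """Collects the text of each block-level element, dropping all markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Collapse whitespace and keep the block only if it has visible text.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def segments_from_xliff(xliff_text):
    """Return sentence-level segments found in every <source> element."""
    root = ET.fromstring(xliff_text)
    segments = []
    for source in root.iterfind(".//x:trans-unit/x:source", NS):
        parser = BlockTextExtractor()
        parser.feed(source.text or "")  # CDATA content arrives as plain text
        parser.close()
        for block in parser.blocks:
            # Naive English-only rule: split after ., ! or ? followed by whitespace.
            segments.extend(s for s in re.split(r"(?<=[.!?])\s+", block) if s)
    return segments


for segment in segments_from_xliff(SAMPLE_XLIFF):
    print(segment)
```

Going from segments back to a valid page is the hard part: a real round trip would have to preserve the inline tags and the block structure as placeholders, and use segmentation rules suited to each source language.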
