Fixing Segmentation in TM

Hello team,

 

I have a TM with very large segments, and I would like to split them (fix the segmentation) so that I can get fuzzy matches against the new version of the files.

The files are XLIFF files exported from WPML, which I received from the client.

We created the project in SDL Studio, which produced very large segments (even though the text contained punctuation). We have since fixed the issue and now have correct segmentation, but the TM still contains the translations of the large segments.

 

It would be great if you could help us ASAP.

 

Best Regards

Fouad

  • Oh dear... One problem might have been that the regex I gave you didn't work. I'm sorry. :(

    I just sent you a new copy of the file. This is what I did:

    I first searched for <.*?> and replaced it with $&[$SPLIT$] to add split markers after all the tags. For some reason Olifant won't split next to tags, though. However, the tags don't look like they should go into translatable segments at all, so I stripped the tags from the file completely and performed the split afterwards.

    Then I simply split after periods as mentioned above, and removed any stray split markers that were left because of a mismatch of periods in the source and target segments.

    I didn't touch the colons though, as different punctuation seemed to be used in EN and IT (e.g. dashes in EN were replaced with colons in IT), and I was worried I might break more than I fix.
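    The period-based split described above can be sketched outside Olifant as well. The following is only a minimal illustration, not what Olifant does internally: it assumes a sentence ends at a period followed by whitespace, and, unlike the marker clean-up described above, it simply leaves a pair unsplit when source and target disagree on the number of periods. The function name `split_pair` is made up for this example.

```python
import re

# Assumption: a sentence boundary is a period followed by whitespace.
# Abbreviations such as "e.g. " would be split too, so spot checks are still needed.
SENTENCE_END = re.compile(r"(?<=\.)\s+")


def split_pair(source, target):
    """Split a source/target pair into sentence pairs when the counts match."""
    src_parts = [p for p in SENTENCE_END.split(source.strip()) if p]
    tgt_parts = [p for p in SENTENCE_END.split(target.strip()) if p]
    if len(src_parts) != len(tgt_parts):
        # Mismatched punctuation: keep the original large segment intact
        # rather than risk pairing the wrong sentences.
        return [(source, target)]
    return list(zip(src_parts, tgt_parts))


if __name__ == "__main__":
    for pair in split_pair("Open the file. Save it.", "Apri il file. Salvalo."):
        print(pair)
```

    Keeping mismatched pairs whole is the cautious choice here; a penalty on the resulting TM is still a sensible safeguard.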

    I did a couple of spot checks, and overall the new file looks OK. Still, what I did was purely mechanical, so it might be advisable to apply a penalty to the TM just in case some segments are mismatched.

    Let me know if you get better results with the new TM.

    Ta

    Stephan

  • Hi,

    It will be good to know if this useful response from Stephan works... but in the meantime I'll add another that might be useful. There is an application on the OpenExchange called SDLTmConvert which can convert an SDLTM to a CSV file. So if your efforts with Olifant are unsuccessful, you could use this to create two CSV files, one with the source only and one with the target only. Then you align the two files in Studio, and this time the paragraphs with multiple sentences will be segmented correctly, allowing you to create a new SDLTM to use.

    The drawback of this might be how it handles tags because I have no idea how complex the information in your TM is, but if you are happy with a text only TM then this might be a simple way to tackle the problem.
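    Separating the exported CSV into a source-only and a target-only list could look roughly like the sketch below. The column names `source` and `target` are assumptions made for this example; the actual layout of the SDLTmConvert export may differ, so check the header of your file first.

```python
import csv
import io


def split_bilingual_csv(csv_text, source_col="source", target_col="target"):
    """Return (sources, targets) lists from bilingual CSV text.

    The column names are hypothetical; adjust them to the real export header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    sources, targets = [], []
    for row in reader:
        sources.append(row[source_col])
        targets.append(row[target_col])
    return sources, targets


if __name__ == "__main__":
    sample = (
        "source,target\n"
        '"Open the file. Save it.","Apri il file. Salvalo."\n'
        "Hello,Ciao\n"
    )
    src, tgt = split_bilingual_csv(sample)
    print(src)
    print(tgt)
```

    Each list can then be written to its own file, one paragraph per line, and the two files aligned in Studio as described above.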

    I like the Olifant solution though... it's a good tool if you're careful.

    Paul Filkin | RWS Group


  • Hi Stephan,

    Can you please let me know how you removed all the tags from the TMX? You mentioned that you removed all tags from the file, as they should not be in the TM.

    The analysis for Italian improved, which was helpful. I need to do the same for the other languages I have.

    Thanks

    Fouad

  • Hi Fouad,

    Sure, that was simply a matter of finding all tags and replacing them with nothing.

    So just find <.*?> (check "Use regular expressions"), leave the "Replace by" field empty and click on "Replace all".

    Just be sure to add the split markers first before you remove the tags as they may be the only indication of where the markers should go.

    That's also how I removed the leftover split markers.
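    For anyone who prefers to script this, the order of operations (markers first, then tags, then leftover markers) can be sketched as follows. This is only an illustration applied to the text of a single segment, not to a whole TMX file, where the same pattern would also match the XML structure itself. Olifant's `$&` stands for the whole match; in Python a replacement function does the same job. The function names are made up for this example.

```python
import re

TAG = re.compile(r"<.*?>")   # the same non-greedy pattern used in Olifant
MARKER = "[$SPLIT$]"


def mark_then_strip(segment):
    """Add a split marker after every tag, then remove the tags."""
    # Step 1: equivalent of replacing <.*?> with $&[$SPLIT$]
    marked = TAG.sub(lambda m: m.group(0) + MARKER, segment)
    # Step 2: equivalent of replacing <.*?> with nothing
    return TAG.sub("", marked)


def remove_leftover_markers(segment):
    """Drop any split markers that remain after splitting."""
    return segment.replace(MARKER, "")


if __name__ == "__main__":
    text = "<b>Title</b>Body text."
    marked = mark_then_strip(text)
    print(marked)
    print(remove_leftover_markers(marked))
```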

    Glad that worked for you. I'm curious now about TM Convert, too. I'm sure there are plenty of ways to go about this.

    Stephan

  • Unfortunately, it didn't work.

    The effort required with the new segmentation was acceptable, so we had linguists adjust the translations.

    Now I can see a new issue.

    Since I created the settings for this kind of XLIFF in SDL Studio 2014, as per your guidance, linguists who have SDL Studio 2011 or lower have problems opening the package.

    Is that to be expected even though I created the package for SDL Studio 2009?

    Fouad