Fixing Segmentation in TM

Hello team,

I have a TM that contains very large segments, and I would like to split them (fix the segmentation) so that I can get fuzzy matches on the new version of the files.

The files are XLIFF files exported from WPML, which I received from the client.

We created the project in SDL Studio, which produced very large segments (although there was punctuation). We have since fixed the issue and now have perfect segmentation, but the TM still contains the translations of the large segments.

It would be great if you can help us ASAP.

Best Regards

Fouad

0 Stephan Gasteyer over 10 years ago

Hi Fouad,

I usually use Olifant to clean up TMs. If you can identify a specific break character or pattern you could find and replace this pattern, adding the string [$SPLIT$] where you want the split to occur. See:

okapi.sourceforge.net/.../mnu_entries.htm

Stephan

0 Yael Reinhold over 10 years ago in reply to Stephan Gasteyer

Hi Stephan,

Basically, the segments should be split according to the default Studio segmentation rules.

Can this be done with a file type setting that I've created?

Best Regards

Fouad

0 Stephan Gasteyer over 10 years ago in reply to Yael Reinhold

There's no way to perform this kind of operation directly in Studio, so I'm afraid it's not as simple as applying a file type to your TM (though that would be a neat feature for future versions of Studio, cause I have a similar cleanup job coming up myself ;)).

Assuming you do use Olifant to split the segments, and you just want to apply basic segmentation rules, it could be as simple as splitting at periods or semicolons.

So you'd open your TM in Olifant and use the following search/replace patterns:

Find: ". "

Replace with: ". [$SPLIT$]"

When you have done that, select all segments and use the split command. Then import the clean TMX into a new SDLTM.
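
As a rough illustration of what this find/replace-then-split step does to a single translation unit, here is a minimal Python sketch. It is not Olifant's implementation: the function names `mark` and `split_unit` and the sample sentences are made up for the example, and the fallback for mismatched sentence counts is an assumption about what a safe split should do.

```python
MARKER = "[$SPLIT$]"

def mark(text: str) -> str:
    """Mimic the find/replace: put a split marker after every '. '."""
    return text.replace(". ", ". " + MARKER)

def split_unit(source: str, target: str) -> list[tuple[str, str]]:
    """Split a marked source/target pair into smaller aligned pairs."""
    src_parts = [p.strip() for p in mark(source).split(MARKER)]
    tgt_parts = [p.strip() for p in mark(target).split(MARKER)]
    if len(src_parts) != len(tgt_parts):
        # Sentence counts differ: keep the unit whole rather than misalign it.
        return [(source, target)]
    return list(zip(src_parts, tgt_parts))

pairs = split_unit(
    "Open the file. Save your changes. Close the editor.",
    "Apri il file. Salva le modifiche. Chiudi l'editor.",
)
for src, tgt in pairs:
    print(src, "|", tgt)
```

Units where source and target contain a different number of periods are left unsplit here, which is why some stray markers may need cleaning up afterwards.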

If you need to, say, split after all tags in a TM, you'd have to use regex:

Find: \<.*\>

Replace with: $& [$SPLIT$]

Make sure that "use regular expressions" is checked when you do this.

There are some helpful examples of how to use regex in Olifant here: http://okapi.sourceforge.net/Release/Shared/Help/regex.htm

0 Yael Reinhold over 10 years ago in reply to Stephan Gasteyer

Hi Stephan,

This doesn't work.

Can I send you the TMX by email?

Best Regards

Fouad

0 Stephan Gasteyer over 10 years ago in reply to Yael Reinhold

Hi Fouad,

I can give it a shot, sure. I can't guarantee that I'll get to it straight away, though. I'll have a look at it, but if it's not straightforward I may have to deal with it later.

You can send it to stephan dot gasteyer at gmail dot com. Also, don't forget to let me know exactly where you want the break to occur.

Cheers

Stephan

0 Yael Reinhold over 10 years ago in reply to Stephan Gasteyer

Thanks

Just sent!

BR

Fouad

0 Stephan Gasteyer over 10 years ago in reply to Yael Reinhold

Oh dear... One problem might have been that the regex I gave you didn't work. I'm sorry... :(

I just sent you a new copy of the file. This is what I did:

I first searched for <.*?> and replaced it with $&[$SPLIT$] to add split markers after all the tags. For some reason Olifant won't split next to tags, though. However, the tags don't look like they should go into translatable segments at all, so I stripped the tags from the file completely and performed the split afterwards.
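
One plausible reason the first pattern misbehaved is greediness: `.*` runs on to the last `>` in the segment, while `.*?` stops at the first one. The Python sketch below only illustrates that difference. It uses Python's `re` module rather than Olifant's regex engine, so the whole match is written `\g<0>` instead of `$&`, and the sample string is invented.

```python
import re

MARKER = "[$SPLIT$]"
text = "<b>Title</b> Some text <br/> More text"

# Greedy: '.*' runs from the first '<' to the last '>', so only one match is found.
greedy = re.sub(r"<.*>", r"\g<0>" + MARKER, text)

# Lazy: '.*?' stops at the first '>', so every tag gets its own marker.
lazy = re.sub(r"<.*?>", r"\g<0>" + MARKER, text)

print(greedy)
print(lazy)
```

With the greedy pattern the sample gets a single marker after the final tag; with the lazy pattern each of the three tags is followed by one.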

Then I simply split after periods as mentioned above, and removed any stray split markers that were left because of a mismatch of periods in the source and target segments.

I didn't touch the colons though, as different punctuation seemed to be used in EN and IT (e.g. dashes in EN were replaced with colons in IT), and I was worried I might break more than I fix.

I did a couple of spot checks, and overall the new file looks OK. Still, what I did was purely mechanical, so it might be advisable to apply a penalty to the TM just in case some segments are mismatched.

Let me know if you get better results with the new TM.

Ta

Stephan

0 Paul over 10 years ago in reply to Stephan Gasteyer

Hi,

It would be good to know if this useful response from Stephan works... but in the meantime I'll add another that might be useful. There is an application on the OpenExchange called SDLTmConvert which can convert an SDLTM to a CSV file. So if your efforts with Olifant are unsuccessful, you could use this to create two CSV files, one with the source only and one with the target only. Then you align the two files in Studio, and this time the paragraphs with multiple sentences will be segmented correctly, allowing you to create a new SDLTM to use.

The drawback of this might be how it handles tags because I have no idea how complex the information in your TM is, but if you are happy with a text only TM then this might be a simple way to tackle the problem.
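
A minimal Python sketch of the "two files" step, assuming the converted CSV has one translation unit per row with the source text in the first column and the target in the second. The actual layout produced by SDLTmConvert is not described in this thread, so the column positions, the sample data, and the function name `split_columns` are all assumptions.

```python
import csv
import io

def split_columns(csv_text: str) -> tuple[list[str], list[str]]:
    """Return (sources, targets) from a two-column CSV string."""
    sources, targets = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        sources.append(row[0])
        targets.append(row[1])
    return sources, targets

sample = (
    '"Open the file. Save your changes.","Apri il file. Salva le modifiche."\n'
    '"Close the editor.","Chiudi l\'editor."\n'
)
sources, targets = split_columns(sample)
print("\n".join(sources))
print("\n".join(targets))
```

Each list could then be written to its own file, one unit per line, and the two files aligned in Studio so that the default segmentation rules are applied to both sides.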

I like the Olifant solution though... it's a good tool if you're careful.

Paul Filkin | RWS Group

0 Yael Reinhold over 10 years ago in reply to Stephan Gasteyer

Hi Stephan,

Can you please let me know how you removed all the tags from the TMX? You mentioned that you removed them from the file because they should not be in the TM.

The analysis for Italian improved, which was helpful. I need to do the same for the other languages I have.

Thanks

Fouad

0 Stephan Gasteyer over 10 years ago in reply to Yael Reinhold

Hi Fouad,

Sure, that was simply a matter of finding all tags and replacing them with nothing.

So just find <.*?> (check "Use regular expressions"), leave the "Replace by" field empty and click on "Replace all".

Just be sure to add the split markers first before you remove the tags as they may be the only indication of where the markers should go.

That's also how I removed the leftover split markers.
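
The order of operations described above (markers first, then tag removal, then clean-up of leftover markers) can be sketched in Python as follows. This is an illustration only: the function names and sample text are hypothetical, and Python's `re` stands in for Olifant's search dialog. One detail worth noting is that, in a regex search, the marker's brackets and dollar signs are special characters, so the sketch escapes them.

```python
import re

MARKER = "[$SPLIT$]"
TAG = re.compile(r"<.*?>")

def mark_then_strip(text: str) -> str:
    """Add a marker after each tag, then delete the tags themselves."""
    marked = TAG.sub(lambda m: m.group(0) + MARKER, text)
    return TAG.sub("", marked)

def strip_markers(text: str) -> str:
    """Remove leftover markers; the marker is escaped so it matches literally."""
    return re.sub(re.escape(MARKER), "", text)

sample = "First heading<br/>Second heading<br/>Body text"
marked = mark_then_strip(sample)
print(marked)
print(strip_markers(marked))
```

If the tags were stripped before the markers were added, nothing would remain in this sample to show where the splits belong.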

Glad that worked for you. I'm curious now about TM Convert, too. I'm sure there are plenty of ways to go about this.

Stephan

0 Yael Reinhold over 10 years ago in reply to Stephan Gasteyer

Unfortunately, it didn't work.

The effort required with the new segmentation was acceptable, so we had linguists adjust the translations.

Now I can see a new issue.

Because I created the file type setting for this kind of XLIFF in SDL Studio 2014, per your guidance, linguists who have SDL Studio 2011 or earlier have problems opening the package.

Is that expected, even though I create the package for SDL Studio 2009?

Fouad