TM Segmentation

Question

Hi,

I would like to segment the translation memory so that it never breaks at an abbreviation but does break at the end of each sentence. The source file is in Excel format; some cells contain multiple sentences and most cells have no closing period. (This is not an omission: the text consists of in-house labels, which should not end with a period but often contain abbreviations that must not be split.)

Taken individually, each rule works fine in a regex tester (.NET flavor), but when used to segment the TM they do not produce the expected result. It seems that the rules don't interact properly. Could you explain how the segmentation works? Is there a priority in the application of the rules? Is the parser POSIX-type? NFA? DFA?

Also, I would need to know: if
and
are already tagged in the file (as inline tags in embedded content), do they 'disappear' in the eyes of the regex? If not, how should these tags be represented in the regex literal?

Finally, before answering, please note:

Our files are in Excel format most of the time, but we also process PDF and DOC occasionally.
An abbreviation list would not be a solution, since the abbreviations we are dealing with are unconventional and numerous.

Here is the sample text (from Excel), followed by the regexes used in the language resource segmentation.

# Upcoming Site Maintenance {0} # To accommodate core upgrades to the Service, several parts of the website will be inaccessible from Jan until Feb. Be prepared for some of your normal functions to be offline and plan to complete essential time-sensitive activities such as report submissions before this period begins.

Far East Bench Mark Classification (iBoxxADBIBM) Liq. High Yield Classification (iBoxxUSDLHY) iBoxx USD Liq. High Yield Markit iBoxx USD Liq. Invst. Grade Classification (iBoxxUSDLIG) iBoxx USD Liq. Invst. Grade

Source: RealClearPolitics Number of electoral votes based on latest polls. 270 votes are required to win the election.

The segmentation rules are as follows:

Full stop
  Before break: \w\.
  After break: $| |[ ]

Lower case exception
  Before break: \w+\.+
  After break: [ ][a-z]

1st abbreviation exception
  Before break: \w+\.+
  After break: \.(([ ][A-Z]\w+\.?)([ ][A-Z]\w+\.?)+)

2nd abbreviation
  Before break: (?<=\.[ ]\w+)\.
  After break: ([ ]+[A-Z]\w+\.?)+

Thanks a lot for your attention.
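For anyone who wants to try the rules outside Studio, here is a minimal Python sketch of one possible way a break rule and its exceptions could combine. It rests on an assumption that is not confirmed for Studio: a break is placed at a position when the break rule's "before" pattern matches the text ending there and its "after" pattern matches the text starting there, unless an exception rule matches at the same position. The helper names `rule_matches_at` and `segment` are hypothetical. Only the Full stop rule and the Lower case exception are used; the 2nd abbreviation rule relies on a variable-length lookbehind, which .NET accepts but Python's `re` module rejects.

```python
import re

# Patterns copied from the question: (before break, after break).
BREAK_RULE = (r"\w\.", r"$| |[ ]")            # Full stop
EXCEPTIONS = [(r"\w+\.+", r"[ ][a-z]")]       # Lower case exception


def rule_matches_at(text, i, before, after):
    """True if `before` matches text ending at i and `after` matches at i.

    Assumed model only; Studio's real matching logic may differ.
    """
    before_ok = re.search(r"(?:" + before + r")\Z", text[:i]) is not None
    after_ok = re.compile(r"(?:" + after + r")").match(text, i) is not None
    return before_ok and after_ok


def segment(text, break_rule=BREAK_RULE, exceptions=EXCEPTIONS):
    """Split text wherever the break rule fires and no exception does."""
    segments, start = [], 0
    for i in range(1, len(text) + 1):
        fires = rule_matches_at(text, i, *break_rule)
        blocked = any(rule_matches_at(text, i, b, a) for b, a in exceptions)
        if fires and not blocked:
            segments.append(text[start:i].strip())
            start = i
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    sample = "Offline from Jan. to Feb. Be prepared. Markit iBoxx USD Liq. Invst. Grade"
    for seg in segment(sample):
        print(repr(seg))
```

Under this model the lower-case exception keeps "Jan. to" together, but the label is still cut after "Liq." and "Invst.", which is the behaviour the abbreviation exceptions are meant to prevent.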

Paul · Answer

Hi Linda, 
 The regex flavour used by Studio is .NET. I had a quick play with your example and can get this, which is what I think you're after:

However, I think you will have to use abbreviation lists, because otherwise I'm not sure how you would distinguish between a real end of sentence and some of your abbreviations. If you already have your lists, you can import them all in one go with a bit of a workaround. 
 I achieved this with no custom segmentation rules at all, but I did add the abbreviations for the time being: 
 Liq. Invst. 
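
As a rough illustration of why an abbreviation list helps here, the following Python sketch breaks after a period only when the word ending in that period is not in a list. It is an assumed stand-in for the idea, not Studio's implementation: the `ABBREVIATIONS` set, the `CANDIDATE` pattern, and the `split_sentences` helper are all hypothetical, and the list holds only the two entries shown above.

```python
import re

# Assumption: stands in for Studio's abbreviation list (two entries only).
ABBREVIATIONS = {"Liq.", "Invst."}

# Candidate break: a period followed by a space and an upper-case letter or digit.
CANDIDATE = re.compile(r"\.(?=[ ][A-Z0-9])")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Break at candidate periods unless the preceding word is a listed abbreviation."""
    segments, start = [], 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        # Word that ends at this period, e.g. "Liq." or "polls."
        token = text[start:end].split(" ")[-1]
        if token in abbreviations:
            continue  # listed abbreviation: keep the segment together
        segments.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    print(split_sentences("Markit iBoxx USD Liq. Invst. Grade Classification (iBoxxUSDLIG)"))
    print(split_sentences(
        "Number of electoral votes based on latest polls. "
        "270 votes are required to win the election."
    ))
```

The limitation mentioned above still applies: a sentence that genuinely ends in "Liq." would not be split, because the list cannot tell an abbreviation from a sentence end.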
 
 I then did the rest with placeholder rules on the Excel filetype: 
 <(?:br|BR)> \n\n # {\d+} 
 
 You need to set the first two to "Exclude" in the segmentation hint under Advanced. 
 Regards 
 Paul

Paul · Answer

Hi Linda Beauvais, 
 
I have had a play with your rules and can't make them work in 2015 or 2017... I get exactly the same result in both. Can you provide a TM that works for you in 2015 and not 2017 and then I can investigate this more easily? 
 
Thank you 
 
Paul
