Stubborn segmentation rule for soft breaks

Hello,

I'm preparing documents to be sent from my company to a translation company, and I can't get the TM's segmentation settings to break the text the right way. I could easily split or merge these segments while translating them myself, but I'll be sending the files out as a package and want the most cost-efficient setup for high matches.

Here's a part of the original ppt.

And here's what it looks like with a default TM without making any adjustments to the segmentation rules.

After following Paul Filkin's video (https://www.youtube.com/watch?v=kPaHs5xjWyU) and doing this, making sure that "before break" and "after break" are both set to "anything":

I wind up with this:

If I try removing the full-width periods from "break characters" within the default "terminating punctuation" rule (I know I could make an exception rule for when they follow ordinals), I get this:

It just seems like the soft break rule isn't working.

I'd happily send this page of the PDF to someone (Paul, Jesse?) via e-mail if they could tell me how to tweak my TM so that it breaks nicely.

Even better, if a line break rule could be added to break after tabs, like the one before P.7, that would be great. Also, the ideal break would be for numbers when preceded by a period. My ideal segments would look like this, with each line break representing a segment break:

0.
はじめに
0.1
用語説明
0.2
Dynamics CRM 基本動作
(1)
各メニューの表示方法
(2)
検索方法
1.
基本動作
・・・P.7
1.1
SRM起動
1.2
初期設定

And so on. Sorry, and thanks in advance.

Best regards,
Keenan
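
For reference, the ideal segmentation listed above can be sketched outside Studio as a plain regex split. This is only an illustration in Python, not a Studio segmentation rule: the sample string is hypothetical and assumes that the slide text uses tabs between the number, the title and the page reference, and a soft break (vertical tab, \x0b) between lines. The names SAMPLE, BREAK and split_segments are made up for the example.

```python
import re

# Hypothetical sample: assumes tabs between number, title and page reference,
# and a soft break (vertical tab, \x0b) between lines. The text actually
# extracted from the PPT may look different.
SAMPLE = (
    "0.\tはじめに\x0b"
    "0.1\t用語説明\x0b"
    "(1)\t各メニューの表示方法\x0b"
    "1.\t基本動作\t・・・P.7\x0b"
    "1.1\tSRM起動"
)

# Break on any run of tabs, soft breaks or hard line breaks.
BREAK = re.compile(r"[\t\x0b\r\n]+")


def split_segments(text: str) -> list[str]:
    """Return the non-empty pieces found between break characters."""
    return [piece for piece in BREAK.split(text) if piece]


segments = split_segments(SAMPLE)
for segment in segments:
    print(segment)
```

Under those assumptions, every number, title and page reference ends up as its own segment, which is the layout asked for in the list above. Whether Studio's rule editor can be made to behave the same way is a separate question.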

  • There seems to be something weird going on with the segmentation in Studio recently. I see changes in segmentation being mentioned in release notes (e.g. Studio 2015 SR3), but it's still not working correctly, IMO.

    Additionally, if you look through various SDL knowledge base articles, the mentioned regexes vary, probably depending on the actual writer of the article... :-\

    I have consistent problems with segmentation... I ended up defining this obscure line break rule:
    .?[\r\n][\r\n]
    because anything else simply doesn't separate segments as expected.
    One would expect that e.g. [\r\n]+ would give the same result with CRLF linebreaks, but it doesn't...

    And even with this rule I get weird stuff like this:

    SOURCE:

    RESULT:

    This is the beginning of the file.
    Then the segmentation "catches up" and works "sort-of okay" throughout the file... and at the end of the file I again get this:

    SOURCE:

    RESULT:

    Now, this obviously has nothing to do with the regex; there is something going on in the segmentation engine, or something...
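
As a side note on the two line break rules mentioned in this reply, the sketch below compares what each pattern matches on a CRLF string. It uses Python's re module, not the engine inside Studio, and the sample text is hypothetical, so it only shows how the patterns differ as plain regexes. It does not explain the odd segmentation at the start and end of the file.

```python
import re

# Hypothetical sample with CRLF line breaks, including one blank line.
TEXT = "one\r\ntwo\r\n\r\nthree"

PAIR_RULE = re.compile(r".?[\r\n][\r\n]")  # the rule quoted in the reply
RUN_RULE = re.compile(r"[\r\n]+")          # the rule expected to be equivalent


def spans(pattern, text):
    """Return (start, end, matched text) for every match of pattern in text."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


pair_matches = spans(PAIR_RULE, TEXT)
run_matches = spans(RUN_RULE, TEXT)

for name, matches in (("pair", pair_matches), ("run", run_matches)):
    print(name, [repr(matched) for _, _, matched in matches])
```

In this sketch the first pattern also swallows the character before each break and treats the blank line as a separate match, while the second collapses consecutive breaks into a single match. So the two are not strictly interchangeable even before Studio's own handling comes into play.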
