Segment source strings when length is greater than 1900 characters

Hello,

I have some documents that contain long sentences listing chemical compounds separated by semicolons, like "Phosphoric acid; Nitric Acid;...". The segments can reach 6000 characters in some cases. Translators complain about these long segments, and some MT engines even refuse to translate such long segments in batch tasks.

I don't want to make " ; " break segments everywhere, because it is also used in other standard phrases that need to be translated together with their whole context.

So I tried to create a segmentation rule in the TM that breaks on the segment length and " ; ", but it does not work. I have tried a lot of variants, but maybe what I am looking for is not possible.

Maybe some of you can help with this?

Thanks in advance.

  • "Probability score": 0.7

    "Question improvement suggestion": Please provide more information about the exact version of Trados Studio you are using, the operating system, and any recent changes you have made to the system or software. Also, it would be helpful if you could share the exact steps you have taken to create the segmentation rule and the specific error message, if any, that you received.

    In the meantime, based on the information you have provided, here is a suggested answer to your question:

    Hello,

    It seems like you're trying to create a segmentation rule in Trados Studio to break up long segments of text. While Trados Studio does allow for custom segmentation rules, it doesn't support segmentation based on character count. The segmentation rules in Trados Studio are based on punctuation and specific text patterns, not the length of the segment.

    However, there are a few workarounds you might consider:

    1. Manual segmentation: You can manually split the long segments during the translation process. This can be done by placing the cursor at the desired split point and using the 'Split Segment' command (shortcut: Alt+Shift+T).

    2. Use of external tools: There are external tools available that can split long sentences into smaller ones based on character count before you import the document into Trados Studio.

    3. Adjusting the segmentation rules: While you mentioned that you don't want to use the semicolon (;) as a segment breaker, you might consider creating a specific rule for these chemical compound lists. For example, you could create a rule that breaks on the pattern "; " (semicolon followed by a space), which might be less common in your other texts.

    Remember to always test your new rules on a sample document to ensure they work as expected before applying them to your entire project.
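
    As a rough illustration of workaround 2, a pre-processing script could split the compound lists before the document is imported. The Python sketch below is not an RWS tool; the function name split_long_list, the 1900-character limit, and the "; " separator are assumptions taken from the example in this thread.

```python
def split_long_list(text: str, limit: int = 1900, sep: str = "; ") -> list[str]:
    """Split a separator-delimited list into chunks of at most `limit` characters.

    Chunks are only cut at `sep`, so a single item longer than `limit`
    is kept whole. Every chunk except the last keeps its trailing ";".
    """
    if len(text) <= limit:
        return [text]

    trailing = sep.rstrip()  # the ";" that stays at the end of a chunk
    chunks: list[str] = []
    current = ""
    for item in text.split(sep):
        candidate = item if not current else current + sep + item
        # Close the current chunk if adding this item would exceed the limit.
        if current and len(candidate) + len(trailing) > limit:
            chunks.append(current + trailing)
            current = item
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample data standing in for a real compound list.
sample = "; ".join(f"Compound {i}" for i in range(600))
for chunk in split_long_list(sample):
    print(len(chunk), chunk[:40])
```

    Each chunk could then be written to its own paragraph in the source file, so that the default paragraph-based segmentation produces shorter segments while semicolons in other phrases are left alone.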

    I hope this answer addressed your question. If not, please reject the answer, add the question improvement suggestion information, and click the RWS Community AI checkbox on the right to try again.

    RWS Community AI

  • You said "you could create a rule that breaks on the pattern "; " (semicolon followed by a space)"

    What I need is a rule that breaks at ";" only if more than 1900 characters precede it. Is that possible?

  • "Probability score": 0.6

    "Question improvement suggestion": Please provide more information about the exact version of Trados Studio you are using, the operating system, and any recent changes you have made to the system or software. Also, it would be helpful if you could share the exact steps you have taken to create the segmentation rule and the specific error message, if any, that you received.

    In the meantime, based on the information you have provided, here is a suggested answer to your question:

    Hello,

    Unfortunately, Trados Studio's segmentation rules do not support breaking segments based on character count. The segmentation rules are based on punctuation and specific text patterns, not the length of the segment. Therefore, creating a rule that breaks at a semicolon (;) only if more than 1900 characters precede it is not possible within Trados Studio's current capabilities.

    However, you might consider using external tools that can split long sentences into smaller ones based on character count before you import the document into Trados Studio.

    RWS Community AI

  •  

    The problem with your regex is that {1900,} is greedy: it matches as many characters as possible. You can test your regexes here: https://regex101.com/

    Screenshot showing a regex pattern highlighted with the numbers 1900 followed by a comma and a curly brace, indicating a greedy match for 1900 or more characters.

    Instead of your regex:

    ^.{1900,}

    this regex will do the trick:

    .{1900}

    I did a quick test and it seemed to work!
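
    To see the difference outside Trados Studio, here is a small Python sketch. Python's re module is not the regex engine Trados Studio uses, but greedy and fixed quantifiers behave the same way for patterns this simple. The trailing ; in both patterns, the sample text, and the break_positions helper are assumptions added for illustration; they are not the exact rule from the screenshot.

```python
import re


def break_positions(pattern: str, text: str) -> list[int]:
    """Return the end offset of every non-overlapping match of `pattern`."""
    return [m.end() for m in re.finditer(pattern, text)]


# Hypothetical compound list, several thousand characters long.
text = "; ".join(f"Compound {i}" for i in range(500))

# Greedy: ".{1900,}" swallows as much as it can, so the only match
# runs from the start of the text up to the LAST semicolon.
greedy_breaks = break_positions(r"^.{1900,};", text)

# Fixed: exactly 1900 characters followed by a semicolon, so a match
# ends at the first semicolon that has 1900 characters before it,
# and the search then starts again from that point.
fixed_breaks = break_positions(r".{1900};", text)

print("text length:  ", len(text))
print("greedy breaks:", greedy_breaks)
print("fixed breaks: ", fixed_breaks)
```

    With the greedy pattern there is a single break near the end of the text, while the fixed-length pattern produces a break roughly every 1900 characters, always right after a semicolon.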

    [edited by: RWS Community AI at 10:08 AM (GMT 1) on 21 Apr 2025]
  • Hi Jesus,

    Yes, it works perfectly.

    Thanks a lot for your help!

    Warm regards from your old colleague.

    Nacho Zanon.
