Segmentation question

I'm trying to force segmentation in xlz files which contain a lot of strings like this one:

...Joseph Lubin.[5] In 2014, 

where Trados doesn't seem to want to segment at the full stop. I tried a pair of before and after segmentation rules like this:

.    .\[*\]

.\[*\]   .

but nothing seems to happen. Clearly I'm doing something wrong - again :) 

  •  

    Thanks... I know what it is, but this is most likely why you have a problem. XLIFF can be resegmented only if it contains source text alone. Once translations are added, the segmentation is fixed, and changing it is no longer possible without affecting the target content. Most CAT tools that respect XLIFF will not resegment a bilingual XLIFF. For example, take this monolingual file:

    <?xml version='1.0' encoding='UTF-8'?>
    <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
      <file source-language="en-GB" target-language="de-DE" datatype="plaintext" original="file.txt">
        <body>
          <trans-unit id="1">
            <source>One of the key figures was Vitalik Buterin.[3] In 2013,</source>
          </trans-unit>
          <trans-unit id="2">
            <source>The project gained momentum with the help of Gavin Wood.[7] In 2015,</source>
          </trans-unit>
          <trans-unit id="3">
            <source>Among the early contributors was Charles Hoskinson.[2] In 2016,</source>
          </trans-unit>
          <trans-unit id="4">
            <source>Leadership soon included Aya Miyaguchi.[4] In 2018,</source>
          </trans-unit>
          <trans-unit id="5">
            <source>Another notable participant was Elizabeth Stark.[6] In 2017,</source>
          </trans-unit>
        </body>
      </file>
    </xliff>
    

    If I preview this file, a monolingual XLIFF, I get this:

    Screenshot of Trados Studio preview showing unsegmented XLIFF file with English text. File named 'unsegmented.xliff' with segments numbered 1 to 5.

    I can use a simple rule on the filetype to make the bracketed reference number (the [nr] after the full stop) an excluded tag, which is probably what will help you:

    (?<=\.)\[\d+\]

    This gets me:

    Screenshot of Trados Studio preview showing unsegmented XLIFF file with English text. Segments are displayed without numbers, file named 'unsegmented.xliff'.

    But if I try it with a bilingual:

    <?xml version='1.0' encoding='UTF-8'?>
    <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
      <file source-language="en-GB" target-language="de-DE" datatype="plaintext" original="file.txt">
        <body>
          <trans-unit id="1">
            <source>One of the key figures was Vitalik Buterin.[3] In 2013,</source>
            <target>Eine der Schlüsselfiguren war Vitalik Buterin.[3] Im Jahr 2013,</target>
          </trans-unit>
          <trans-unit id="2">
            <source>The project gained momentum with the help of Gavin Wood.[7] In 2015,</source>
            <target>Das Projekt gewann mit Hilfe von Gavin Wood an Dynamik.[7] Im Jahr 2015,</target>
          </trans-unit>
          <trans-unit id="3">
            <source>Among the early contributors was Charles Hoskinson.[2] In 2016,</source>
            <target>Zu den frühen Mitwirkenden gehörte Charles Hoskinson.[2] Im Jahr 2016,</target>
          </trans-unit>
          <trans-unit id="4">
            <source>Leadership soon included Aya Miyaguchi.[4] In 2018,</source>
            <target>Zur Führung gehörte bald Aya Miyaguchi.[4] Im Jahr 2018,</target>
          </trans-unit>
          <trans-unit id="5">
            <source>Another notable participant was Elizabeth Stark.[6] In 2017,</source>
            <target>Eine weitere bemerkenswerte Teilnehmerin war Elizabeth Stark.[6] Im Jahr 2017,</target>
          </trans-unit>
        </body>
      </file>
    </xliff>
    

    I'll get this:

    Screenshot of Trados Studio preview showing segmented XLIFF file with English source and German target text. File named 'segmented.xliff' with segments numbered 1 to 5.

    This is because segmentation takes place on the source. If the target is already populated, Studio doesn't know how to split the existing translation to match, so it refuses to resegment.
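If you want to check a batch before importing it, a small script along these lines could flag which files already contain translations. This is only a sketch in Python using the standard library, not anything Studio provides: the function name `has_translations` is made up for illustration, and it assumes XLIFF 1.2 files that use the same namespace as the samples above.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 namespace, as declared in the sample files in this thread
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def has_translations(xliff_text: str) -> bool:
    """Return True if any trans-unit contains a non-empty <target>."""
    root = ET.fromstring(xliff_text.encode("utf-8"))
    for target in root.iterfind(".//x:trans-unit/x:target", NS):
        # itertext() also collects text nested inside inline elements
        if "".join(target.itertext()).strip():
            return True
    return False


# Cut-down versions of the two sample files (one trans-unit each)
MONOLINGUAL = """<?xml version='1.0' encoding='UTF-8'?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="en-GB" target-language="de-DE" datatype="plaintext" original="file.txt">
    <body>
      <trans-unit id="1">
        <source>One of the key figures was Vitalik Buterin.[3] In 2013,</source>
      </trans-unit>
    </body>
  </file>
</xliff>
"""

BILINGUAL = MONOLINGUAL.replace(
    "</source>",
    "</source>\n        <target>Eine der Schlüsselfiguren war Vitalik Buterin.[3] Im Jahr 2013,</target>",
)

print(has_translations(MONOLINGUAL))
print(has_translations(BILINGUAL))
```

A file for which this reports translations is one where, as described above, Studio would keep the existing segmentation.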

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

    [edited by: RWS Community AI at 10:25 PM (GMT 1) on 14 Apr 2025]
  • Thanks Paul - this huge batch happens to be monolingual, so that would do me for now. But which file type exactly do I add this exclusion to and how? Also, is it likely to mess up segmentation of other stuff?

  •  

    You add it to the XLIFF filetype as this supports localisation zips as well.

    "Also, is it likely to mess up segmentation of other stuff?"

    Not if you create a project template and create your projects with this requirement using that specific template.  That way you can have different settings for the same filetype for different customers/use cases.

    Paul Filkin | RWS Group

  • Under Embedded Content? It looks like the only place where I could add that eye-watering thing above. But it's greyed out.

    Screenshot of Trados Studio Options dialog showing Embedded Content settings with 'Enable embedded content processing' option greyed out and unselected.

    [edited by: RWS Community AI at 9:31 AM (GMT 1) on 15 Apr 2025]
  •  

    "But it's greyed out"

    Simply because you have not ticked "Enable embedded content processing", which is what this panel is all about. Also make sure you check "Extract in all paragraphs".

    "that eye-watering thing above"

    Full Pattern: (?<=\.)\[\d+\]

    This pattern matches text like [123], but only if it is immediately preceded by a full stop (.).

    (?<=\.)

    • This is a positive lookbehind.

    • It asserts that what immediately precedes the match is a literal full stop (.).

    • The (?<=...) part is a zero-width assertion — it checks the condition, but doesn’t include it in the match.

    • So: it matches only if the preceding character is a full stop, but the full stop is not included in the match.

    \[

    • Matches a literal opening square bracket [.

    • Square brackets have a special meaning in regex (they define character classes), so to match a literal bracket, it must be escaped with a backslash (\).

    \d+

    • \d matches any digit (equivalent to [0-9]).

    • + means one or more.

    • Together, \d+ matches any number with at least one digit — e.g. 1, 42, 1000.

    \]

    • Matches a literal closing square bracket ], again escaped because ] also has special meaning in regex.
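If you'd like to see the pattern in action outside Studio, here is a minimal sketch in Python. Bear in mind that Studio has its own regex engine, so this only illustrates the pattern itself (a simple one like this should behave the same way); the sample strings other than the first are invented for the demonstration.

```python
import re

# The pattern from this thread: a bracketed number directly after a full stop
REFERENCE_MARKER = re.compile(r"(?<=\.)\[\d+\]")

samples = [
    "One of the key figures was Vitalik Buterin.[3] In 2013,",  # marker right after a full stop
    "See note [12] for details.",                                # no full stop before the bracket
    "The list ended.[a] Then",                                   # brackets do not contain digits
]

for text in samples:
    # findall returns every matched marker; the full stop itself is never included
    print(REFERENCE_MARKER.findall(text))
```

Only the first sample produces a match, and the match is just the bracketed number: the lookbehind checks for the full stop without making it part of the match.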

    Paul Filkin | RWS Group

  • Smashin, that works a treat, thanks very much!

    Thanks also for the explanation but I stand by 'eye watering' - there's a reason I'm a very good translator, rather than a software dev :)
