Segmentation rule | exceptions | full stop rule

Hi all,

Does anyone know whether it is possible to stop segments being split at a full stop that is not the end of a sentence? For example, in people's names such as 'F.G. de Groot', but also after a common abbreviation when the following word starts with an uppercase letter. I've noticed that it doesn't split the segment when the full stop is followed by a word starting with a lowercase letter. I've also noticed that Trados Studio 2022 sometimes breaks up hyperlinks after the :\\, and these then have to be merged again manually.

I hope you can see what I'm getting at and have some solutions I haven't tried yet. I've tried \p{Lu} before the break as an exception to the full stop rule (as suggested in another post in this forum), which seems to work for people's names (thank god), but that appears to be only part of the problem. I'm not exactly an expert on what every bit of a regular expression means, so I'm not sure what I need to add or delete to get the result I'm after.

Thanks in advance,

Charley

  •  

    I created a small test file to have a play with:

    ### 🔹 **Names and Initials (should \*not\* segment):**
    
    1. The manuscript was signed by F.G. de Groot.
    2. Please refer to the comments from A.J. Smith and C.P. Haanstra.
    3. The case was ruled on by M.L. King Jr. in 1964.
    
    ------
    
    ### 🔹 **Abbreviations followed by Uppercase (should \*not\* segment):**
    
    1. This was confirmed in the meeting with Prof. Andrew Marks.
    2. The project will begin in Jan. 2025, as planned.
    3. The goods were delivered by DHL Exp. Services.
    
    ------
    
    ### 🔹 **Abbreviations followed by lowercase (correctly handled):**
    
    1. The error occurred at approx. 4pm yesterday.
    2. This was agreed upon by e.g. several key stakeholders.
    
    ------
    
    ### 🔹 **Standard sentence endings (should segment):**
    
    1. The client approved the text. We may proceed with publication.
    2. I contacted the team. They responded within the hour.
    
    ------
    
    ### 🔹 **Problematic hyperlink/URL splitting (should \*not\* segment):**
    
    1. Please visit www.example.com/.../start.html for more details.
    2. This is hosted at http:\server.domain.local\shared\folder\file.txt
    3. The tool can be found at downloads.example.org/.../index.zip
    
    ------
    
    ### 🔹 **Other edge cases (optional, for thoroughness):**
    
    1. “Etc.” is not a reason to stop being precise. This should be clear.
    2. The company is based in the U.S. It operates globally.
    3. Refer to para. 3 in the contract. This outlines your obligations.
    
    ------
    
    

    Then I opened it against a default TM in Trados Studio 2022:

    Screenshot showing segmentation rules for names and initials, highlighting examples like F.G. de Groot and A.J. Smith that should not segment.

    Observations:

    1. The first examples segment on A.J., C.P. and M.L. when I don't want them to, so I add these as abbreviations:
      Close-up screenshot of the 'Names and Initials' section, showing examples like F.G. de Groot and M.L. King Jr. that should not segment.

    2. The next set segments on Exp., so I repeat the exercise for that:
      Screenshot showing segmentation rules for abbreviations followed by uppercase, with examples like Prof. Andrew Marks and Jan. 2025 that should not segment.

    3. Abbreviations followed by lowercase are all good.

    4. Standard sentence endings are all good.

    5. The problematic hyperlink/URL examples are all good.

    6. Other edge cases: it shouldn't have segmented on "para.", so I add that as well:
      Screenshot showing examples of edge cases, including 'Etc.' and 'U.S.', which should not segment, and a reference to para. 3 in a contract.

    So no new segmentation rules are needed for any of these. Everything is either handled correctly by default or fixed by adding to the abbreviations list.
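To make the idea concrete outside of Studio, here is a rough Python sketch of how a full stop rule, an abbreviation list and an "uppercase letter before the break" exception can interact. It is only an illustration under my own assumptions, not how Trados Studio is implemented: the `segment` function, the `ABBREVIATIONS` set and the `skip_initials` flag are all hypothetical names, and `skip_initials` is just a loose analogue of the \p{Lu} exception mentioned in the question.

```python
import re

# Hypothetical abbreviation list (an assumption for this sketch; in Studio
# the real list is edited in the language resources of the TM/template).
ABBREVIATIONS = {"Prof.", "Jan.", "approx.", "e.g."}

# Candidate break position: a full stop followed by whitespace.
CANDIDATE = re.compile(r"\.\s+")


def segment(text, abbreviations=ABBREVIATIONS, skip_initials=False):
    """Split text at full stops, honouring a few simple exceptions."""
    segments = []
    start = 0
    for match in CANDIDATE.finditer(text):
        next_char = text[match.end():match.end() + 1]
        # A lowercase letter after the stop is never treated as a new sentence.
        if not next_char or next_char.islower():
            continue
        # Last whitespace-delimited token, including its full stop.
        token = text[start:match.start() + 1].split()[-1]
        # Exception 1: the token is a known abbreviation.
        if token in abbreviations:
            continue
        # Exception 2 (optional): an uppercase letter sits right before the
        # stop, roughly like putting \p{Lu} before the break.
        if skip_initials and len(token) >= 2 and token[-2].isupper():
            continue
        segments.append(text[start:match.start() + 1])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    sample = "Please refer to the comments from A.J. Smith and C.P. Haanstra."
    print(segment(sample))                                      # splits on the initials
    print(segment(sample, ABBREVIATIONS | {"A.J.", "C.P."}))    # one segment
    print(segment(sample, skip_initials=True))                  # one segment
```

The trade-off shows up quickly: in this sketch the blanket uppercase exception also stops "U.S. It operates globally." from splitting, whereas adding specific entries to the abbreviation list only affects the tokens you list.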

    If you have already done this, then I think your problem is either that your source document contains more than just plain text, or that you have competing rules that conflict.

    Paul Filkin | RWS Group

    Generated Image Alt-Text
    [edited by: RWS Community AI at 1:35 PM (GMT 1) on 14 Jul 2025]