I do not want a page break character to create a new segment.

I have recently taken a deep dive back into Studio segmentation rules.

One thing I want is to not break a segment based on a page break.

If a sentence is split by a page break, I want the segment to be the whole sentence.

I have searched high and low, but cannot find this.

The only things I could find were articles about errors that occur when segments are merged across page breaks.

Parents Reply
  • Thanks for the small file.  I don't think you have a solution in Studio though because these page breaks are not simple character codes, they are actual tags and Studio will move these into an external position and split the segment either side.  If you view "All Content" you can see them:

    Screenshot of Trados Studio showing text in a foreign language with a page break tag highlighted in the content.

    The best I can suggest is that you search and replace them in Word:

    Screenshot of the Find and Replace dialog in Microsoft Word with the 'Find what' field containing a caret and 'm' symbol, and the 'Replace with' field empty.

    So search for ^m and replace it with nothing at all.  That will remove all the page breaks in your document, and if you wanted Studio to ignore them anyway this may be the way forward. The only problem is putting them back afterwards.  If you need to do that, perhaps replacing ^m with something like (XXX) would help, as you could repeat the process in the target but replace (XXX) with ^m. Not great for your TM but

    Quite a tricky one I think... how did the document get produced like this in the first place?

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

    Generated Image Alt-Text
    [edited by: Trados AI at 9:04 PM (GMT 0) on 28 Feb 2024]
Children
  • Dear Paul,

    I see now how you are baffled by something that is absolutely normal for our workflow.

    The thing is that we very often convert scanned PDFs into Word.  In the original document in Word, there was no page break and the sentence/paragraph flowed over onto the next page, but we put a page break in so that we can keep parallel pagination in the final source and target files.

    I think the answer may lie in replacing ^m with XXX or probably better ~~~ as an automated pre-edit in Word (we already have a macro that does a large number of pre-edits to prepare a file for optimal MT) before it goes into Studio.  The key thing is manually editing (or having a macro) after export back from Studio into Word to replace the ~~~ with page breaks so that we get our nice parallel source and target pagination again in the target Word file.

    This will still be much easier than just putting page breaks in the target file by comparing it to the source file after translation, which can be very tedious, especially at the end of a lengthy translation.

    In Studio we will get a slightly messy fuzzy match, but it will be clear what it is with the ~~~ and better than splitting the sentence into segments.

    A bit of a messy workaround, but we can standardize this and make it workable.

    My long-term question would be: if Studio doesn't break on some tags, such as <bold>, why are other tags absolute segment-breakers?  I read your other posts on line breaks, and it seems these can go either way in Studio.  Are paragraph markers more like page breaks or line breaks?

    Thanks for taking a look at the NYT article - that was fun, and I felt it was important to speak up and share some of the good news in this crisis.

    Heads up - I had a huge nightmare with a Language Resource Template last night, but I will be posting about that in a separate thread.

    Thanks again for your excellent advice and insightful question.

    Best regards,

    Tom

  • Actually, we are manually inserting the page break, so we can just manually insert the custom page-break placeholder symbol(s) instead of putting in a page break at all.  We will still need to replace it in the target translation after export from Studio, but once again, this is easy.

  • A further reflection - this applies not only to scanned pdfs, but whenever you want to keep target file pagination parallel to source file pagination - even if this is a Word file.

    Much easier to mark this in the source rather than try to insert page breaks after the target file has been produced.
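
The placeholder round trip discussed above could be scripted rather than done by hand. The following is only a minimal sketch, not a tested production tool: it assumes the .docx has been unzipped and that `word/document.xml` is available as a string, and it relies on manual page breaks being stored in WordprocessingML as `<w:br w:type="page"/>`. The function names and the `~~~` marker are illustrative choices, and the restore step assumes the placeholder comes back from Studio intact inside a single text element.

```python
import re

# Marker suggested in the thread; any string that never occurs in the real text works.
PLACEHOLDER = "~~~"

# A manual page break in WordprocessingML is a break element of type "page".
PAGE_BREAK = '<w:br w:type="page"/>'
PAGE_BREAK_RE = re.compile(r'<w:br\s+w:type="page"\s*/>')


def mark_page_breaks(document_xml: str) -> str:
    """Pre-edit (before Studio): swap each manual page break for a text placeholder."""
    return PAGE_BREAK_RE.sub(f"<w:t>{PLACEHOLDER}</w:t>", document_xml)


def restore_page_breaks(document_xml: str) -> str:
    """Post-edit (after export): turn each placeholder back into a page break.

    Assumption: the placeholder sits inside a <w:t> element. The replacement
    closes that element, emits the break, and opens a new text element so the
    surrounding run stays well-formed.
    """
    replacement = f'</w:t>{PAGE_BREAK}<w:t xml:space="preserve">'
    return document_xml.replace(PLACEHOLDER, replacement)
```

Run `mark_page_breaks` on the source XML before creating the Studio project and `restore_page_breaks` on the exported target XML. If the translator deletes or alters the placeholder, the break is lost, so a QA check for `~~~` in every affected target segment would be a sensible companion.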

  • My long-term question would be: if Studio doesn't break on some tags, such as <bold>, why are other tags absolute segment-breakers?  I read your other posts on line breaks, and it seems these can go either way in Studio.  Are paragraph markers more like page breaks or line breaks?

    <bold> is an inline tag, unless it's a complete sentence in which case the tags are moved externally where there is a break in the segment anyway.

    Page breaks are treated as external tags, so they force a break.  You would not normally expect a page break to split a sentence unless it was an error or, as in your case, deliberate.

    Paul Filkin | RWS Group
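
The inline versus external distinction described in this reply can be illustrated with a toy model. This is a hypothetical sketch for intuition only and is not how Studio's segmentation engine is implemented: the `Tag` class, the `segment` function, and the simple "split after . ! ?" rule are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Tag:
    name: str
    external: bool  # True: forces a segment break; False: stays inside the segment


def segment(tokens):
    """Split a stream of text and tags into segments (toy model).

    Text is split after sentence-final punctuation, inline tags are kept in
    place, and external tags always close the current segment.
    """
    segments, current = [], ""
    for tok in tokens:
        if isinstance(tok, Tag):
            if tok.external:
                # External tag: whatever has accumulated becomes its own segment.
                if current.strip():
                    segments.append(current.strip())
                current = ""
            else:
                # Inline tag: carried along inside the segment text.
                current += f"<{tok.name}>"
        else:
            current += tok
            # Emit every complete sentence; keep the unfinished tail.
            *done, current = re.split(r"(?<=[.!?])\s+", current)
            segments.extend(s.strip() for s in done if s.strip())
    if current.strip():
        segments.append(current.strip())
    return segments


with_break = [
    "The report was ", Tag("bold", False), "very", Tag("/bold", False),
    " long and continued ", Tag("pagebreak", True), "on the next page. Done.",
]
with_placeholder = ["The report was long and continued ~~~ on the next page. Done."]
```

In this model, `segment(with_break)` cuts the first sentence in two at the page break even though the bold tags leave it alone, while `segment(with_placeholder)` keeps the sentence whole because `~~~` is ordinary text.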

  • I actually think you could treat page break as an inline tag, because, like other inline tags, there will almost always be another marker for a break, such as punctuation or a paragraph or table cell marker.

    When you press ctrl+enter in Word, it inserts not only a page break, but paragraph markers also.  You have to take out the paragraph markers additionally if you don't want them.

    As I wrote, I have a work-around, but I think theoretically, one could think about classifying a page break as an inline tag rather than an external one.

    I think it may be treated as an external tag because it is counter-intuitive to think of a page break as an inline tag, but classifying it as an inline tag could actually be more correct in theory and better in practice.

    I realize I need to work with my work-around in the short-medium term, but maybe this might be something to put in the idea bin long-term. :)

  • I actually think you could treat page break as an inline tag, because, like other inline tags, there will almost always be another marker for a break, such as punctuation or a paragraph or table cell marker.

    I don't see this use case at all.  It's too much of an exception and would be completely wrong for most users.  Automatic page breaks are added by Word and we don't set these as external tags because we know what they are.  Manual page breaks are different, and I think most users apply them when they have finished writing about one thing and now want to start a new page.  Typically this would not be in the middle of a sentence.

    But if you have a case for it I suggest you post your idea here:

    http://ideas.sdl.com

    Paul Filkin | RWS Group