I do not want a page break character to create a new segment.

I have recently taken a deep dive back into Studio segmentation rules.

One thing I want is to not break a segment based on a page break.

If a sentence is split by a page break, I want the segment to be the whole sentence.

I have searched high and low, but cannot find a way to do this.

The only things I could find are articles about errors that occur when segments are merged across page breaks.

  • Dear Paul,

    I see now how you are baffled by something that is absolutely normal for our workflow.

    The thing is that we very often convert scanned PDFs into Word. In the original Word document there was no page break, and the sentence or paragraph simply flowed over onto the next page, but we put a page break in so that we can keep parallel pagination in the final source and target files.

    I think the answer may lie in replacing ^m (the manual page break) with XXX, or probably better ~~~, as an automated pre-edit in Word before the file goes into Studio (we already have a macro that does a large number of pre-edits to prepare a file for optimal MT). The key step is then editing manually (or running a macro) after export from Studio back into Word to replace the ~~~ with page breaks, so that we get our nice parallel source and target pagination again in the target Word file.

    This will still be much easier than just putting page breaks in the target file by comparing it to the source file after translation, which can be very tedious, especially at the end of a lengthy translation.

    In Studio we will get a slightly messy fuzzy match, but the ~~~ will make it clear why, and that is better than splitting the sentence into two segments.

    A bit of a messy workaround, but we can standardize this and make it workable.
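
    To make the round trip concrete, here is a minimal sketch of the idea in Python rather than as a Word macro. It works on plain strings only and assumes the manual page break (^m in Word's Find box) shows up as a form feed character; the function names are hypothetical, and a real .docx file would still need a Word macro or a document library to apply this.

```python
import re

# Assumption: a manual page break appears as a form feed (character code 12)
# once the document text is handled as a plain string.
PAGE_BREAK = "\f"
PLACEHOLDER = "~~~"


def protect_page_breaks(text: str) -> str:
    """Pre-edit: swap each page break for ' ~~~ ' so the sentence stays whole."""
    # Absorb spaces or tabs around the break so exactly one space is left
    # on each side of the placeholder.
    return re.sub(r"[ \t]*\f[ \t]*", f" {PLACEHOLDER} ", text)


def restore_page_breaks(text: str) -> str:
    """Post-edit: swap each ~~~ (and the spaces around it) back to a page break."""
    return re.sub(r"[ \t]*~~~[ \t]*", PAGE_BREAK, text)


source = "The sentence flowed over\finto the next page."
protected = protect_page_breaks(source)   # goes into Studio as one segment
restored = restore_page_breaks(protected)  # run on the exported target text
```

    Note that the sketch does not preserve whatever spacing originally surrounded the page break, and the restore step puts the break wherever the translator left the ~~~ in the target sentence.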

    My long-term question would be: if Studio doesn't break on some tags, such as <bold>, why are other tags absolute segment breakers? I read your other posts on line breaks, and it seems these can go either way in Studio. Are paragraph markers more like page breaks or line breaks?

    Thanks for taking a look at the NYT article - that was fun, and I felt it was important to speak up and share some of the good news in this crisis.

    Heads-up - I had a huge nightmare with a Language Resource Template last night, but I will be posting about that in a separate thread.

    Thanks again for your excellent advice and insightful questions.

    Best regards,

    Tom
