Segmentation: how to segment after full stop if followed by a tag?

I'm trying to teach Studio to start a new segment when it comes across a full stop followed by a tag, but so far I haven't managed anything usable.

Here is the source sentence:
When operating the engine switch, one short, firm press is enough.<ch id="1234" text="HardReturn" />It is not necessary to press and hold the switch.

I've tried adding an additional full-stop rule, with either a simple <[^>]+> or a lookahead (?=<[^>]+>) as the text after the break, and even with the angle brackets escaped, to no avail… Any help or suggestions are welcome!
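
For reference, the lookahead idea can be checked outside Studio. The following is only a sketch in Python's re module, not Studio's own rule editor or regex engine, so it shows what the pattern matches on the sample sentence and nothing about how Studio applies it. The names source and BREAK are made up for the illustration.

```python
import re

# Sample sentence from the post above.
source = (
    'When operating the engine switch, one short, firm press is enough.'
    '<ch id="1234" text="HardReturn" />'
    'It is not necessary to press and hold the switch.'
)

# Zero-width break point: preceded by a full stop, followed by a tag.
# Assumption: this mimics a "before break" / "after break" pair; Studio's
# actual rule handling may differ.
BREAK = re.compile(r'(?<=\.)(?=<[^>]+>)')

# Splitting on an empty match needs Python 3.7 or later.
segments = BREAK.split(source)
for segment in segments:
    print(repr(segment))
```

In this sketch the text splits into two pieces, with the tag staying at the start of the second one. That suggests the pattern itself is sound and that the obstacle may lie in how the tag is exposed to the segmentation rules.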

 

  • Hi,

    Instead of using an expression in "After break", just use "Anything", as you have before the break. This worked in the small XML file I knocked up.

    I don't know where your source file came from or what format it was, but if it was XML it would be better to create a rule for the tag rather than use segmentation. I guess you know this, though, so it would be interesting to know what kind of source file it was, to be able to test this properly.

    Regards

    Paul

    Paul Filkin | RWS Group


  • Hi Paul, and thanks for your quick reply!

    It is indeed an XML file, but since tags and specifically such hard return tags can also occur in the middle of a sentence, I don't want to define them as break points in the file type settings, hence my quest to find a working pattern for the segmentation rules.

    What I'm trying to achieve is to have a full stop followed by a tag (and then, ideally, an uppercase letter) act as the break character.

    I'll follow your pointer and try \.(<[^>]+>) and \.(?=<[^>]+>) as break characters, followed by anything. But wouldn't a plain full stop as the break character, followed by anything, also break at decimal and version numbers, for instance?
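
The concern about decimal and version numbers can be illustrated with a small comparison. This is again a hedged sketch in Python's re, not Studio's segmentation engine. The sample text and the names naive and strict are invented for the example, and [A-Z] covers ASCII uppercase letters only.

```python
import re

# Made-up sample containing a version number, a tag and a decimal.
text = (
    'Use firmware version 2.5.1 or later.'
    '<ch id="1" text="HardReturn" />'
    'Torque the bolt to 4.5 Nm.'
)

# Hypothetical rule A: break after every full stop, whatever follows.
naive = re.compile(r'(?<=\.)(?=.)')

# Hypothetical rule B: break only when a tag and then an
# uppercase letter follow the full stop.
strict = re.compile(r'(?<=\.)(?=<[^>]+>[A-Z])')

naive_parts = naive.split(text)
strict_parts = strict.split(text)

print(len(naive_parts), naive_parts)
print(len(strict_parts), strict_parts)
```

Rule A cuts inside 2.5.1 and 4.5, while rule B breaks only at the full stop in front of the tag.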

  • This didn't work either, but what did the trick was checking the box in the file type settings that lets the tag act as a word end inside text: