Catching </ol> and </ul>

Hi all,

This has been driving me nuts for a while now. I'm using embedded content (legacy) in an XML file type to catch some common HTML tags, among other things. All in all this is pretty straightforward, but HTML lists are giving me quite a headache.

Right now I have this in there:

Start tag: <[o|u]l>

End tag: </[o|u]l>

Segmentation hint: Exclude

What happens is that everything after the first list in a file--either ordered or unordered--is not extracted. I tried all kinds of variations of the above expressions, and also used separate tag pairs for ordered and unordered lists, but the result is always the same.
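A side note on the expressions above, as a minimal sketch. It uses Python's `re` module purely as a stand-in, on the assumption that the filter's own regex engine treats character classes the same way. Inside a character class the `|` is a literal character, not alternation, so `[o|u]` also accepts `|`; `[ou]` is the tighter form. This probably does not explain the extraction problem by itself, since `<|l>` would not occur in real HTML, but it removes one variable.

```python
import re

# Patterns as given in the post. Python's re is only a stand-in for the
# filter's regex engine (an assumption).
start_tag = re.compile(r"<[o|u]l>")
end_tag = re.compile(r"</[o|u]l>")

# Tighter equivalents: members of a character class need no separator.
start_tag_strict = re.compile(r"<[ou]l>")
end_tag_strict = re.compile(r"</[ou]l>")

# Made-up test strings, including the unintended "<|l>" forms.
samples = ["<ol>", "<ul>", "<|l>", "</ol>", "</ul>", "</|l>"]


def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if pattern.fullmatch(s)]


original_start = matches(start_tag, samples)
strict_start = matches(start_tag_strict, samples)
original_end = matches(end_tag, samples)
strict_end = matches(end_tag_strict, samples)

print("original start:", original_start)
print("strict start:  ", strict_start)
print("original end:  ", original_end)
print("strict end:    ", strict_end)
```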

I should also mention that, unfortunately, the embedded content processors that were introduced recently are not an option, because they are not available in WorldServer.

I'd be grateful for any pointers to fix this.

Stephan

  • Why not take a different approach? I think you were correct the first time: the \n does not need to be escaped unless you are trying to find the literal backslash in \n rather than a line feed. So if you remove the \n rule altogether and create a segmentation rule in the TM instead, you may have more success.

    I have not tried to recreate your situation, but \n is quite a catch-all and may be causing a problem elsewhere in the file tagging. If you add it as a segmentation rule instead, which is what you are trying to achieve, you may get the desired result.
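The escaping point can be checked quickly. This is a minimal sketch that uses Python's `re` module as a stand-in for the filter's regex engine (an assumption), with a made-up sample string: the pattern `\n` matches a line feed, while `\\n` matches a literal backslash followed by `n`.

```python
import re

# Hypothetical sample: one real line feed, plus one literal backslash
# followed by the letter "n".
text = "first line\nsecond line with a literal \\n in it"

line_feed = re.compile(r"\n")     # unescaped: matches the line feed character
literal_seq = re.compile(r"\\n")  # escaped: matches a backslash followed by "n"

line_feed_hits = [m.start() for m in line_feed.finditer(text)]
literal_hits = [m.start() for m in literal_seq.finditer(text)]

print("line feed found at:", line_feed_hits)
print("literal backslash-n found at:", literal_hits)
```

The two patterns hit different places in the string, so escaping the backslash only makes sense when the file really contains the two characters `\` and `n`.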

    Paul Filkin | RWS Group

