Catching </ol> and </ul>

Hi all,

This has been driving me nuts for a while now. I'm using embedded content (legacy) in an XML file type to catch some common html tags, among other things. All in all this is pretty straightforward, but html lists are giving me quite a headache.

Right now I have this in there:

Start tag: <[o|u]l>

End tag: </[o|u]l>

Segmentation hint: Exclude

What happens is that everything after the first list in a file--either ordered or unordered--is not extracted. I tried all kinds of variations of the above expressions, and also used separate tag pairs for ordered and unordered lists, but the result is always the same.
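For reference, here is a minimal sketch of what the two styles of start-tag expression match in a generic regex engine. It uses Python's `re` module, which is an assumption for illustration only; Studio's legacy embedded content processor may not behave identically. The sample string is made up.

```python
import re

# Hypothetical sample markup, shaped like the lists described in the post.
sample = "<ol><li>One</li></ol><ul><li>Two</li></ul>"

# Character class: [o|u] matches exactly one of 'o', '|' or 'u'.
char_class_start = re.compile(r"<[o|u]l>")
# Alternation: matches exactly 'ol' or 'ul' (non-capturing group).
alternation_start = re.compile(r"<(?:ol|ul)>")

class_hits = char_class_start.findall(sample)
alt_hits = alternation_start.findall(sample)

# The character class also accepts a stray "<|l>"; the alternation does not.
class_matches_pipe = char_class_start.fullmatch("<|l>") is not None
alt_matches_pipe = alternation_start.fullmatch("<|l>") is not None

print(class_hits, alt_hits, class_matches_pipe, alt_matches_pipe)
```

Both forms find the `<ol>` and `<ul>` tags here, so in this engine the choice between them does not explain content going missing after the first list.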

I should maybe also mention that, unfortunately, the embedded content processors that were introduced recently are not an option, because they are not available in WorldServer.

I'd be grateful for any pointers to fix this.

Stephan

  • Hi

    What happens if you use <(ol|ul)> and </(ol|ul)> instead? These should be more precise.

    _________________________________________________________

    When asking for help here, please be as accurate as possible. Please always remember to give the exact version of product used and all possible error messages received. The better you describe your problem, the better help you will get.

    Want to learn more about Trados Studio? Visit the Community Hub. Have a good idea to make Trados Studio better? Publish it here.

  • Hi Jerzy,

    thanks, but unfortunately this one also produces the same result. I also tested this in other applications and it works just fine. It's just Studio that won't cooperate, and I have no clue where I'm going wrong.

    I also tried escaping pretty much every single character just in case it does something unexpected, but to no avail.

  • This is indeed very strange. Maybe the element that contains the embedded content (a CDATA section, for example) is not defined properly? In my experience, this way of using embedded content should work. You could also try changing "Exclude" to "May Exclude".

  • Hi Stephan,

    It might help to see a sample of the XML file with the elements containing the HTML code you wish to handle. Could you also list the other rules you have created? One of the biggest problems with the legacy embedded content processor is that once you add many rules, they can easily overlap, which causes unexpected behaviour when the file is parsed.

    Paul Filkin | RWS Group

  • Thank you both for your help. The overlap was a good hint. I added the rules one by one to see where things go awry, and sure enough, nothing broke until the last one.

    All embedded content is in a CDATA section. Basically, this section can hold any html formatting. For example:

    <![CDATA[
    <ol>
    <li>Punkt 1</li>
    <li>Punkt 2</li>
    <li>Punkt 3</li>
    </ol>
    <ul>
    <li>Punkt 4</li>
    <li>Punkt 5</li>
    <li>Punkt 6</li>
    </ul>
    ]]>

    These are the rules. Everything works fine until I add \n, at which point any content that comes after the first list disappears.

    Start Tag   End Tag      Type          Translate   Segmentation
    <(ol|ul)>   </(ol|ul)>   Tag Pair      Yes         Exclude
    <li>        </li>        Tag Pair      Yes         Exclude
    <a.*?>      </a>         Tag Pair      Yes         Include
    <i>         </i>         Tag Pair      Yes         Include
    <b>         </b>         Tag Pair      Yes         Include
    <sub>       </sub>       Tag Pair      Yes         Include
    <sup>       </sup>       Tag Pair      Yes         Include
    \{\d\}      -            Placeholder   -           Include
    <br>        -            Placeholder   -           Exclude
    </br>       -            Placeholder   -           Exclude
    <br />      -            Placeholder   -           Exclude
    \n          -            Placeholder   -           Exclude

    The trouble is that for the output it doesn't seem to make a difference if a new line is triggered by a manual line break or a break tag, so there's no consistency in the source files and I have to segment at both <br> and manual breaks.

    I'm a bit lost now, because now that I was able to isolate \n as the cause of the problem, I have no idea how to prevent it.

    Thanks!

    Stephan


  • \n is wrong: the "\" must be escaped, so it should read \\n
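To make the two spellings concrete, here is a small sketch of what each pattern matches on the CDATA sample from this thread. It uses Python's `re` module as a stand-in; that Studio's legacy processor reads the rule field the same way is an assumption the thread does not confirm. The variable names and the `line one\nline two` test string are illustrative only.

```python
import re

# CDATA payload from the sample in this thread, with real line breaks.
cdata = (
    "\n<ol>\n<li>Punkt 1</li>\n<li>Punkt 2</li>\n<li>Punkt 3</li>\n</ol>\n"
    "<ul>\n<li>Punkt 4</li>\n<li>Punkt 5</li>\n<li>Punkt 6</li>\n</ul>\n"
)

# Pattern text as typed into a rule field: backslash + n (two characters).
single = re.compile(r"\n")
# Escaped variant: backslash + backslash + n (three characters).
double = re.compile(r"\\n")

# In this engine, \n matches every line-feed character in the payload.
line_break_hits = len(single.findall(cdata))
# \\n matches a literal backslash followed by 'n', which the payload lacks.
literal_hits_in_cdata = len(double.findall(cdata))
# A string that really contains the two characters '\' and 'n' does match.
literal_hits_in_text = len(double.findall(r"line one\nline two"))

print(line_break_hits, literal_hits_in_cdata, literal_hits_in_text)
```

In this engine the single-backslash rule fires at every line break, including the ones between list tags, which is one way it could overlap with the list rules. It is worth checking in Studio whether the escaped form still segments at manual line breaks.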
