How to use regular expressions in XML file types to mark placeholders?

I created a custom XML file type for an XML file with embedded HTML content and would like to mark formatting strings enclosed in square brackets as placeholders.

Here's an excerpt:

<content:encoded><![CDATA[<p>This is [B]an[/B] example.</p>]]></content:encoded>

What kind of XPath query do I need to select all strings enclosed in square brackets?

  • Unfortunately you cannot combine more than two parser settings, so you need to define both "normal" tags and []-tags in one parser, using regex.


  • Thanks for confirming my suspicions. IMHO, it's a major design flaw that users need to create a plaintext-based HTML parser when they want to add custom placeholders to embedded HTML content.

  • it's a major design flaw

    I disagree. It's definitely something that would be a beneficial enhancement, but what looks a bit like BBCode inside HTML embedded in an XML file is hardly valid HTML, so it's not a flaw.

    If you don't care to see <strong> as bold in the editor, for example, and only wish to ensure the tags are handled, then adding the appropriate tags with regex isn't that difficult, as you don't need to create separate rules for each tag type.
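
    As a rough illustration of the "one rule for all tag types" idea, the sketch below uses a single generic pattern to find every bracket tag instead of one pattern per tag. The pattern, the assumption that tag names consist only of letters and digits, and the helper name find_bracket_tags are illustrative and not taken from Studio's settings; the pattern is only tried out here in Python, outside Studio.

```python
import re

# Assumption: a bracket tag is "[", an optional "/", a name made of
# letters/digits, and "]". Real content may need a broader pattern.
BRACKET_TAG = re.compile(r"\[/?[A-Za-z][A-Za-z0-9]*\]")

def find_bracket_tags(text):
    """Return every opening or closing bracket tag found in text."""
    return BRACKET_TAG.findall(text)

sample = "<p>This is [B]an[/B] example.</p>"
print(find_bracket_tags(sample))
```

    Bracketed text with spaces, such as [All kind of things], is deliberately not matched by this pattern, so it would need its own handling.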

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • I think you are expecting too much from the Studio parser. There is an XML parser to handle the XML content and an HTML parser to handle the embedded content. If you have custom requirements, you might have to find a custom solution. One simple way might be to turn the [B]stuff[/B] into tags. If you pre-process your files to arrive at something like this:

        <content><![CDATA[<p>This is <custom_content value="[B]an[/B]"/> example.</p>]]></content>

    (I used the regex (\[B\].*?\[\/B\]) for matching and <custom_content value="$1"/> for replacing.)

    How easy that is depends a bit on the content: Is it always [B] or can it be [All kind of things]? What is enclosed? If it's characters like <, ", ', & the content might need escaping.

    If it is as simple as in your sample, then the above will show up in the editor like this (screenshot not included):

    Daniel

    EDIT: I should add that if you go this route, you will have to post-process the files accordingly. Just to state the obvious.
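
    The pre- and post-processing round trip described above could be sketched roughly as follows. This is a minimal sketch in Python, assuming only [B]...[/B] pairs occur, that the placeholder comes back from translation unchanged, and that the bracketed text has to be escaped because it ends up inside a double-quoted attribute. The function names preprocess and postprocess are made up for illustration.

```python
import re
from xml.sax.saxutils import escape, unescape

# Extra entity needed because the matched text ends up inside a
# double-quoted attribute value.
QUOTE = {'"': "&quot;"}
UNQUOTE = {"&quot;": '"'}

# Same pattern as in the post: a [B]...[/B] pair, matched lazily.
BOLD_PAIR = re.compile(r"(\[B\].*?\[/B\])")
# Assumption: the placeholder survives translation unchanged.
PLACEHOLDER = re.compile(r'<custom_content value="(.*?)"/>')

def preprocess(text):
    """Wrap each [B]...[/B] run in a custom_content placeholder."""
    return BOLD_PAIR.sub(
        lambda m: '<custom_content value="%s"/>' % escape(m.group(1), QUOTE),
        text,
    )

def postprocess(text):
    """Turn the placeholders back into the original bracketed text."""
    return PLACEHOLDER.sub(lambda m: unescape(m.group(1), UNQUOTE), text)

sample = "<p>This is [B]an[/B] example.</p>"
print(preprocess(sample))
print(postprocess(preprocess(sample)))
```

    Escaping &, <, > and " on the way in and reversing it on the way out keeps the placeholder well-formed even when the bracketed text contains those characters.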


  • Thanks for your very helpful reply!!! I might actually have to go that route. I'm also considering post-processing the XLIFF file.

    It's definitely something that would be a beneficial enhancement, but the use of what looks a bit like bb code inside an html embedded inside an xml file is hardly valid html, so it's not a flaw.

    You have a point there, but, IMHO, it doesn't make sense that the embedded HTML content processor doesn't offer the same options as the regular HTML file type. You're basically forcing Studio users who want to create custom placeholders in embedded HTML content to build a makeshift HTML parser, when Studio already comes with one.

    Also, it should be possible to select strings inside of tags using XPath functions such as text(), but Studio doesn't seem to support them.

  • IMHO, it doesn't make sense that the embedded HTML content processor doesn't offer the same options as the regular HTML file type.

    As Daniel already pointed out, you are asking far too much of the tool. The HTML file type can use an embedded processor to handle embedded content. But seeing as you are actually handling XML and not HTML, you are asking it to use an embedded content processor inside an embedded content processor. Quite a big ask, and nothing to do with making sense.

    Also, it should be possible to select strings inside of tags using XPath functions such as text(), but Studio doesn't seem to support them.

    text() would be used to select text between tags and not inside tags.  And it certainly does work.

    Paul Filkin | RWS Group


  • text() would be used to select text between tags and not inside tags.  And it certainly does work.

    Can you please provide an example of the text() syntax that SDL Studio supports?

  • Can you please provide an example of the text() syntax that SDL Studio supports?

    It supports the correct way to use it in XPath:

    //*[contains(text(), 'toast')]

    In this case, it matches any segment containing the word "toast".

    //VariableAssignment[answer/text()='42']/Value

    This extracts the content of the <Value> element ONLY if the value of the <answer> element is 42, using a conditional XPath expression.
    [Screenshot: Trados Studio showing an XML code snippet with a VariableAssignment element containing nested elements for Name, Value, ValueObject, Style, and PublishPrompt.]
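
    To try the two expressions outside Studio, something like the following Python sketch may help. Python's standard xml.etree.ElementTree only supports a subset of XPath (no contains() or text()), so the sketch uses the closest equivalents rather than the exact expressions above. The sample document and the helper names are hypothetical; only the element names come from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the element names come from the thread.
SAMPLE = """
<Variables>
  <VariableAssignment>
    <Name>breakfast</Name>
    <answer>42</answer>
    <Value>toast and jam</Value>
  </VariableAssignment>
  <VariableAssignment>
    <Name>lunch</Name>
    <answer>7</answer>
    <Value>soup</Value>
  </VariableAssignment>
</Variables>
""".strip()

root = ET.fromstring(SAMPLE)

def elements_containing(root, word):
    """Rough stand-in for //*[contains(text(), word)]."""
    return [el for el in root.iter() if el.text and word in el.text]

def values_where_answer_is(root, answer):
    """Stand-in for //VariableAssignment[answer/text()='...']/Value."""
    path = ".//VariableAssignment[answer='%s']/Value" % answer
    return [el.text for el in root.findall(path)]

print([el.tag for el in elements_containing(root, "toast")])
print(values_where_answer_is(root, "42"))
```

    A full XPath 1.0 engine would accept the original expressions directly; the ElementTree predicate [answer='42'] is used here only because it is the nearest form the standard library understands.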

    Paul Filkin | RWS Group

