XPath parser to exclude part of text from content

Hello,

I am trying to figure out how to exclude part of a text using XPath.

I have sample text in a specific structure:

<main>
  <section tag="1">
    <sub-section id="a">sample_content</sub-section>
    <sub-section id="b">sample_content</sub-section>
  </section>
  <section tag="2">
    <sub-section id="c">sample_content</sub-section>
    <sub-section id="d">[value_text] sample content</sub-section>
  </section>
</main>

I've tried to get the text using XPath:

  • //section[@tag='2']/sub-section[@id='d']

However, this is not enough to exclude "sample content" from this line.

Result is:

[value_text] sample content.

My goal is:

value_text

I looked for a solution on the internet (including this website) but didn't find one.

I know that Trados Studio only uses XPath 1.0, which doesn't allow mixing XPath with regular expressions. Also, I couldn't find any XPath functions that help with my problem.

Do you have any ideas on how to handle this problem?

I use Trados Studio 2019 SR2. I created an XML (embedded content) file type.

Kind Regards,

Adrian

  • It is possible, but not with XPath alone, at least not XPath 1.0. First of all you create your parser rule, exactly as you have done. Then you add some structure context, like this for example:

    Screenshot of Trados Studio interface showing a red arrow pointing to the 'Paragraph' tab in the menu.

    Then activate the embedded content processor and create a rule using regex with one of the ways available.  I used the "Defined by document structure information" as I added the "Paragraph" context above:

    Screenshot of Trados Studio settings window with red arrows highlighting the 'Embedded Content Processor' and 'Defined by document structure information' options.

    I based this on your specific example, but it might give you an idea if your actual files are a little different.  This then gets me the following:

    Screenshot of Trados Studio preview window displaying the successful extraction of 'value_text' elements from an XML document.

    Which seems to be what you're after.
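The two-stage idea in this reply (XPath selects the element, then a regular expression isolates the bracketed text) can be sketched outside Studio. This is only an illustration in Python, not how Studio implements it: `xml.etree.ElementTree` supports a subset of XPath, the sample is the one from the question with the closing tags written as `</sub-section>`, and the pattern `\[([^\]]*)\]` is an assumption, since the expression used in the screenshot is not quoted in the thread.

```python
import re
import xml.etree.ElementTree as ET

# Sample from the question, with closing tags written as </sub-section>.
SAMPLE = """<main>
  <section tag="1">
    <sub-section id="a">sample_content</sub-section>
    <sub-section id="b">sample_content</sub-section>
  </section>
  <section tag="2">
    <sub-section id="c">sample_content</sub-section>
    <sub-section id="d">[value_text] sample content</sub-section>
  </section>
</main>"""

root = ET.fromstring(SAMPLE)

# Stage 1: the path expression selects the whole element, not a part of its text.
node = root.find(".//section[@tag='2']/sub-section[@id='d']")
full_text = node.text

# Stage 2: a regex (assumed pattern) isolates the leading bracketed text.
match = re.match(r"\[([^\]]*)\]", full_text)
value = match.group(1) if match else None

print(full_text)  # [value_text] sample content
print(value)      # value_text
```

The path step alone returns the full element text; only the regex step narrows it down to `value_text`, which mirrors the split between the parser rule and the embedded content rule.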

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

    Generated Image Alt-Text
    [edited by: Trados AI at 4:31 AM (GMT 0) on 5 Mar 2024]
  • Hello Paul,

    Thank you for your answer. It is very helpful. You are a very experienced user.

    I've started to test it; however, it seems to work only when "[...]" appears at the beginning of the content.

    What about a more complicated sentence that contains several bracketed texts? Like in this example:

    <sub-section id="d">[value_text] sample content [value_text] sample content [value_text]</sub-section>

    Should I add a regex so that the text between ']' and '[' is not translated? Is it the proper way to add every possible case separately? Do you know a better way to handle this problem?

     

    Kind Regards,
    Adrian

  • however, it seems to work only when "[...]" appears at the beginning of the content.

    Correct... that's because I based the expression on your simple example only.

    What about a more complicated sentence that contains several bracketed texts? Like in this example:

    <sub-section id="d">[value_text] sample content [value_text] sample content [value_text]</sub-section>

    Should I add a regex so that the text between ']' and '[' is not translated? Is it the proper way to add every possible case separately?

    This starts to get tricky for several reasons:

    1. you would really need multiple rules, one for each case
    2. even with multiple rules it would still be hard... is this possible for example?
      <sub-section id="d">[value_text] sample content [value_text] sample content [value_text] sample content [value_text] sample content [value_text]</sub-section>
    3. then you also have to deal with segmentation, because how would this be translated without proper segmentation?
      [value_text][value_text][value_text][value_text][value_text]

    Do you know a better way to handle this problem?

    Tell us more about the file as a whole.

    • does the rest of the file need to be translated?
    • are the texts in the square brackets consistent and repeat themselves?

    Without the whole story it's very difficult to do what you are suggesting or to try and offer a sensible solution.

    Paul Filkin | RWS Group

  • Paul,

     

    I have a very big .xml file that contains a lot of cases like the one in the example.
    I figured out how to find only the required content using XPath, as in my first post.

    The rest of the file shouldn't be translated, only specific cases. For example: "Please translate ONLY the text in brackets (the rest of the text should be ignored by Studio) located in section tag="5" and sub-section id="c"." The texts are random and inconsistent. They don't repeat themselves.

    This example should make it clearer:

    <section tag="2">
    <sub-section id="e">Lorem ipsum [text 1] dolor sit amet, consectetur adipiscing elit. [text 2] Integer id ullamcorper magna,...</sub-section>
    </section>

    My goal is to get 2 segments in Studio (in this case):

    text 1

    text 2

    I am trying to automate this for every case in the file. There are only a few patterns that can appear in the content:

    [text] normal text
    normal text [text]
    normal text [text] normal text
    normal text
    [text] normal text [text]
    [text] normal text [text] normal text [text]

    or

    [text][text][text]

    Do I need to create a separate regex for each of them?
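For what it's worth, the patterns listed above do not necessarily need one expression each. A minimal sketch in Python, assuming the brackets are never nested and are always balanced; the pattern and the function name `bracketed_texts` are illustrative and not from the thread, and Studio's embedded content rules may need the expression in a different form:

```python
import re
from typing import List

# One bracketed chunk: "[", then anything that is not a bracket, then "]".
# The capture group keeps only the text between the brackets.
BRACKETED = re.compile(r"\[([^\[\]]*)\]")

def bracketed_texts(content: str) -> List[str]:
    """Return every bracketed text in order of appearance."""
    return BRACKETED.findall(content)

# The patterns listed in the post.
cases = [
    "[text] normal text",
    "normal text [text]",
    "normal text [text] normal text",
    "normal text",
    "[text] normal text [text]",
    "[text] normal text [text] normal text [text]",
    "[text][text][text]",
]

for case in cases:
    print(repr(case), "->", bracketed_texts(case))
```

Because the expression describes a single bracketed chunk and is applied repeatedly, the position and number of chunks in the line do not matter.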

     

    This is not an actual translation; it is a kind of localization skills test that shows whether a specific problem can be solved.
    I am trying to improve my Studio skills. XPath and regular expressions seem to be at the core of good localization knowledge.

     

    Kind Regards,
    Adrian 

  • In this case you'd be better off doing something like this:

    1. create your filetype with the XPath expression previously agreed
    2. use this expression to create a placeholder instead of the tag pair:
      (?<!\[)\b[\w\s]+\b(?![\)])

    This will select everything apart from the text in the brackets... like this for example where I even used a really extreme example:

    Screenshot of Trados Studio showing a series of segmented texts with some parts highlighted in purple and orange, indicating the use of a placeholder in place of tag pairs.

    And if you then set the embedded content rule to "exclude" you can even get the segmentation:

    Screenshot of Trados Studio displaying a table with multiple rows labeled 'value_text' and corresponding highlighted segments, demonstrating the exclusion of embedded content in segmentation.

    Looks like it's what you needed?
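To see what the expression in step 2 selects, it can be tried outside Studio. The sketch below uses Python's `re` module on the example line from earlier in the thread; Studio uses its own regex engine, so treat this as an approximation rather than a reproduction of the screenshots.

```python
import re

# Expression quoted in the reply, copied unchanged.
PLACEHOLDER = re.compile(r"(?<!\[)\b[\w\s]+\b(?![\)])")

content = "[value_text] sample content [value_text] sample content [value_text]"

# Spans the expression matches, i.e. the text that would become placeholders.
matched = PLACEHOLDER.findall(content)

# What is left once those spans are taken out: only the bracketed texts.
remaining = PLACEHOLDER.sub("", content)

print(matched)
print(remaining)
```

With this input the expression matches both `sample content` runs and leaves the three `[value_text]` chunks. In Python's `re`, the same expression also matches part of a bracketed text that contains a space (for `[text 1] dolor` it matches ` 1` as well as `dolor`), so it is worth testing against realistic content.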

    Paul Filkin | RWS Group

  • Paul,

    That's almost it. To make this easier, I'm going to paste some real examples here. The text is totally random and doesn't make much sense. Getting the text from the brackets is the goal:

    I have created an XML (Legacy embedded content) file type.

    Here are the parser rules:

    And the embedded content settings for Paragraph:

    Everything seems to be correct. However, this expression doesn't recognise digits and non-word characters.

    Result:

    Adding \d to this regex should solve part of the problem with missing digits.
    Dots and commas are a bigger problem for me. Is \W enough to handle them?
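A quick way to explore these questions is to test the character classes in isolation. The sketch below uses Python's `re` module, which may differ in details from the engine Studio uses; the sample text is made up for illustration, and the negated-class expression is an assumption, not something proposed in the thread.

```python
import re

# Made-up sample with a comma, a decimal number and a full stop.
text = "Lorem, ipsum 3.5 [text 1] dolor. [text 2]"

# Character class from the thread's expression: word characters and whitespace.
word_or_space = re.compile(r"[\w\s]+")

# Adding \W makes the class match every character, brackets included.
swallow_all = re.compile(r"[\w\s\W]+")

# Assumed alternative: any run without square brackets that starts at the
# beginning of the text or directly after a closing bracket.
outside_brackets = re.compile(r"(?:^|(?<=\]))[^\[\]]+")

runs = word_or_space.findall(text)
outside = outside_brackets.findall(text)

print(runs)
print(outside)
print(swallow_all.fullmatch(text) is not None)
```

In Python, `\w` already matches digits, so the runs break at the comma and the dots rather than at the numbers. Adding `\W` to the class lets it match the brackets too, which defeats the purpose, whereas excluding only the brackets keeps punctuation and digits inside the matched runs.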

    Kind regards,

    Adrian