XPath parser to exclude part of text from content

Adrian Wojewoda over 5 years ago

Hello,

I am trying to figure out how to exclude part of a text using XPath.

I have sample text in a specific structure:

<main>
<section tag="1">
<sub-section id="a">sample_content</sub-section>
<sub-section id="b">sample_content</sub-section>
</section>
<section tag="2">
<sub-section id="c">sample_content</sub-section>
<sub-section id="d">[value_text] sample content</sub-section>
</section>
</main>

I've tried to get the text using XPath:

//section[@tag='2']/sub-section[@id='d']

However, this is not enough to exclude "sample content" from the line.

The result is:

[value_text] sample content

My goal is:

value_text

I looked for a solution on the internet (this website too) but didn't find one.

I know that Trados Studio only uses XPath 1.0, which doesn't allow mixing XPath with regular expressions. Also, I couldn't find any XPath functions useful for my problem.

Do you have any ideas how to handle this problem?

I use Trados Studio 2019 SR2. I created an XML file type (embedded content).

Kind Regards,

Adrian

+1 Paul Filkin over 5 years ago

Adrian Wojewoda

It is possible, but not with XPath alone, at least not XPath 1.0. First of all you create your parser rule, exactly as you have done. Then you add some structure context like this for example:

Then activate the embedded content processor and create a rule using regex with one of the ways available. I used the "Defined by document structure information" as I added the "Paragraph" context above:

I based this on your specific example, but it might give you an idea if your actual files are a little different. This then gets me the following:

Which seems to be what you're after.
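
A minimal sketch of the same two-step idea outside Studio, assuming Python's standard library as a stand-in for the parser rule and the embedded content rule. The pattern `LEADING_BRACKET` is hypothetical (the expression used in the rule is not shown in the text), and the sample is the one from the question with its closing tags written out:

```python
import re
import xml.etree.ElementTree as ET

# Sample from the question, with closing tags written as </sub-section>.
XML = """
<main>
  <section tag="1">
    <sub-section id="a">sample_content</sub-section>
    <sub-section id="b">sample_content</sub-section>
  </section>
  <section tag="2">
    <sub-section id="c">sample_content</sub-section>
    <sub-section id="d">[value_text] sample content</sub-section>
  </section>
</main>
"""

root = ET.fromstring(XML.strip())

# Step 1: the path expression only selects the element; it cannot split its text.
# (ElementTree supports a subset of XPath, which is enough for this selection.)
node = root.find(".//section[@tag='2']/sub-section[@id='d']")
full_text = node.text

# Step 2: a regex (hypothetical pattern) picks out a bracketed text at the start.
LEADING_BRACKET = re.compile(r"^\[([^\]]*)\]")
match = LEADING_BRACKET.match(full_text)
value = match.group(1) if match else None

print(full_text)
print(value)
```

The path step returns the whole text of the element; only the regex step isolates `value_text`. Because this pattern is anchored with `^`, it only finds a bracket at the very start of the content.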

Paul Filkin | RWS

Design your own training!
You've done the courses and still need to go a little further, or still not clear?
Tell us what you need in our Community Solutions Hub

[edited by: Trados AI at 4:31 AM (GMT 0) on 5 Mar 2024]
0 Adrian Wojewoda over 5 years ago in reply to Paul Filkin

Hello Paul,

Thank you for your answer. It is very helpful. You are a very experienced user.

I've started to test it; however, it seems to work only when "[...]" appears at the beginning of the content.

What about a more complicated sentence that contains more bracketed text, as in this example?

<sub-section id="d">[value_text] sample content [value_text] sample content [value_text]</sub-section>

Should I add a RegEx to leave the text between ']' and '[' untranslated? Is it the proper way to add every possible case separately? Do you know a better way to handle this problem?

Kind Regards,
Adrian
0 Paul Filkin over 5 years ago in reply to Adrian Wojewoda
Adrian Wojewoda

Adrian Wojewoda said:
however, it seems to work only when "[...]" appears at the beginning of the content.

Correct... that's because I based the expression on your simple example only.

Adrian Wojewoda said:
What about a more complicated sentence that contains more bracketed text, as in this example?

<sub-section id="d">[value_text] sample content [value_text] sample content [value_text]</sub-section>

Should I add a RegEx to leave the text between ']' and '[' untranslated? Is it the proper way to add every possible case separately?

This starts to get tricky for several reasons:

you would really need multiple rules for each case

even with multiple rules it would still be hard... is this possible for example?
<sub-section id="d">[value_text] sample content [value_text] sample content [value_text] sample content [value_text] sample content [value_text]</sub-section>

then you also have to deal with segmentation because how would this be to translate without proper segmentation?
[value_text][value_text][value_text][value_text][value_text]

Adrian Wojewoda said:
Do you know a better way to handle this problem?

Tell us more about the file as a whole.

does the rest of the file need to be translated?

are the texts in the square brackets consistent and repeat themselves?

Without the whole story it's very difficult to do what you are suggesting or to try and offer a sensible solution.

Paul Filkin | RWS

0 Adrian Wojewoda over 5 years ago in reply to Paul Filkin

Paul,

I have a very big .xml file that contains a lot of cases like the one in the example.
I figured out how to find only the required content using XPath, as in my first post.

The rest of the file shouldn't be translated, only specific cases. For example: "Please translate ONLY the text in brackets (the rest of the text should be ignored by Studio) located in section tag="5" and sub-section id="c"." The text is random and inconsistent, and it doesn't repeat itself.

This example should make it clearer:

<section tag="2">
<sub-section id="e">Lorem ipsum [text 1] dolor sit amet, consectetur adipiscing elit. [text 2] Integer id ullamcorper magna,...</sub-section>
</section>

My goal is to get (in this case) 2 segments in Studio:

text 1

text 2

I am trying to automate this for every case in the file. Only a few cases can appear in the content:

[text] normal text
normal text [text]
normal text [text] normal text
normal text
[text] normal text [text]
[text] normal text [text] normal text [text]

or

[text][text][text]

Do I need to create a separate RegEx for each of these cases?

This is not an actual translation. It is a kind of localization skills test that shows whether it is possible to resolve a specific problem.
I am trying to improve my Studio skills. XPath and regular expressions seem to be at the core of good localization knowledge.

Kind Regards,
Adrian
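
A quick way to check the listed cases against one expression, sketched in Python rather than in Studio. The pattern `BRACKETED` is an assumption, not something taken from a Studio rule; it captures whatever sits between one `[` and the next `]`:

```python
import re

# Hypothetical pattern: one bracket pair, capturing the text inside it.
BRACKETED = re.compile(r"\[([^\[\]]*)\]")

# The cases listed in the post.
cases = [
    "[text] normal text",
    "normal text [text]",
    "normal text [text] normal text",
    "normal text",
    "[text] normal text [text]",
    "[text] normal text [text] normal text [text]",
    "[text][text][text]",
]

# findall returns the captured group for every bracket pair in the line.
results = {case: BRACKETED.findall(case) for case in cases}
for case, found in results.items():
    print(f"{case!r} -> {found}")
```

In this sketch a single global pattern covers every listed case, because the position of the brackets in the line does not matter to `findall`. Whether the same holds inside an embedded content rule depends on how Studio applies the expression.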

+1 Paul Filkin over 5 years ago in reply to Adrian Wojewoda
Adrian Wojewoda

In this case you'd be better off doing something like this:

create your filetype with the xpath expression previously agreed

Use this expression to create a placeholder instead of the tag pair:
(?<!\[)\b[\w\s]+\b(?![\)])

This will select everything apart from the text in the brackets... like this for example where I even used a really extreme example:

And if you then set the embedded content rule to "exclude" you can even get the segmentation:

Looks like it's what you needed?
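
To see what the expression selects, it can be run over the cases listed earlier in the thread. This is only a sketch using Python's `re` module; Studio uses the .NET regex engine, which should treat these constructs the same way for plain ASCII text, but that is an assumption:

```python
import re

# Expression from the reply, copied unchanged.
OUTSIDE_BRACKETS = re.compile(r"(?<!\[)\b[\w\s]+\b(?![\)])")

# Cases from earlier in the thread (single-word bracket text).
cases = [
    "[text] normal text",
    "normal text [text]",
    "normal text [text] normal text",
    "normal text",
    "[text] normal text [text]",
    "[text] normal text [text] normal text [text]",
    "[text][text][text]",
]

# Each match is a span that the rule would turn into a placeholder,
# leaving only the bracketed text as translatable content.
selected = {case: OUTSIDE_BRACKETS.findall(case) for case in cases}
for case, spans in selected.items():
    print(f"{case!r} -> {spans}")
```

For these cases every match is a run of "normal text" and the bracketed words are left alone, which is the inverse selection the rule relies on.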

Paul Filkin | RWS

[edited by: Trados AI at 4:31 AM (GMT 0) on 5 Mar 2024]
0 Adrian Wojewoda over 5 years ago in reply to Paul Filkin

Paul,

That's almost it. To make this easier, I am going to paste some real examples here. The text is totally random and doesn't make much sense. Getting the text from the brackets is the goal:

I have created an XML file type (Legacy embedded content).

Here are the parsers:

And the embedded content for Paragraph:

Everything seems to be correct; however, this formula doesn't recognise digits and non-word characters.

Result:

Adding \d to this RegEx should resolve part of the missing digits.
Dots and commas are a bigger problem for me. Is \W enough to handle them?

Kind regards,

Adrian
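
A sketch of one possible direction, in Python's `re` rather than Studio, so treat the behaviour inside an embedded content rule as unverified. Two points are worth knowing first: `\w` already matches digits, so adding `\d` changes nothing, and `\W` matches `[` and `]` as well, so adding it would pull the brackets into the selection. The hypothetical pattern `OUTSIDE` below avoids character classes for "normal text" altogether and instead takes any run of non-bracket characters that starts at the beginning of the text or right after a `]`:

```python
import re

SAMPLE = ("Lorem ipsum [text 1] dolor sit amet, consectetur adipiscing elit. "
          "[text 2] Integer id ullamcorper magna,...")

# Expression from the thread, copied unchanged, for comparison.
THREAD_RULE = re.compile(r"(?<!\[)\b[\w\s]+\b(?![\)])")

# Hypothetical alternative: a run of non-bracket characters that begins at
# the start of the text or directly after "]" and ends before "[" or at the end.
OUTSIDE = re.compile(r"(?:^|(?<=\]))[^\[\]]+(?=\[|$)")

thread_spans = THREAD_RULE.findall(SAMPLE)
outside_spans = OUTSIDE.findall(SAMPLE)

# Removing the outside spans leaves only the bracketed pieces.
leftover = OUTSIDE.sub("", SAMPLE)

print(thread_spans)
print(outside_spans)
print(leftover)
```

In Python, at least, the thread's expression never includes the commas and dots in its matches, and with multi-word bracket text such as `[text 1]` it also picks up part of the bracket content. The alternative pattern is indifferent to punctuation because it only looks at where the brackets are.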