How to extract text between a specific tag pair from XML files

We're trying to process big batches of XML files created using Arbortext.

We can process them easily using Trados (2022 and 2024), but we can't find a way to extract the "preview texts" enclosed in the following tags:

Screenshot showing XML code with a 'Pub_previewtext' tag containing the text 'Installare il distributore tappi'.

These tags are often embedded in href or xref tags, but we couldn't identify a specific root element or path.

Every attempt to create a custom file type has failed so far.

Any ideas on how to solve this? 



Generated Image Alt-Text
[edited by: RWS Community AI at 4:35 PM (GMT 0) on 18 Nov 2024]

    This is a bit tricky, as the standard XML file types will not extract text from these tags when they are embedded within attributes or nested in an inconsistent structure. If you don't have a developer in-house who could create a custom XML file type for you using the Filetype API, you might be able to handle the files by preprocessing them with XSLT or a script to isolate the desired content, then using a similar post-processing step afterwards to put the files back together again.
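As a rough illustration of the scripting route, here is a minimal Python sketch of the extract and re-insert steps. It assumes the element is literally named `Pub_previewtext` (taken from the screenshot), has no namespace, and contains plain text only; the function names, the sample XML, and the English translation are made up for the example. The standard-library parser used here drops comments, processing instructions, and the DOCTYPE on a round trip, so real Arbortext files would probably need a more careful parser setup.

```python
import xml.etree.ElementTree as ET

TAG = "Pub_previewtext"  # assumed element name, taken from the screenshot


def extract_preview_texts(xml_text):
    """Return the text of every TAG element, wherever it sits in the tree."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so no fixed root element or path is needed.
    return [(el.text or "").strip() for el in root.iter(TAG)]


def replace_preview_texts(xml_text, translations):
    """Write translations back in document order and return the new XML."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter(TAG))
    if len(elements) != len(translations):
        raise ValueError("translation count does not match element count")
    for el, new_text in zip(elements, translations):
        el.text = new_text
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: the preview text nested inside an xref element.
sample = (
    "<topic><p>See <xref href='t1.xml'>"
    "<Pub_previewtext>Installare il distributore tappi</Pub_previewtext>"
    "</xref></p></topic>"
)

texts = extract_preview_texts(sample)
print(texts)

updated = replace_preview_texts(sample, ["Install the cap dispenser"])
print(updated)
```

The extracted strings could be written to a simple file that Trados already handles, translated there, and then fed back through the second function. Matching by document order is the simplest pairing; if the files change between the two steps, a stable ID per element would be safer.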

    Another option might be to handle all the content using the Embedded Content Plain Text processor. This would mean writing regex rules for everything, though, so it is really only helpful if your files are relatively simple.
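To give an idea of what a regex rule for this tag pair might look like, the Python sketch below tests one candidate pattern against a sample string. The element name `Pub_previewtext` comes from the screenshot and the sample text is invented; this only checks the pattern itself and says nothing about how the rule would be entered in the processor settings, where the regex flavour may differ in details.

```python
import re

# Assumed element name from the screenshot; attributes on the opening tag are allowed.
PREVIEW_RE = re.compile(
    r"<Pub_previewtext\b[^>]*>(.*?)</Pub_previewtext>",
    re.DOTALL,  # let the text span line breaks
)

# Hypothetical sample with one preview text and one unrelated element.
sample = (
    '<xref href="t1.xml">'
    "<Pub_previewtext>Installare il distributore tappi</Pub_previewtext>"
    "</xref>\n"
    "<p>Not a preview text</p>"
)

matches = PREVIEW_RE.findall(sample)
print(matches)
```

The non-greedy `(.*?)` stops at the first closing tag, so several preview texts on one line are matched separately rather than swallowed as one.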

    You could also have a play with the Multilingual XML filetype, as this has a feature to handle an XML file as a monolingual file. That way you get the benefit of both XPath processing AND embedded content processing. The downside with this approach is that its XPath handling is not as flexible as that of the XML filetype in Studio, but that one has no embedded content processing.

    So much depends on the files you have and which approach is likely to work best.

    Paul Filkin | RWS
