WordPress XML: cleaning up with a regex, but how do I apply it?

Hi Community,

I have received a WordPress XML file and am creating a custom XML file type for it.

It looks good so far and the parsing works, but the file contains embedded HTML content that is hard to clean up.

Here is a sample of the content:

************************************
<strong> Mise en œuvre</strong>
L'ensemble des notions théoriques seront illustrées par des cas concrets sur le logiciel elec calcTm.
[/av_textblock]

[av_heading tag='h3' padding='10' heading='Dates' color='' style='blockquote modern-quote' custom_font='' size='' subheading_active='' subheading_size='15' custom_class='' admin_preview_bg='' av-desktop-hide='' av-medium-hide='' av-small-hide='' av-mini-hide='' av-medium-font-size-title='' av-small-font-size-title='' av-mini-font-size-title='' av-medium-font-size='' av-small-font-size='' av-mini-font-size='' margin=''][/av_heading]

[av_textblock size='16' font_color='' color='' av-medium-font-size='' av-small-font-size='' av-mini-font-size='' av_uid='av-50tak4l' admin_preview_bg='']
<span style="color: #000000;"><strong>Les 24, 25 et 26 novembre de 9h00 à 12h30 et de 14h00 à 17h30</strong></span>
[/av_textblock]

[av_heading tag='h3' padding='10' heading='Programme' color='' style='blockquote modern-quote' custom_font='' size='' subheading_active='' subheading_size='15' custom_class='' admin_preview_bg='' av-desktop-hide='' av-medium-hide='' av-small-hide='' av-mini-hide='' av-medium-font-size-title='' av-small-font-size-title='' av-mini-font-size-title='' av-medium-font-size='' av-small-font-size='' av-mini-font-size='' margin=''][/av_heading]

[av_textblock size='16' font_color='' color='' av-medium-font-size='' av-small-font-size='' av-mini-font-size='' av_uid='av-juva8v6m' admin_preview_bg='']

********************************

I am a newbie to regex, but with testing I came up with the below to filter out the [av...] bits (works in RegExr):

\[[a-z\s\S]+\]

(Screenshot: Trados Studio error message displaying issues with HTML embedded content in an XML file.)

However, I cannot seem to add this to Studio. Do I have to add it to the HTML embedded content processor?
Studio complains each time I add it here (something is missing, I have to add an attribute, etc.):
(Screenshot: Trados Studio Project File Type Settings showing HTML 5 Parser configurations.)

Can you please advise? I have attached the source file for reference.

Thank you!
Greta

export-tsi-page-octobre-2020.xml



  • As usual, WordPress does a really poor job of exporting even XML for translation!

    If you want to handle this in the file type, you may have to define more rules. Set it up like this, for example:

    (Screenshot: Trados Studio options menu showing Embedded Content Processor settings with a defined rule for XML content.)

    Then use the text delimited ECP like this (I only added one more rule):

    (Screenshot: Trados Studio options menu displaying Text Delimited ECP settings with a list of added rules for text processing.)

    And now you have this sort of thing:

    (Screenshot: Trados Studio preview window comparing source and target text with highlighted translation differences.)

    Alternatively, and this may be a better solution for you, stick to your current approach but then use the SDL Data Protection Suite to handle the stuff in the square brackets, like this:

    (Screenshot: Trados Studio batch processing settings window with a rule set up for SDL Data Protection Suite to handle square brackets.)

    Now you have this:

    (Screenshot: Trados Studio preview window showing a detailed view of translated text with formatting tags and highlighted sections.)

    This may be a bit easier to handle. You just need to remember to remove the protection before you save the target file.
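
Whichever route is used, it may help to check the bracket pattern outside Studio first. The following is only a minimal Python sketch, not what Studio does internally: the sample string is shortened from the export in the question, and the tighter pattern `\[/?av_[^\]]*\]` and the helper `strip_shortcodes` are assumptions made for illustration. Studio's regex flavour may also differ from Python's in details. The sketch shows that the pattern from the question matches everything from the first `[` to the last `]` in one go, translatable text included, while a pattern that cannot cross a `]` matches one tag at a time.

```python
import re

# Sample shortened from the export in the question (most attributes omitted).
sample = (
    "[av_textblock size='16' av_uid='av-50tak4l']\n"
    "<strong>Les 24, 25 et 26 novembre</strong>\n"
    "[/av_textblock]\n"
    "[av_heading tag='h3' heading='Programme'][/av_heading]"
)

# Pattern from the question. \s\S together match any character, including
# ']' and newlines, and '+' is greedy, so a single match runs from the
# first '[' to the last ']'.
greedy = re.compile(r"\[[a-z\s\S]+\]")

# Assumed alternative: one match per [av_...] or [/av_...] tag,
# because [^\]]* can never cross a closing bracket.
shortcode = re.compile(r"\[/?av_[^\]]*\]")


def strip_shortcodes(text: str) -> str:
    """Remove each shortcode tag and keep the content between the tags."""
    return shortcode.sub("", text)


print(len(greedy.findall(sample)))      # how many matches the original pattern finds
print(shortcode.findall(sample))        # the individual tags
print(strip_shortcodes(sample).strip()) # what is left for translation
```

If the tighter pattern behaves as expected on a sample like this, the same idea (match a single bracketed tag, never more) is what a protection or tag rule needs.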

    Paul Filkin | RWS Group
