How to parse such a file?

<?xml version="1.0" encoding="UTF-8" ?>
<node_export created="Fri, 12 Feb 2016 15:51:53 +0100">
  <node>
    <vid>4519</vid>
    <uid>299</uid>
    <title>Fach- und Führungskräfte</title>
    <log></log>
    <status>0</status>
    <comment>0</comment>
    <promote>0</promote>
    <sticky>0</sticky>
    <vuuid>2d673cdd-0135-43e2-901d-171422842981</vuuid>
    <nid>4513</nid>
    <type>page</type>
    <language>de</language>
    <created>1454318704</created>
    <changed>1455284011</changed>
    <tnid>0</tnid>
    <translate>0</translate>
    <uuid>ed24dede-967e-4a58-9c3a-d6106baece24</uuid>
    <revision_timestamp>1455284011</revision_timestamp>
    <revision_uid>299</revision_uid>
    <body>
      <und _numeric_keys="1">
        <n0>
          <value><![CDATA[<h3>Einstieg in ein familiengeführtes und international tätiges Unternehmen &ndash; AL-KO macht&acute;s möglich</h3><p>AL-KO bietet als weltweit tätiges Unternehmen in den Geschäftsfeldern <strong>Fahrzeugtechnik, Garten + Hobby</strong> <strong>sowie</strong> <strong>Lufttechnik</strong> viele Möglichkeiten für Berufserfahrene.<br />Mit einer über 80jährigen Unternehmensgeschichte und rund 4.000 Mitarbeitern an mehr als 45 Standorten sind wir, die AL-KO KOBER GROUP, ein Arbeitgeber mit vielen Perspektiven.<br />Wir suchen nach kreativen Köpfen mit cleveren Ideen, die unser Unternehmen weiter vorantreiben.</p><p>Bewerben Sie sich als<strong> Fach- und Führungskraft</strong>. Auf folgenden Seiten können Sie sich über einen Einstieg informieren - machen Sie Ihren Weg bei uns!</p>]]></value>
          <summary></summary>
          <format>wysiwyg_text</format>
        </n0>
      </und>
    </body>
    <field_product_technical_data>
      <und _numeric_keys="1">
        <n0>
          <value><![CDATA[a:50:{s:8:"cell_0_0";s:4:"Typ ";s:8:"cell_0_1";s:8:"Gewicht ";s:8:"cell_0_2";s:8:"Art.Nr. ";s:8:"cell_0_3";s:7:"UVP €";s:8:"cell_1_0";s:4:"M20*";s:8:"cell_1_1";s:5:"32 kg";s:8:"cell_1_2";s:7:"1730367";s:8:"cell_1_3";s:8:"2.249,00";s:8:"cell_2_0";s:4:"S21*";s:8:"cell_2_1";s:5:"42 kg";s:8:"cell_2_2";s:7:"1730368";s:8:"cell_2_3";s:8:"2.349,00";s:8:"cell_3_0";s:4:"S22*";s:8:"cell_3_1";s:5:"42 kg";s:8:"cell_3_2";s:7:"1730369";s:8:"cell_3_3";s:8:"2.349,00";s:8:"cell_4_0";s:7:"TM400**";s:8:"cell_4_1";s:5:"67 kg";s:8:"cell_4_2";s:7:"1730287";s:8:"cell_4_3";s:8:"3.990,00";s:8:"cell_5_0";s:7:"TM410**";s:8:"cell_5_1";s:5:"77 kg";s:8:"cell_5_2";s:7:"1730288";s:8:"cell_5_3";s:8:"4.090,00";s:8:"cell_6_0";s:7:"TM420**";s:8:"cell_6_1";s:5:"77 kg";s:8:"cell_6_2";s:7:"1730289";s:8:"cell_6_3";s:8:"4.090,00";s:8:"cell_7_0";s:7:"TM401**";s:8:"cell_7_1";s:5:"74 kg";s:8:"cell_7_2";s:7:"1730238";s:8:"cell_7_3";s:8:"4.090,00";s:8:"cell_8_0";s:7:"TM402**";s:8:"cell_8_1";s:5:"74 kg";s:8:"cell_8_2";s:7:"1730054";s:8:"cell_8_3";s:8:"4.090,00";s:8:"cell_9_0";s:7:"TS411**";s:8:"cell_9_1";s:5:"84 kg";s:8:"cell_9_2";s:7:"1730237";s:8:"cell_9_3";s:8:"4.190,00";s:9:"cell_10_0";s:11:"TS412/421**";s:9:"cell_10_1";s:5:"84 kg";s:9:"cell_10_2";s:7:"1730233";s:9:"cell_10_3";s:8:"4.190,00";s:9:"cell_11_0";s:7:"TS422**";s:9:"cell_11_1";s:5:"84 kg";s:9:"cell_11_2";s:7:"1730049";s:9:"cell_11_3";s:8:"4.190,00";s:7:"rebuild";a:3:{s:10:"count_cols";i:4;s:10:"count_rows";i:12;s:7:"rebuild";s:13:"Rebuild Table";}s:6:"import";a:2:{s:45:"tablefield_csv_field_product_technical_data_0";s:0:"";s:38:"rebuild_field_product_technical_data_0";s:10:"Upload CSV";}}]]></value>
          <format type="NULL"></format>
          <tabledata _numeric_keys="1">
            <n0 _numeric_keys="1">
              <n0>Typ </n0>
              <n1>Gewicht </n1>
              <n2>Art.Nr. </n2>
              <n3>UVP €</n3>
            </n0>
            <n1 _numeric_keys="1">
              <n0>M20*</n0>
              <n1>32 kg</n1>
              <n2>1730367</n2>
              <n3>2.249,00</n3>
            </n1>
            <n2 _numeric_keys="1">
              <n0>S21*</n0>
              <n1>42 kg</n1>
              <n2>1730368</n2>
              <n3>2.349,00</n3>
            </n2>
            <n3 _numeric_keys="1">
              <n0>S22*</n0>
              <n1>42 kg</n1>
              <n2>1730369</n2>
              <n3>2.349,00</n3>
            </n3>
            <n4 _numeric_keys="1">
              <n0>TM400**</n0>
              <n1>67 kg</n1>
              <n2>1730287</n2>
              <n3>3.990,00</n3>
            </n4>
            <n5 _numeric_keys="1">
              <n0>TM410**</n0>
              <n1>77 kg</n1>
              <n2>1730288</n2>
              <n3>4.090,00</n3>
            </n5>
            <n6 _numeric_keys="1">
              <n0>TM420**</n0>
              <n1>77 kg</n1>
              <n2>1730289</n2>
              <n3>4.090,00</n3>
            </n6>
            <n7 _numeric_keys="1">
              <n0>TM401**</n0>
              <n1>74 kg</n1>
              <n2>1730238</n2>
              <n3>4.090,00</n3>
            </n7>
            <n8 _numeric_keys="1">
              <n0>TM402**</n0>
              <n1>74 kg</n1>
              <n2>1730054</n2>
              <n3>4.090,00</n3>
            </n8>
            <n9 _numeric_keys="1">
              <n0>TS411**</n0>
              <n1>84 kg</n1>
              <n2>1730237</n2>
              <n3>4.190,00</n3>
            </n9>
            <n10 _numeric_keys="1">
              <n0>TS412/421**</n0>
              <n1>84 kg</n1>
              <n2>1730233</n2>
              <n3>4.190,00</n3>
            </n10>
            <n11 _numeric_keys="1">
              <n0>TS422**</n0>
              <n1>84 kg</n1>
              <n2>1730049</n2>
              <n3>4.190,00</n3>
            </n11>
          </tabledata>
        </n0>
      </und>
    </field_product_technical_data>
  </node>
</node_export>


My parser rules are:
//* is not translatable
//value is always translatable
The HTML5 processor is used to parse the content in CDATA sections.

This works for most of the file, but there are parts like the crazy serialized CDATA section in field_product_technical_data, which would need to be parsed too. Text such as "Gewicht" should remain translatable, while the surrounding markers should not. Or should I use the legacy processor instead?
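
To make the problem concrete, here is a minimal Python sketch (outside Trados Studio) of what a `//value`-only rule picks up. The trimmed sample and the names `SAMPLE`, `html_like` and `serialized` are assumptions made for illustration. It only shows that the HTML body and the serialized table data arrive through the same element, so a single HTML processor sees both.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the export above (illustrative only).
SAMPLE = """<node_export>
  <node>
    <title>Fach- und Führungskräfte</title>
    <body><und><n0>
      <value><![CDATA[<p>Bewerben Sie sich als <strong>Fach- und Führungskraft</strong>.</p>]]></value>
    </n0></und></body>
    <field_product_technical_data><und><n0>
      <value><![CDATA[a:1:{s:8:"cell_0_1";s:8:"Gewicht ";}]]></value>
    </n0></und></field_product_technical_data>
  </node>
</node_export>"""

root = ET.fromstring(SAMPLE)

# Equivalent of "//value is always translatable": every <value> element.
values = [el.text for el in root.iter("value")]

# Rough split: HTML fragments start with "<", serialized arrays with "a:".
html_like = [v for v in values if v.lstrip().startswith("<")]
serialized = [v for v in values if v.lstrip().startswith("a:")]

print(len(html_like), "HTML value(s),", len(serialized), "serialized value(s)")
```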

_________________________________________________________

When asking for help here, please be as accurate as possible. Please always remember to give the exact version of product used and all possible error messages received. The better you describe your problem, the better help you will get.

Want to learn more about Trados Studio? Visit the Community Hub. Have a good idea to make Trados Studio better? Publish it here.

  • It's probably better to use legacy if you want to handle all the crazy serialized stuff. If you have to do a lot of this, the best solution is probably to extract these sections, process them as a different file type altogether, and then put them back. But you'd need a localization engineer to tackle that one, maybe with Perl or something similar.

    Or just have lots of regex rules to slowly cut out what you don't want. That's always going to be tricky though, and it's probably best to keep the expressions short and concise to avoid overlap issues.
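
As a rough idea of what the "extract and process separately" route could look like, here is a minimal Python sketch. It assumes the CDATA in `field_product_technical_data` is PHP `serialize()` output, which is what the `a:50:{s:8:"cell_0_0";...}` shape suggests. The function name `parse_value` and the trimmed `SERIALIZED` sample are hypothetical, and only the int, string and array types that appear in the sample are handled. Writing the translated cells back would also mean recomputing the byte lengths, which is not shown.

```python
def parse_value(data: bytes, pos: int = 0):
    """Parse one PHP-serialized value (int, string, or array) starting at pos.

    Returns (value, next_pos). String lengths are byte counts, so the
    input must be bytes rather than str.
    """
    kind = data[pos:pos + 1]
    if kind == b"i":                      # i:<int>;
        end = data.index(b";", pos)
        return int(data[pos + 2:end]), end + 1
    if kind == b"s":                      # s:<byte length>:"<text>";
        colon = data.index(b":", pos + 2)
        length = int(data[pos + 2:colon])
        start = colon + 2                 # skip :"
        text = data[start:start + length].decode("utf-8")
        return text, start + length + 2   # skip ";
    if kind == b"a":                      # a:<count>:{<key><value>...}
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon])
        pos = colon + 2                   # skip :{
        result = {}
        for _ in range(count):
            key, pos = parse_value(data, pos)
            value, pos = parse_value(data, pos)
            result[key] = value
        return result, pos + 1            # skip }
    raise ValueError(f"unsupported type {kind!r} at offset {pos}")


# Trimmed stand-in for the real CDATA content (illustrative only).
SERIALIZED = (
    'a:3:{s:8:"cell_0_0";s:4:"Typ ";s:8:"cell_0_3";s:7:"UVP €";'
    's:7:"rebuild";a:1:{s:10:"count_cols";i:4;}}'
)

parsed, _ = parse_value(SERIALIZED.encode("utf-8"))

# Only the cell_* entries carry translatable text; the rest is form state.
cells = {k: v for k, v in parsed.items() if k.startswith("cell_")}
print(cells)
```

Working on bytes matters here: "UVP €" is declared as `s:7` because the euro sign takes three bytes in UTF-8.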

    Maybe someone else here could offer some good advice.

    Paul

    Paul Filkin | RWS Group


  • Paul, it's done, thanks to you.
    I have added the following patterns:

    ";\w:\d:"\w+[_]\w+";\w:\d+:"
    ;\w+:\d+:"rebuild.+?}}
    \w:\d+:{\w:\d:"\w+[_]\w+";\w:\d+:"
    <br([ \/]+)?>
    <a[^>]*> </a>
    <img.+?>
    <span.*?> </span.*?>
    <p[^>]*> </p[^>]*>
    <script[^>]*> </script[^>]*>
    <[\w\s="-]+> </[\w\s="-]+>

    It looks like it worked for that file.
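
For anyone who wants to try the first three patterns outside Studio, here is a small Python sketch. It is only an approximation: Studio uses the .NET regex engine, while this uses Python's `re`, and the names `RULES`, `SAMPLE` and `remaining_text` as well as the trimmed sample are made up for illustration. Each match is replaced by a separator, and whatever is left over is the text that would stay translatable.

```python
import re

# The three serialized-data patterns from the list above, copied verbatim.
RULES = [
    r'";\w:\d:"\w+[_]\w+";\w:\d+:"',
    r';\w+:\d+:"rebuild.+?}}',
    r'\w:\d+:{\w:\d:"\w+[_]\w+";\w:\d+:"',
]

# Trimmed stand-in for the real CDATA content (illustrative only).
SAMPLE = (
    'a:3:{s:8:"cell_0_0";s:4:"Typ ";s:8:"cell_0_1";s:8:"Gewicht ";'
    's:7:"rebuild";a:1:{s:10:"count_cols";i:2;}}'
)

SEP = "\x00"  # separator that cannot occur in the sample text


def remaining_text(text, rules):
    """Replace every rule match with SEP and return the leftover pieces."""
    for rule in rules:
        text = re.sub(rule, SEP, text)
    return [part for part in text.split(SEP) if part]


segments = remaining_text(SAMPLE, RULES)
print(segments)
```

On this trimmed sample the last cell comes back as `Gewicht "` with the closing quote still attached, because the `rebuild` pattern starts matching at the `;` rather than at the `"` before it. It may be worth checking whether the last cell in the real file shows the same.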
