How to parse HTML source text to extract raw text for machine translation and reconstruct the HTML structure around the translations?

Context:
I have developed a custom translation engine (CustomTranslationEngine) that is hosted as an Azure Functions app and integrated with the Trados platform, and it is working fine. I can create translation projects and use CustomTranslationEngine seamlessly. The documentation for the translation endpoint is at https://sdl-language-tech.stoplight.io/docs/addonapi/f1f3a0b82647d-translate

The Trados platform supports uploading various file formats, such as PDF, XML, CSV, PPT, and many more. Before sending the HTTP request to my custom translation engine, Trados builds an HTML tree structure from the uploaded file, whatever its format (PDF, PPT, XML, and so on). My CustomTranslationEngine on Azure Functions receives the text to translate as a list of HTML strings in the body of an HTTP POST request, as shown below:

[
"<div id='1'><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'>Demo 2</span></div>",
"<div id='2'><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'>This is not just for testing, this is a real demo</span></div>",
"<div id='3'><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'>Please be aware that your translated content can contain </span><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'>formatting</span></span><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'> </span></span><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'>and </span><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'>sometimes,</span></span><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'> </span></span><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'>this may have to be applied  </span><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'>somewhere </span><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'>else</span></span><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'> </span></span><span id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'>in the target text!</span></div>"
]

As you can see, the body is a list of strings, and each element is a string of HTML content. The original plain text is wrapped in div and span tags with unique ids to preserve the text's formatting, location, etc. from the uploaded file. I believe this is done so that the translations can be written back to the output file with the same formatting and text location as in the source file. For example, if the uploaded file was a PDF, its content is converted to an HTML structure that preserves the location and formatting of the original text, and the expected output is a PDF with the translated text in the original formatting and location.
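To make the extraction step concrete, here is a minimal sketch of pulling the plain text out of one such segment. It uses Python's standard html.parser module instead of Beautiful Soup so that it runs without dependencies; the names TextExtractor and extract_text and the short ids are hypothetical and only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of one HTML segment in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def extract_text(segment_html: str) -> str:
    """Return the plain text of a segment, with all tags dropped."""
    parser = TextExtractor()
    parser.feed(segment_html)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


# Shortened versions of the payload above, with made-up ids.
segments = [
    "<div id='1'><span id='a'>Demo 2</span></div>",
    "<div id='3'><span id='b'>Please be aware that your translated content "
    "can contain </span><span id='c'><span id='d'>formatting</span></span></div>",
]

for segment in segments:
    print(extract_text(segment))
```

This gives the raw sentence for the model, but it throws away which span each piece of text came from, which is exactly the information needed to rebuild the structure afterwards.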

 

Problem:

I have fine-tuned a Llama2 model on domain-specific data consisting of raw source and target sentences. I want to use Llama2 for inference, because CustomTranslationEngine currently uses only Azure Translator and the Azure Custom Translator service. Azure Translator doesn't need a custom parsing solution, as it accepts HTML text and handles it by default. For inference with the fine-tuned Llama2 model, however, the challenge is to write a custom HTML parser that extracts the source text and returns the translations with the HTML structure reconstructed around them. I have written a custom HTML parser using Beautiful Soup, which extracts the source text from the HTML structure and reconstructs the HTML structure around the translations before sending them back to Trados. However, this solution does not work well when the HTML structure is complex, with deeply nested span tags.

 

One example is as follows:

English: This is how AI is driving companies forward in Germany.

German: So bringt KI Firmen in Deutschland voran.

 

As you can see, producing a reliable translation while keeping the formatting isn't always straightforward. The source text contains a contiguous run of bold text (in the HTML structure, the sentence sits in a div, and the bold text is wrapped in span tags with unique ids), but in the translation the bold words might end up at different positions in the target sentence. So a simple solution that extracts the source text and just swaps in its raw translation won't work reliably as a general solution.
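One possible way to frame this, as a rough sketch rather than a tested solution: replace each tag with a short numbered placeholder before translation, let the model move the placeholders together with the words they wrap, and then map the placeholders back to the original tags. The sketch assumes that segments contain only paired tags such as div and span (as in the payload above), that the text itself never contains something like `<3>`, and that the model reproduces the placeholders faithfully, which would have to be validated. Whether Trados accepts spans in a different order is also an assumption. The names TagMasker, mask and unmask are hypothetical.

```python
import html
import re
from html.parser import HTMLParser

PLACEHOLDER = re.compile(r"<(/?)(\d+)>")


class TagMasker(HTMLParser):
    """Rewrites every tag pair as a numbered placeholder: <1> ... </1>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the masked string
        self.stack = []  # numbers of the currently open tags
        self.tags = {}   # number -> (original start tag text, tag name)

    def handle_starttag(self, tag, attrs):
        n = len(self.tags) + 1
        self.tags[n] = (self.get_starttag_text(), tag)
        self.stack.append(n)
        self.out.append(f"<{n}>")

    def handle_endtag(self, tag):
        if self.stack:
            self.out.append(f"</{self.stack.pop()}>")

    def handle_data(self, data):
        self.out.append(data)


def mask(segment_html):
    """Return (masked text, tag table) for one segment."""
    parser = TagMasker()
    parser.feed(segment_html)
    parser.close()
    return "".join(parser.out), parser.tags


def unmask(masked, tags):
    """Put the original tags back around (possibly reordered) placeholders."""
    parts, pos = [], 0
    for m in PLACEHOLDER.finditer(masked):
        parts.append(html.escape(masked[pos:m.start()], quote=False))
        start_text, name = tags[int(m.group(2))]
        parts.append(f"</{name}>" if m.group(1) else start_text)
        pos = m.end()
    parts.append(html.escape(masked[pos:], quote=False))
    return "".join(parts)


source = ("<div id='1'><span id='a'>Hello </span>"
          "<span id='b'><span id='c'>world</span></span></div>")
masked, tags = mask(source)
print(masked)

# Stand-in for model output in which the formatted word moved to the front.
translated = "<1><3><4>Welt</4></3><2> hallo</2></1>"
print(unmask(translated, tags))
```

The model then only has to preserve a handful of short markers instead of long span tags with ids, and the ids never pass through the model at all.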

 

Questions:

  1. Is there an existing parsing solution from Trados in the form of utility that can be imported in my custom translation engines to solve this issue?
  2. Is there a simpler format option than HTML that still preserves formatting metadata but is easier to parse?
  3. How have others addressed this issue of HTML format preservation?

Thank you very much in advance for your time.
