How to parse HTML source text to extract raw text for machine translation and reconstruct HTML structure around translations?

Context:
I have developed a custom translation engine (CustomTranslationEngine) that is hosted as an Azure Functions app and integrated with the Trados platform, and it is working perfectly fine. I can create translation projects and use CustomTranslationEngine seamlessly. The documentation for the translation endpoint can be found at https://sdl-language-tech.stoplight.io/docs/addonapi/f1f3a0b82647d-translate

The Trados platform supports uploading various file formats, such as PDF, XML, CSV, PPT, and many more. Before sending an HTTP request to my custom translation engine, Trados creates an HTML tree structure from the uploaded file, whatever its format. My CustomTranslationEngine on the Azure Functions app receives the text to translate as HTML strings in the body of an HTTP POST request, as shown below:

[
"<div id='1'>Demo 2</div>",
"<div id='2'>This is not just for testing, this is a real demo</div>",
"<div id='3'>Please be aware that your translated content can contain formatting and sometimes, this may have to be applied somewhere else in the target text!</div>"
]

As you can see, the body is a list of strings, and each element is a string of HTML content. The original plain text is wrapped in div and span tags with unique ids to preserve the text's formatting, location, etc. in the original uploaded file. I believe this is done so that the translations can be written back to the output file with the same formatting and at the same location as in the source file. For example, if the uploaded file was a PDF, its content is converted to an HTML structure that preserves the location and formatting of the original text, and the expected output is a PDF file with the translated text in the original formatting and location.
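For illustration, here is a minimal sketch of pulling the id and the plain text out of each string in such a request body. It uses Python's standard-library `html.parser` instead of Beautiful Soup only to stay dependency-free. `TextExtractor` and `extract_segments` are hypothetical names, and the sketch assumes every string has exactly one outer div.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the id of the outer div and every text node inside it."""

    def __init__(self):
        super().__init__()
        self.segment_id = None
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: the first div encountered is the segment wrapper.
        if tag == "div" and self.segment_id is None:
            self.segment_id = dict(attrs).get("id")

    def handle_data(self, data):
        self.parts.append(data)


def extract_segments(payload):
    """Return a list of (segment id, plain text) pairs, one per HTML string."""
    segments = []
    for html in payload:
        parser = TextExtractor()
        parser.feed(html)
        parser.close()
        segments.append((parser.segment_id, "".join(parser.parts)))
    return segments


payload = [
    "<div id='1'>Demo 2</div>",
    "<div id='2'>This is not just for testing, this is a real demo</div>",
]
segments = extract_segments(payload)
```

This flattens nested span tags into plain text, so the positions of the inline formatting are lost at this step.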

Problem:

I have fine-tuned a Llama2 model on domain-specific data consisting of raw source and target sentences. I want to use Llama2 for inference, as CustomTranslationEngine currently uses only Azure Translator and the Azure Custom Translator service. The Azure Translator service doesn't need a custom parsing solution because it accepts HTML text and understands it by default. For inference with the fine-tuned Llama2 model, however, the challenge is to write a custom HTML parser that extracts the source text and returns the translations with the HTML structure reconstructed around them. I have written such a parser using Beautiful Soup; it extracts the source text from the HTML structure and rebuilds the structure around the translations before sending them back to Trados. However, this solution does not work well when the HTML structure is complex, with deeply nested span tags.

One example is as follows:

English: This is how AI is driving companies forward in Germany.

German: So bringt KI Firmen in Deutschland voran.

As you can see, producing a reliable translation while keeping the formatting isn't always straightforward. The source text contains a contiguous run of bold text (in the HTML structure, the sentence sits in a div, with span tags carrying unique IDs around the bold text), but in the translation the bold words might appear at different positions. So simply extracting the source text and swapping in its raw translation won't work as a general solution.
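One way to let formatting move with the words is to mask each inline tag as a short numbered placeholder before translation and to restore the original tags afterwards. The sketch below is not something Trados provides, and it only helps if the model reliably copies the placeholders, which a model fine-tuned on plain text may not do. `mask_inline_tags` and `unmask_inline_tags` are hypothetical names, the span around "AI" with id 5 is invented purely for illustration, and the markup is assumed to be well-formed (void elements written as `<br/>`).

```python
import re

TAG_RE = re.compile(r"<(/?)(\w+)([^>]*)>")
PLACEHOLDER_RE = re.compile(r"<(/?)(\d+)>")


def mask_inline_tags(inner_html):
    """Replace each tag with a numbered placeholder such as <1> ... </1>."""
    tags = {}    # placeholder number -> (original opening tag, tag name)
    stack = []   # numbers of the placeholders that are still open
    counter = 0

    def repl(match):
        nonlocal counter
        closing, name, _ = match.groups()
        if closing:
            return f"</{stack.pop()}>"
        counter += 1
        tags[counter] = (match.group(0), name)
        # Self-closing tags such as <br/> have no matching close to wait for.
        if not match.group(0).endswith("/>"):
            stack.append(counter)
        return f"<{counter}>"

    return TAG_RE.sub(repl, inner_html), tags


def unmask_inline_tags(masked, tags):
    """Put the original tags back wherever the placeholders ended up."""

    def repl(match):
        closing, number = match.group(1), int(match.group(2))
        opening, name = tags[number]
        return f"</{name}>" if closing else opening

    return PLACEHOLDER_RE.sub(repl, masked)


source = "This is how <span id='5'>AI</span> is driving companies forward in Germany."
masked, tags = mask_inline_tags(source)
# Hypothetical model output in which the placeholder moved with the word.
restored = unmask_inline_tags("So bringt <1>KI</1> Firmen in Deutschland voran.", tags)
```

If the model invents a placeholder number that was never issued, `unmask_inline_tags` raises a `KeyError`, which is a convenient signal to retry or fall back.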

Questions:

1. Is there an existing parsing solution from Trados, in the form of a utility, that I can import into my custom translation engine to solve this issue?
2. Is there a simpler format option than HTML that still preserves formatting metadata but is easier to parse?
3. How have others addressed this issue of preserving HTML formatting?

Thank you very much in advance for your time.


0 Mihai Dipsan 3 months ago

Hi, if I understand correctly, you are using the Llama 2 model to translate the plain text extracted from the received HTML. In that case, wouldn't it be easier to instruct the model to translate the whole HTML segment? For instance, you could ask it to preserve the HTML structure. This way you don't need to extract the plain text, and the translation would already be in HTML format.

Also, to answer your questions:

1. As far as I know, no.

2. MT extensions support `html` or `bcm` (https://eu.cloud.trados.com/lc/extensibility-docs/3b200f7686e0a-descriptor). If you choose `bcm`, the whole BCM document will be sent to your app (https://eu.cloud.trados.com/lc/extensibility-docs/b32b80eaf407d-translate), but I suppose it is easier to work with HTML.

3. I don't know how every MT provider app handles the content, but I think most of them rely on translation providers that support HTML.
0 Amir Abu Jandal 3 months ago in reply to Mihai Dipsan

Hi, thanks for your feedback.

Yes, you got it right. I am using a Llama2 model (fine-tuned on my own data) for translation. However, the fine-tuning data was plain text, whereas the live production data has HTML tags around it. I wrote a parser that extracts the plain text from the HTML structure, gets the translations, and returns them with the HTML structure rebuilt around them. The parser does not work well, though: formatting is not preserved as it should be.

I also tried instructing Llama2 at inference time to preserve the HTML structure, as you describe in your answer. However, this doesn't work all the time, especially with highly complex, nested HTML. Do you have experience with making an LLM produce completions in a structured format that never goes wrong?

Thanks for your time.
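Prompting alone cannot guarantee that the structure survives, but a broken result can at least be detected before it is returned to Trados. The sketch below compares the tags and ids of the source with those of the model output and retries or falls back on a mismatch. It uses only the standard library. `translate` and `fallback` are hypothetical callables (for example, the Llama2 call and the existing parser-based path), and the check ignores nesting order.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Counts opening tags (with their id) and closing tags."""

    def __init__(self):
        super().__init__()
        self.opened = Counter()
        self.closed = Counter()

    def handle_starttag(self, tag, attrs):
        self.opened[(tag, dict(attrs).get("id"))] += 1

    def handle_endtag(self, tag):
        self.closed[tag] += 1


def tag_signature(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.opened, collector.closed


def structure_preserved(source_html, translated_html):
    """True if both strings contain the same tags and ids, ignoring the text."""
    return tag_signature(source_html) == tag_signature(translated_html)


def translate_with_check(source_html, translate, fallback, max_attempts=3):
    """Retry the model a few times, then use the fallback path."""
    for _ in range(max_attempts):
        candidate = translate(source_html)
        if structure_preserved(source_html, candidate):
            return candidate
    return fallback(source_html)
```

Retrying only helps when the model samples with some randomness; with deterministic decoding, a single attempt followed by the fallback is enough.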
0 Mihai Dipsan 3 months ago in reply to Amir Abu Jandal

Indeed, it takes a lot of trial and error to come up with a prompt that yields results with a high success rate. You can start by providing some good examples of the source HTML along with the translated version and then work on improving it over time by expanding the instructions. Also, there are a lot of prompt engineering techniques to take inspiration from.
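As a starting point for that kind of iteration, a few-shot prompt could be assembled along these lines. This is only a sketch: `build_few_shot_prompt` is a hypothetical helper, the instruction wording and the "Source:" / "Translation:" labels are assumptions to be tuned, and the example pair is invented.

```python
def build_few_shot_prompt(examples, source_html,
                          source_lang="English", target_lang="German"):
    """Build a prompt from (source HTML, translated HTML) example pairs."""
    lines = [
        f"Translate the text inside the HTML from {source_lang} to {target_lang}.",
        "Keep every tag and every id attribute exactly as in the source; "
        "move a tag only if the words it wraps move.",
        "",
    ]
    for src, tgt in examples:
        lines += [f"Source: {src}", f"Translation: {tgt}", ""]
    # The prompt ends with an open "Translation:" for the model to complete.
    lines += [f"Source: {source_html}", "Translation:"]
    return "\n".join(lines)


examples = [("<div id='1'>Hello</div>", "<div id='1'>Hallo</div>")]
prompt = build_few_shot_prompt(examples, "<div id='2'>Demo</div>")
```

Examples that include nested span tags and reordered words are likely to matter most, since those are the cases that fail.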
