Processing bilingual PDF file

Dear Community!

Is there any workaround, plugin, or miracle remedy for processing a non-scanned PDF file decently?

The problem: the PDF was received in two languages. The task: replace only the English text.

I have done the following:

1. Converted the PDF back into Word and tried to detect and highlight the French text there, so that it could be filtered out in Trados. The conversion result was already a disaster.

2. Uploaded the PDF to Trados and tried to filter on the italic text (English). Impossible.

Is there any way to prepare and translate these documents?

Infinitely grateful in advance! 

  •  

    I think this is probably one of those cases where the client needs to understand the process: if they cannot provide translatable source files, they should expect to pay for the preparation work.  I would look at a process like this:

    1. Use ABBYY FineReader PDF (or another OCR/PDF parser) and export to XML.  FineReader XML gives you blocks, paragraphs, lines, characters, bounding boxes, and often xml:lang.

    2. Design a lightweight schema that suits translation:
      1. Assign IDs per segment.
      2. Add lang and translate attributes.
      3. Only segments with translate="yes" are extracted into your translation workflow.
      4. Use a stylesheet to preview the full text, if that is important.

    3. Translate the file, then save the target and push into a Word/LaTeX/HTML template (or back through FineReader if you want PDF).

    4. Export to PDF with formatting.
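
    Step 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the text blocks are taken to be already extracted from the OCR/PDF export as a plain list, and the language is guessed with a tiny stopword count (`guess_lang`, `STOPWORDS` and `build_document` are hypothetical names, not part of FineReader or Trados). If the export already carries a reliable language attribute, that would be the better source.

```python
import re
import xml.etree.ElementTree as ET

# Tiny stopword lists: an illustrative assumption, far from exhaustive.
STOPWORDS = {
    "en": {"the", "is", "an", "for", "to", "of", "and", "in", "it", "this", "that", "with"},
    "fr": {"le", "la", "les", "est", "un", "une", "de", "des", "et", "il", "pour", "que"},
}


def guess_lang(text):
    """Guess 'en' or 'fr' by counting stopword hits; return 'und' on a tie."""
    words = re.findall(r"[a-zà-ÿ]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    if scores["en"] == scores["fr"]:
        return "und"  # undetermined: leave for manual review
    return max(scores, key=scores.get)


def build_document(blocks, translate_lang="en"):
    """Wrap each text block in a <segment> with id, lang and translate attributes."""
    root = ET.Element("document")
    page = ET.SubElement(root, "page", number="1")
    for i, text in enumerate(blocks, start=1):
        lang = guess_lang(text)
        seg = ET.SubElement(
            page,
            "segment",
            id=f"seg{i:03d}",
            lang=lang,
            translate="yes" if lang == translate_lang else "no",
        )
        seg.text = text
    return root


# Hypothetical input: blocks as they might come out of the OCR/PDF export.
blocks = [
    "Le DHS (« Data Handling System ») est un équipement…",
    "The Data Handling System (DHS) is an instrument for…",
    "Il permet notamment de :",
    "In particular, it serves to:",
]
root = build_document(blocks)
print(ET.tostring(root, encoding="unicode"))
```

    Segments that come out as lang="und" are marked translate="no" here, so they would need a manual check before the file goes into translation.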

    It's unlikely to be perfect and you'll have some tidying up to do, so Word might be the better format to push into.  But at least the translation process will be cleaner and less problematic.

    Or... just ask the client to prepare proper source files for you!  They could do this themselves and just pay for the translation.

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • Dear Paul....

    What do you mean by "design a lightweight schema"? Load this into Trados?

    Screenshot showing a dense block of text in blue background, containing technical details and parameters related to a product, with no visible errors or warnings.

    And of course, it would have been the easiest thing on earth if they had just sent us text files. The PDFs are not even scanned!


    Generated Image Alt-Text
    [edited by: RWS Community AI at 10:50 AM (GMT 1) on 11 Sep 2025]
  •  

    No... something like this:

    <document>
      <page number="1">
        <segment id="seg001" lang="fr" translate="no">
          Le DHS (« Data Handling System ») est un équipement…
        </segment>
        <segment id="seg002" lang="en" translate="yes">
          The Data Handling System (DHS) is an instrument for…
        </segment>
        <segment id="seg003" lang="fr" translate="no">
          Il permet notamment de :
        </segment>
        <segment id="seg004" lang="en" translate="yes">
          In particular, it serves to:
        </segment>
      </page>
    </document>
    

    When you convert to XML you need to decide what this will look like.  So it might be something like this.

    This is only a suggestion, and it means you need to do the conversion with a sensible OCR tool, such as ABBYY FineReader.
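
    To show how such a file could be used, here is a minimal Python sketch that pulls out only the translate="yes" segments and later writes the translations back by segment ID. It is a sketch under assumptions, not a description of how Trados handles the file: the function names are hypothetical, the "translations" are placeholders, and the target language code "de" is made up for the example.

```python
import xml.etree.ElementTree as ET

# The example schema from above, inlined so the sketch is self-contained.
SAMPLE = """<document>
  <page number="1">
    <segment id="seg001" lang="fr" translate="no">Le DHS (« Data Handling System ») est un équipement…</segment>
    <segment id="seg002" lang="en" translate="yes">The Data Handling System (DHS) is an instrument for…</segment>
    <segment id="seg003" lang="fr" translate="no">Il permet notamment de :</segment>
    <segment id="seg004" lang="en" translate="yes">In particular, it serves to:</segment>
  </page>
</document>"""


def extract_translatable(root):
    """Return {segment id: text} for every segment marked translate="yes"."""
    return {
        seg.get("id"): (seg.text or "").strip()
        for seg in root.iter("segment")
        if seg.get("translate") == "yes"
    }


def merge_translations(root, translations, target_lang):
    """Replace the text of translatable segments with their translation, by id."""
    for seg in root.iter("segment"):
        seg_id = seg.get("id")
        if seg.get("translate") == "yes" and seg_id in translations:
            seg.text = translations[seg_id]
            seg.set("lang", target_lang)


root = ET.fromstring(SAMPLE)
todo = extract_translatable(root)

# Placeholder "translation": in practice this comes from the translator or CAT tool.
done = {seg_id: f"[DE] {text}" for seg_id, text in todo.items()}

# "de" is a hypothetical target language used only for this example.
merge_translations(root, done, target_lang="de")
print(ET.tostring(root, encoding="unicode"))
```

    The French segments pass through untouched, which is the whole point of the translate attribute; the merged file can then be pushed into whatever template is used for the final layout.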

    Paul Filkin | RWS Group

  • Ok, I get your point...

    But the result I got from Adobe Acrobat Pro (maybe not that sensible a tool) is the one I posted above. It is not broken down into lines:

    Adobe Acrobat Pro interface showing the 'Convert' tab with options to export a PDF to Microsoft Word, PowerPoint, Excel, image formats, or other formats like RTF, XML, and HTML.

    But then, does the "translate / do not translate" flag have to be assigned manually? Or is the program supposed to detect the language?

    Looks like this mission is impossible....
