PDF Issue: Layout of Translated File & Recognition of Source Text

Former Member

Hi all,

I am trying to translate an original (editable) PDF in Trados from English into Greek. However, two major issues arise. First, many of the characters of the source text are not recognized properly when the file is imported into Trados. As a result, the source text in my Editor appears as gibberish in many segments. For example, the character x is replaced with - while w is replaced with x.

Screenshot of Trados Studio Editor showing text recognition errors, with characters like 'x' replaced with '-' and 'w' replaced with 'x'.

An example of bad text recognition

Second, when I generate the translation, the translated file is totally ruined in terms of layout. For example, much of the text of the 1st page has been moved to the 2nd page, the segmentation has been ruined, and the spacing is terrible (see images below).

Screenshot of the original PDF layout with proper text alignment and spacing on the first page.

The layout of the source text

Screenshot of the translated PDF layout on the first page with text misalignment and spacing issues.

The layout of the target file (1st page)

Screenshot of the translated PDF layout on the second page showing text overflow from the first page and disrupted segmentation.

The layout of the target file (2nd page)

In general, I'm always facing layout issues when it comes to translating PDF files. What would be your suggestion on preventing these issues, if possible?

Kind regards,

Christos



  • The suggestion is simple - translate the source document from which the PDF was created, not the PDF.

    Unfortunately, clients are clueless and do not know that PDF is NOT "just another document format" which can be freely edited.
    And it's even more unfortunate that many (the majority?) translators do not know this either...

    PDF was invented as a consume-only (i.e. read-only, print-only) format and was NEVER intended to be editable or otherwise processable.

    Therefore, if a client wants to have a PDF localized, the only correct process is to localize the original format used to create the PDF (which can be Word, InDesign, Quark, or whatever else), NOT the PDF.
    Period.

  • I must disappoint you - there is no way to get a better result unless you invest some work, and you have to invest it upfront. Even if it is possible to translate a PDF directly with a CAT tool, this is a very bad idea. You have already learned why - the conversion is what it is, and you have no influence over what gets converted or how.

    Either insist on translating the native format the document had before it became a PDF, or use decent OCR software and convert the PDF manually. Then make sure that the fonts used cover your target language. Expecting any automated tool to provide perfect conversion quality is - forgive my French - naive at best.
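
A rough way to act on the font advice above is to compare the characters of the translated text against the set of code points a font supports. The sketch below is only an illustration: `missing_characters` and `latin_only` are hypothetical names, and the coverage set is hard-coded as an assumption. In practice you would have to read the real coverage from the font file with a font library, which is not shown here.

```python
import unicodedata


def missing_characters(text, supported_codepoints):
    """Return the characters in `text` whose code points are not in
    `supported_codepoints`, in order of first appearance."""
    missing = []
    seen = set()
    for ch in text:
        # Whitespace needs no glyph check; skip characters already examined.
        if ch.isspace() or ch in seen:
            continue
        seen.add(ch)
        if ord(ch) not in supported_codepoints:
            missing.append(ch)
    return missing


# ASSUMPTION: a hypothetical Latin-only font that covers
# Basic Latin and the Latin-1 Supplement, and nothing else.
latin_only = set(range(0x20, 0x7F)) | set(range(0xA0, 0x100))

greek_sample = "Βιογραφικό σημείωμα (CV)"
for ch in missing_characters(greek_sample, latin_only):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
```

If the list is not empty, the target text will likely show boxes or fallback glyphs in that font, so a different font is needed before delivery.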

    If you want to learn more about Studio and PDF, watch this upcoming webinar: http://seminare.bdue.de/4705

    BTW, the term "editable" PDF is very misleading. No PDF is "editable", as the format was developed entirely for READ-ONLY use. It is not intended to be edited in any way. What you mean is a "clickable" PDF, where you can click and select text. If the PDF is not protected, you can also copy the text. If the customer will not deliver the original source file, the best idea would be simply to copy all the text into a plain-text editor such as Notepad to remove all formatting, translate it, apply basic formatting like headings and list elements, and deliver this to the customer so that their layout person can copy & paste it into place. Or reformat it yourself.
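
Text copied out of a PDF usually arrives with hard line breaks, split words and odd spaces, so it helps to tidy it before translating. The following is a minimal sketch of such a clean-up, using only the Python standard library. The function name `clean_pdf_text` and the sample text are made up for illustration, and the rules are assumptions about what typical pasted PDF text looks like, not a complete solution.

```python
import re


def clean_pdf_text(raw):
    """Turn text copied from a PDF into plain paragraphs separated by blank lines."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    text = text.replace("\u00ad", "")   # drop soft hyphens
    text = text.replace("\u00a0", " ")  # non-breaking space -> normal space
    # Re-join words split across a line break: "trans-\nlator" -> "translator".
    # Caveat: this also merges genuine compounds such as "read-\nonly".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    cleaned = []
    # A blank line (possibly containing stray spaces) marks a paragraph break.
    for paragraph in re.split(r"\n\s*\n", text):
        # Collapse remaining line breaks and repeated spaces inside the paragraph.
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)
    return "\n\n".join(cleaned)


sample = (
    "Curriculum  Vitae\n\n"
    "Experienced trans-\nlator working from\nEnglish into Greek.\n\n\n"
    "  Skills:\u00a0Trados Studio\n"
)
print(clean_pdf_text(sample))
```

The result still needs a human check against the PDF, because reading order and hyphenation cannot be recovered reliably from pasted text alone.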

    _________________________________________________________

    When asking for help here, please be as accurate as possible. Please always remember to give the exact version of product used and all possible error messages received. The better you describe your problem, the better help you will get.

    Want to learn more about Trados Studio? Visit the Community Hub. Have a good idea to make Trados Studio better? Publish it here.

  • Fun fact: Looking closer at the PDF screenshot I realized (only after posting my answer) that the "customer" is actually Christos himself ;-)

  • I'm afraid the advice you're getting is probably the right advice. PDF files can be notoriously difficult to manage, and even if the conversion to Word for the translatable file goes well, the layout is easily lost if it's complex, as Word isn't the best tool for things like this.

    What software did you use for the original file... maybe there is a way to get at the content another way?

    You could also try using IRIS.  Make sure you have the IRIS plugin installed and activated before you create your project.  It "might" help.

    https://multifarious.filkin.com/2017/08/17/iris-ocr/

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • Fully agree - I often open scanned documents just to have a basis for setting up the source text (we get a lot of notarised copies of documents about capital instruments, often very old ones). I still find it worth taking the time to run OCR and then correct the source text formatting in the generated Word document. I tend to generate a target-language version based on the OCR output and then reformat and correct the Word document it creates - it is a lot quicker than retyping the document.

  • If you consider that bad, wait until you get items in Fraktur from about 1910 and scanned as a PDF. :-)

  • Former Member

    Hi all,

    Thank you for the help. Some clients do not have the original file (from before it was converted to PDF), which is why I'm looking for a workaround. Also, when exporting a file from some applications, only PDF is available. For this example, I used a PDF which I downloaded from Resume.io, a website where one can create a CV from pre-existing templates. This is why I was the hypothetical client in this case. Well spotted!

    Regarding OCR, which software would you suggest for PDFs?

    Kind regards,

    Christos

  • For this example, I used the PDF which I downloaded from Resume.io

    Screenshot of a FAQ section with the question 'Can I download my resume to Word or PDF?' Highlighted options for download formats include PDF, DOCX (beta version), and TXT.

  • Yes, in some cases it is indeed PDF and a pain... In such cases you have these options:

    • Export directly to Word from Adobe Acrobat (full version)
    • Use decent OCR software such as ABBYY FineReader - but NOT the automated mode, only the manual process where you identify tables, pictures, background pictures and so on yourself; an automated conversion will (nearly always) deliver a mess
    • Find someone who is proficient in converting PDF to Word, pay him/her for this work and be happy with a perfect result
    • In any case: charge the customer for the extra work

  • Former Member

    Thank you for your input, everyone. That was really helpful, as I knew very little about PDFs.

    All best,

    Christos