PDF conversion/import renders "ti" and "fi" characters incorrectly (ligature problem)

Hi all, 

For about a year now, certain character combinations are omitted or converted to ligatures when opening an editable PDF file in Word. I understand it's a font problem, so I've tried changing the "Ligatures" setting in "OpenType Features" in Word, but nothing has worked.

After reading Nichola Knusten's recent forum post on ligatures, I realise the same problem occurs when opening editable pdf files in Trados, hence this new post!

Specifically, "ti" is omitted and "fi" becomes a single character. See the words notificación and certificado in this screenshot:

  

I'm attaching a sample pdf file, in the hope that someone can help me solve the problem either in the Word conversion or in Trados.

  •  

    Dear Emma

    Please read this topic, as it deals with the same problem. Use a different font and the problem will be gone. Or do not display any formatting, only tags. This will also solve the problem.

    _________________________________________________________

    When asking for help here, please be as accurate as possible. Please always remember to give the exact version of product used and all possible error messages received. The better you describe your problem, the better help you will get.

    Want to learn more about Trados Studio? Visit the Community Hub. Have a good idea to make Trados Studio better? Publish it here.

  • Hi Jerzy, 

    Please re-read my post ;) where I specifically mentioned Nichola Knusten's topic and linked to it.

    I've tried different fonts and I've never worked with formatting displayed. Neither solves the problem, I'm afraid.

    Please could you try opening my test PDF file in Studio? It would be interesting to see how it renders in your Editor window.

  •  

    Sorry for being too quick.

    Unfortunately, I cannot download the PDF, just view it. Would you please send it to me at info (at) tts-td dot com?

  • Thanks for looking, Jerzy. If you view the PDF, there should be a download button in the top right-hand corner. If you don't have that button, let me know and I'll send it to you by email.

  •  

    The button is there, but seems not to work, at least not in Firefox.

  •   

    Thank you for the file. The conversion happens before Trados Studio is involved; the ligatures are already in the Word file before it is opened in Studio:

    Screenshot of a text document with ligature issues, showing the 'fi' characters merged incorrectly, highlighted with red circles.

    Of course they remain there in TS. So the problem starts before. However, the ligatures remain in Word even when I change the font there. My solution (for all PDFs, TBH) is never to use TS for converting. I either convert by exporting from Acrobat or use OCR.

    [edited by: RWS Community AI at 6:27 PM (GMT 0) on 16 Nov 2024]
  • Thank you for looking, Jerzy. I never use Trados for PDFs either, but I thought I would try it this time in case it handled the conversion from PDF differently from Word.

    For editable files (like this one), I either use "export to" in PDF-XChange Editor or "open" in Word. With both methods I get the omitted characters (NB, "ti" is rendered as a space in your Word file above) and ligatures for "fi".

    It's quite frustrating. 

  •  

    This is a very long-standing issue, apparently without a solution as of now. It might be easiest to set up a custom find-replace routine. Export a Hunspell dictionary and, in each word, replace the ligature-prone letter pair with \s+ to get a search pattern, like "no\s+ficación" -> "notificación". Something along these lines.

    Or you give it to the Almighty AI (NDA?):

    Screenshot of a conversation about a Spanish text with ligature issues in Trados Studio. The original text with spacing errors is quoted, followed by a corrected version of the text.

    [edited by: RWS Community AI at 9:26 AM (GMT 0) on 18 Nov 2024]
  • Thanks for this. GenAI indeed works brilliantly for this sort of thing, but it's impractical for longer or formatted text and, as you say, has confidentiality issues.

    Your other suggestion for the find-replace routine is interesting. I don't use Hunspell, but I've now set up a batch F&R in TransTools, which works perfectly for these ligatures in Word, BUT Trados continues to render them as ligatures. No idea why.

    Unfortunately, find-replace doesn't work for omissions (a single space added in Word instead of the "ti" ligature), for obvious reasons.
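
For anyone who would rather script the ligature replacement than use a batch F&R tool, here is a minimal Python sketch. It assumes the merged "fi" in the converted text really is one of the standard Unicode ligature characters (U+FB00 to U+FB04); if the converter emits some other code point, the table would need extending. The name `expand_ligatures` is just for illustration. As noted above, this cannot help with the omitted "ti", because there is nothing left to map.

```python
# Standard Latin f-ligatures from the Unicode "Alphabetic Presentation Forms"
# block, mapped to their plain-letter spellings.
# Assumption: the converted text uses these code points for the merged glyphs.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts a dict of single characters -> replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace single-character ligatures with the letters they stand for."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "certi\ufb01cado"  # "certificado" with a one-character "fi"
    print(expand_ligatures(sample))
```

An explicit table is used here rather than a blanket Unicode normalisation so that nothing else in the text (superscripts, full-width characters and so on) gets changed as a side effect.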

    Investigating further, I now understand better why this happens. It only occurs with PDF files created from Word documents set in Calibri. Certain Calibri ligatures (specifically ti and tt) are rendered as a single space on conversion because Calibri uses non-standardised Unicode code points for them. (See the comment by Bevi Chagnon in this Adobe forum thread.)

    Another non-tech, longer-term solution might be to stop manually correcting these ligature problems in Word and start adding segments with incorrectly rendered ligatures to my TM. Sooner or later, I'll start getting 100% or fuzzy matches for them. :D

      


  •  

    You don't need to use Hunspell. You just need a dictionary of Spanish words, from which you can create a CSV with [word with ligature characters missing],[correct word].

    Initially you'd filter for words where that can occur, i.e. words containing "ti", "fi" etc.

    Then you copy each word to get [notificación],[notificación]

    Then, in the first column, you replace the ligature-prone letter combination with "\s+".

    Then you have your find/replace patterns in a CSV and just need a tool to apply them. CleanUp Task might be such a tool, or some script. Whether this is worth the effort depends on how often you encounter this, I guess.
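
The steps above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: the word list is a plain list of correctly spelled words, the dropped pairs are "ti" and "tt" (the ones reported in this thread as turning into a space), and any one-character "fi" ligatures have already been replaced by plain letters. The names `build_patterns` and `repair` are made up for the example. Words that start or end with a dropped pair are skipped, because there the gap merges with ordinary word spacing and cannot be matched safely.

```python
import re

# Letter pairs reported in the thread as being rendered as a space.
DROPPED = ("ti", "tt")


def build_patterns(words, dropped=DROPPED):
    """Return (compiled_regex, correct_word) pairs for affected words."""
    splitter = re.compile("|".join(re.escape(pair) for pair in dropped))
    patterns = []
    for word in words:
        parts = splitter.split(word)
        if len(parts) == 1:
            continue  # word contains none of the dropped pairs
        if not parts[0] or not parts[-1]:
            continue  # pair at the start or end of the word: skipped here
        # Each dropped pair becomes "one or more whitespace characters";
        # the lookarounds stop the pattern matching inside longer words.
        body = r"\s+".join(re.escape(part) for part in parts)
        patterns.append((re.compile(r"(?<!\w)" + body + r"(?!\w)"), word))
    return patterns


def repair(text, patterns):
    """Apply every find/replace pattern to the text (case-sensitive)."""
    for regex, word in patterns:
        text = regex.sub(word, text)
    return text


if __name__ == "__main__":
    patterns = build_patterns(["notificación", "certificado", "casa", "tiempo"])
    print(repair("La no ficación del cer ficado", patterns))
```

The (pattern, word) pairs could just as well be written out with Python's `csv` module and applied by another tool. Short words will produce patterns that also match legitimate two-word sequences, so it is probably worth reviewing the pattern list, or restricting it to longer words, before running it on real files.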
