Studio 2015 + OCR functionality

"Easily and quickly translate PDFs

It can often be frustrating receiving an image or PDF to translate but the new built in OCR reader in Studio 2015 will make translating PDFs and images much easier, even if they have been created from a scanned document."

"Save a little extra time: Translate scanned PDFs
Working on uneditable PDF documents can be frustrating and timeconsuming. Studio 2015 lets you translate any PDF fie in Studio even if it’s created from a scanned document. The new built-in OCR functionality extracts the text and converts into a translatable file."

...


This is what the SDL advertisement says. But the reality is really sad and poor. I tried to convert many German, Czech and Slovak PDF files and the OCR results were absolutely unusable! Instead of accented characters you get nonsensical characters, there are spurious spaces in the middle of words, and so on.

Will SDL solve this problem? Did SDL even test this functionality with languages other than English? My colleagues say the only language the OCR works with is English.

I really feel cheated, because the OCR functionality was one of the TOP features in the SDL advertisements. By the way, my 15-year-old version of FineReader gives me better recognition results than Studio 2015. So it is still easier to go the old, complicated way, i.e. convert the PDF into a TIFF, run OCR on the TIFF with FineReader and save the recognized text as a Word file.

The reason why I purchased the Studio 2015 upgrade was this functionality - which does not work yet at all.

I hope I get an answer from SDL people here. I posted a similar text one month ago - with no reaction from SDL.

  • Hi Adrian,

    Which post received no answer? Apologies if this is the case, as I try to make sure all posts receive a response where required.

    On this one, there really isn't anything SDL need to resolve here, or rather can resolve easily. Comparing this with FineReader isn't really comparing "apples with apples", as FineReader is a dedicated tool specifically for handling PDF files and in my opinion is the best approach for anyone who has to handle PDF files on a regular basis.

    If you have simple PDF files, based on reasonably good documents in the first place, then you might get quite good results with the SDL filetype. But this is not a PDF editor, and you don't get any opportunity to edit the files before processing at all. My recommendation to anyone (if you don't have FineReader, for example) is to process the files in Studio and then use the DOCX files in the translation process after tidying them up (check formatting, remove inappropriate hard breaks, etc.).

    I think expecting Studio to be the perfect solution for all PDF files with no pre-processing is unrealistic. There's a good reason why even FineReader can't get it perfectly right every time, and they dedicate their entire software to this.

    Regards

    Paul Filkin | RWS Group


  • Hi Paul,
    thank you for your response. My last post was written approx. 26 days ago (https://community.sdl.com/products-solutions/language/translationproductivity/f/90/t/4704).
    I would like to stress that my FineReader version is 15 years old. It is a very simple and user-unfriendly app that was delivered with a Lexmark scanner. It does not support PDF files, so I have to convert the PDF file into an image file first, e.g. JPEG or TIFF, and only then can I run the OCR process.
    The scanned PDF documents I tried to convert with Studio 2015 were of very good quality and were recognized very well by FineReader. I also tried making a PDF file from a screenshot, which I think is the best quality you can get for OCR. But the problem with the Czech, Slovak and German characters remains, and not just with some of them but with ALL of the language-specific characters in the document. There were also many spaces in the middle of words, etc. Sure, you can (and have to) fix some wrong formatting, hard breaks, etc. in recognized documents. But in my case it would take many hours to correct all the wrong characters in my Czech, Slovak or German documents...
    In the file "SDL_pb_WhatsNew_Studio2015_EN_A4_hires_tcm73-83125.pdf", which I downloaded from the SDL web page, you say: "Working on uneditable PDF documents can be frustrating and timeconsuming. Studio 2015 lets you translate any PDF fie in Studio even if it’s created from a scanned document. The new built-in OCR functionality extracts the text and converts into a translatable file." In my opinion this is a clear message: "You can work with uneditable PDFs in Studio 2015. Studio can extract the text and convert it into a file you can work with".
    To put it briefly and clearly: I did not expect Studio to be a PERFECT tool for handling PDF files. I just expected a tool that would work as SDL describes in its advertisements.
    For reference, below you will find an example: the original Czech text copied from a news web page, followed by the text "recognized" from a screenshot of it. This is a typical result of the OCR recognition, even if I use a perfect PDF file.
    Regards,
    Adrian


    Témata: Týdeník Echo, ANO, Radmila Kleslová
    Politickou scénu rozvířila tento týden pražská organizace ANO, která v pondělí nevyjádřila důvěru vlastní primátorce Adrianě Krnáčové. O den později pak následovalo bouřlivé noční jednání předsednictva ANO za účasti pražských zastupitelů a přímé účasti Andreje Babiše i jeho pravé ruky (a současně šéfky pražského ANO) Radmily Kleslové, po němž se ukázalo, že si Adriana Krnáčová svůj post prozatím udrží. A to i přesto, že podle informací Týdeníku Echo vyzývala část zúčastněných Krnáčovou k rezignaci. Jaký má Radmila Kleslová vztah ke svému šéfovi Andreji Babišovi a jak uplatňuje svůj vliv v zákulisí ANO? I o tom je rozhovor Týdeníku Echo s první místopředsedkyní hnutí ANO Radmilou Kleslovou.

    Temata: Tydenik Echo, ANO, Radmila Kleslovó
    Politickou scśnu rozvićila tento tyden prażskś organizace ANO, kterś v pondóli nevyjśdżila dńveru vlastni primśtorce Adriane Krnśćovś. 0 den pozdeji pak nśsledovalo boućlivć noćnijednśni pćedsednictva ANO za ńćasti prażskych zastupitelń a pćimś ńćasti Andreje Babiśe ijeho pravś ruky (a soućasnś śćfky prażskśho ANO) Radmily Kleslovś, po nemż se ukśzalo, że si Adriana Krnśćovś svńj post prozatim udrżi. A to i pćesto, że podle informaci Tydeniku Echo vyzyvala ćśst zńćastnónych Krnśćovou k rezignaci. Jaky mś Radmila Kleslovś vztah ke svómu śśfovi Andreji Babiśovi a jak uplatńuje svńj vliv v zśkulisi ANO? Io tom je rozhovor Tydeniku Echo s prvni mistopćedsedkyni hnuti ANO Radmilou Kleslovou.
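
The mismatch between the two texts above can be made concrete with a few lines of Python. This is only a minimal sketch: the two strings are short excerpts copied from the original and the "recognized" text, the helper name `non_ascii_letters` is made up for illustration, and nothing here reflects how Studio itself processes the file.

```python
from collections import Counter
import unicodedata


def non_ascii_letters(text: str) -> Counter:
    """Count every non-ASCII letter in the text (after NFC normalization)."""
    normalized = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in normalized if ord(ch) > 127 and ch.isalpha())


# Excerpts from the example above: source text and OCR result.
original = "Politickou scénu rozvířila tento týden pražská organizace ANO"
recognized = "Politickou scśnu rozvićila tento tyden prażskś organizace ANO"

orig_chars = non_ascii_letters(original)
ocr_chars = non_ascii_letters(recognized)

# Accented letters that occur in the source but never in the OCR output.
lost = sorted(set(orig_chars) - set(ocr_chars))
# Accented letters in the OCR output that the source never used.
introduced = sorted(set(ocr_chars) - set(orig_chars))

print("lost:", " ".join(lost))
print("introduced:", " ".join(introduced))
```

For this excerpt, none of the accented letters of the source appear in the OCR output, and the output contains accented letters that the source does not use at all, which matches the observation that ALL of the language-specific characters are affected.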