Studio 2015 + OCR functionality

"Easily and quickly translate PDFs

It can often be frustrating receiving an image or PDF to translate but the new built in OCR reader in Studio 2015 will make translating PDFs and images much easier, even if they have been created from a scanned document."

"Save a little extra time: Translate scanned PDFs
Working on uneditable PDF documents can be frustrating and timeconsuming. Studio 2015 lets you translate any PDF fie in Studio even if it’s created from a scanned document. The new built-in OCR functionality extracts the text and converts into a translatable file."

...


This is what the SDL advertisement says. But the reality is really, really sad. I tried to convert many German, Czech and Slovak PDF files, and the OCR results were absolutely unusable! Instead of accented characters you get nonsensical characters, there are stray spaces in the middle of words, and so on.

Will SDL solve this problem? Did SDL even test this functionality with languages other than English? My colleagues said the only language the OCR works with is English.

I really feel cheated, because the OCR functionality was one of the TOP features in SDL advertisements. By the way, my 15-year-old version of FineReader gives me better recognition results than Studio 2015. So it is still easier to go the old, complicated way, i.e. convert the PDF into TIFF, run OCR on the TIFF with FineReader and save the recognized text as a Word file.

The reason I purchased the Studio 2015 upgrade was this functionality - which does not work at all yet.

I hope I get an answer from SDL people here. I posted a similar text one month ago - with no reaction from SDL.

  • Hi Adrian,

    Which post received no answer? Apologies if this is the case, as I try to make sure all posts receive a response where required.

    On this one, there really isn't anything SDL need to resolve here, or rather can resolve easily. Comparing this with FineReader isn't really "apples for apples" as FineReader is a dedicated tool specifically for handling PDF files and in my opinion is the best approach for anyone who has to handle PDF files on a regular basis.

    If you have simple PDF files, based on reasonably good documents in the first place, then you might get quite good results with the SDL filetype. But this is not a PDF editor, and you don't get any opportunity to edit the files before processing at all. My recommendation (if you don't have FineReader for example) to anyone is to process the files in Studio and then use the DOCX files in the translation process after tidying them up (check formatting, remove inappropriate hard breaks etc.)

    I think expecting Studio to be the perfect solution for all PDF files with no pre-processing is unrealistic. There's a good reason why even FineReader can't get it perfectly right every time, and they dedicate their entire software to this.

    Regards

    Paul Filkin | RWS Group

  • Hi Paul,
    Thank you for your response. My last post was written approx. 26 days ago (https://community.sdl.com/products-solutions/language/translationproductivity/f/90/t/4704)
    I would like to stress that my FineReader version is 15 years old; it is a very simple and user-unfriendly app that came with a Lexmark scanner. It does not support PDF files: I have to convert the PDF file into an image file first, e.g. JPEG or TIFF, and then I can run the OCR process.
    The scanned PDF documents I tried to convert with Studio 2015 were of very good quality and were recognized very well by FineReader. I also tried making a PDF file from a screenshot, which I think is the best quality you can get for OCR. But there is still the problem with the Czech, Slovak or German characters - not with some of them, but with ALL of the language-specific characters in the document. And there were many spaces in the middle of words, etc. Sure, you can (and have to) fix some wrong formatting, hard breaks etc. in recognized documents. But in my case it would take many hours to correct all the wrong characters in my Czech, Slovak or German documents...
    In the file "SDL_pb_WhatsNew_Studio2015_EN_A4_hires_tcm73-83125.pdf", which I downloaded from the SDL web page, you say: "Working on uneditable PDF documents can be frustrating and timeconsuming. Studio 2015 lets you translate any PDF fie in Studio even if it’s created from a scanned document. The new built-in OCR functionality extracts the text and converts into a translatable file." This is a clear message in my opinion: "You can work with uneditable PDFs in Studio 2015. Studio can extract the text and convert it into a file you can work with".
    To put it briefly and clearly: I did not expect Studio to be a PERFECT tool for handling PDF files. I just expected a tool that would work as SDL described in its advertisements.
    For reference, below is an example: the original Czech text copied from a news web page, followed by the text "recognized" from a screenshot of it. This is a typical result of the OCR recognition, even if I use a perfect PDF file.
    Regards,
    Adrian


    Témata: Týdeník Echo, ANO, Radmila Kleslová
    Politickou scénu rozvířila tento týden pražská organizace ANO, která v pondělí nevyjádřila důvěru vlastní primátorce Adrianě Krnáčové. O den později pak následovalo bouřlivé noční jednání předsednictva ANO za účasti pražských zastupitelů a přímé účasti Andreje Babiše i jeho pravé ruky (a současně šéfky pražského ANO) Radmily Kleslové, po němž se ukázalo, že si Adriana Krnáčová svůj post prozatím udrží. A to i přesto, že podle informací Týdeníku Echo vyzývala část zúčastněných Krnáčovou k rezignaci. Jaký má Radmila Kleslová vztah ke svému šéfovi Andreji Babišovi a jak uplatňuje svůj vliv v zákulisí ANO? I o tom je rozhovor Týdeníku Echo s první místopředsedkyní hnutí ANO Radmilou Kleslovou.

    Temata: Tydenik Echo, ANO, Radmila Kleslovó
    Politickou scśnu rozvićila tento tyden prażskś organizace ANO, kterś v pondóli nevyjśdżila dńveru vlastni primśtorce Adriane Krnśćovś. 0 den pozdeji pak nśsledovalo boućlivć noćnijednśni pćedsednictva ANO za ńćasti prażskych zastupitelń a pćimś ńćasti Andreje Babiśe ijeho pravś ruky (a soućasnś śćfky prażskśho ANO) Radmily Kleslovś, po nemż se ukśzalo, że si Adriana Krnśćovś svńj post prozatim udrżi. A to i pćesto, że podle informaci Tydeniku Echo vyzyvala ćśst zńćastnónych Krnśćovou k rezignaci. Jaky mś Radmila Kleslovś vztah ke svómu śśfovi Andreji Babiśovi a jak uplatńuje svńj vliv v zśkulisi ANO? Io tom je rozhovor Tydeniku Echo s prvni mistopćedsedkyni hnuti ANO Radmilou Kleslovou.
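
The pattern in the sample above can be checked mechanically. The following is a minimal Python sketch, not something Studio or FineReader provides: the function name `foreign_accented_chars` and the `CZECH_ACCENTED` set are made up for illustration, and the two strings are just the opening words of the sample. It lists the accented letters in a text that do not belong to the Czech alphabet.

```python
# Accented letters of the Czech alphabet, lower and upper case.
_LOWER = "áčďéěíňóřšťúůýž"
CZECH_ACCENTED = set(_LOWER + _LOWER.upper())


def foreign_accented_chars(text: str) -> list[str]:
    """Return the non-ASCII letters in `text` that are not Czech letters."""
    return sorted(
        {ch for ch in text
         if ord(ch) > 127 and ch.isalpha() and ch not in CZECH_ACCENTED}
    )


# Opening words of the original text and of the OCR result quoted above.
original = "Politickou scénu rozvířila tento týden pražská organizace ANO"
recognized = "Politickou scśnu rozvićila tento tyden prażskś organizace ANO"

print(foreign_accented_chars(original))
print(foreign_accented_chars(recognized))
```

For the original text the list is empty; for the recognized text it contains ć, ś and ż, letters used in Polish but not in Czech. That would be consistent with recognition running against a character set for a different language, although a check like this cannot prove the cause.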
  • Hi Adrian,

    The Studio OCR functionality is based on the Solid Documents engine, which covers 14 languages - including German, but not Czech or Slovak. That explains your frustration with the last two languages.

    The settings at the bottom of the PDF file type are very important when you switch between editable and non-editable PDFs. Please check out a blog post I wrote, which explains these settings.

    In any case, if you work with a lot of scanned PDFs, an up-to-date version of FineReader would be the way to go. With FineReader, you can select the recognition language, include/exclude images, draw/edit tables, add words/symbols to custom dictionaries, etc.

    HTH,

    Emma

  • Hi Emma,
    thanks for your answer and the link - your blog article is very helpful.
    I tried all the different settings before, but nothing works properly, the best way is still my ancient version of FineReader.
    Your answer explains, by the way, something I should have known before I purchased the Studio upgrade. The SDL marketing promises something the program cannot deliver. Where could I have found the information about the 14 supported languages? Should I, as an end user, know (or care) which engine one particular function of a complex software product is based on? If the advertisement says "you can work with uneditable pdf files", you naturally expect that you really can work with those files. By the way, the recognition results for German PDF files are poor too, although this language is supposed to be supported. And I do not mean the resulting layout, graphics etc. - I mean just the result of simple OCR text recognition.
    I therefore think SDL should say loudly and clearly before you purchase the advertised product: We support only 14 languages - some of them with debatable quality. Do not expect miracles; the best way for you is to purchase a product that really supports PDF.
    As a paying customer I cannot agree with Paul, who says "there really isn't anything SDL need to resolve here"...
    In my opinion the SDL marketing promises something the SDL developers can not / do not want to realize.
    Best regards,
    Adrian
  • Unknown said:

    In my opinion the SDL marketing promises something the SDL developers can not / do not want to realize.

    Adrian

    Let me just clarify that the SDL developers are not part of the game here because the PDF module is a third-party product (Solid PDF Converter), for which SDL bought a license to be able to incorporate it into Studio. It has not been developed by SDL.

    Walter

  • Hi Adrian,

    The extreme OCR functionality that you feel SDL Trados Studio 2015 should be capable of (any language, even scanned documents) was never promised to those of us who beta tested the product; if it had been, we would have challenged it on finding the functionality to be less than expected.

    However, I think most of us understand that accessible text PDF is a hugely complex format coming from multiple sources containing front-end and background content such as images, user and web interactivity, fonts, etc. Portable Document Format was devised as a means of taking information across the gaps between incompatible operating systems or software formats but was not created specifically to be word processed, let alone translated. Then scanned PDF format, often basically an image, is completely different again.

    I would be totally amazed if any software as complex as Studio, designed for such a wide range of functionality, could do what you're hoping, if it wasn't specific OCR software. I've been using the software for many years and it has become something quite amazing over the years. It has so much functionality that makes our lives as translators easier and our work so much more efficient and competitive.

    All the best,
    Ali
  • Hi Walter,
    OK, I can understand that SDL did not develop the PDF module. But in fact the PDF converter is just one part of a very complex piece of software sold by SDL. And it was not the developer of Solid PDF Converter who sent me dozens of emails with Studio 2015 ads... And part of these self-praising messages was: Studio 2015 can work with uneditable PDF documents. Period. That is the message to me - I am the end user, who is not able to investigate who developed which part of Studio and, to be honest, I'm not even interested. I just want to get the functionality that was advertised.
    Regards,
    Adrian
  • Hi Ali,
    Let me say first that I am convinced that Studio (Trados in the past) has always been and is still the best tool for translators. That is a fact.
    I can understand that the software is still being developed and improved. But the developer of a software should not sell a product that does not work as merchandised!
    Unfortunately, not all translators are beta testers of Studio. That means that we, normal people and end users, could not know what SDL intended to sell - a OCR converter that can convert texts just in 14 languages, and even not properly.
    We, normal end users, just got dozens of advertising emails - and those emails (and the SDL web too) said, "Studio can convert uneditable PDF files into a translatable text". This is the information I got as a non-beta-tester, and it is the only information I got from SDL (nowhere could I read that my expectations should not be toooooo high, because ... (some excuses)). I repeat: my 15-year-old version of FineReader supports 63 (!!!) languages. And the quality of recognition is very, very high: with "normal" or not very well scanned files, approx. 90-95% of the text is usable, and when I get a file created digitally as a non-editable (picture) PDF, I hardly need to edit the resulting Word file at all. Even if the formatting of the source text is very complicated, FineReader at least delivers a usable, translatable text I can work with (I can choose whether to save it as a Word file or just as plain text).
    After this experience with this 15-year-old piece of software, I really did believe the SDL advertisement that Studio CAN WORK WITH PDF FILES, as promised.
    Let me repeat, please: not everybody is a beta tester who is familiar with the intentions of the software developer. That means that we, simple users, depend on the honesty of the SDL marketing...
    Regards,
    Adrian
  • Unknown said:

    ... I am convinced that Studio (Trados in the past) has always been and is still the best tool for translators

    ... the developer of a software should not sell a product that does not work as merchandised!

    ... SDL intended to sell - a OCR converter that can convert texts just in 14 languages, and even not properly.

    ... advertising emails ... said, "Studio can convert uneditable PDF files into a translatable text".

    ... we, simple users, depend on the honesty of the SDL marketing...

    Hi Adrian,

    Taking the above points in order.

    I'm glad you have been able to benefit from the excellence of Studio and its predecessors, it really is a stonking piece of kit - I love it, in case you couldn't tell ;-) 

    Re your second point, it's not the developers who do the selling (but I'm 'splitting hairs' in pointing that out...)

    We can't really surmise what 'SDL' intended to sell because SDL is not a single entity, it's a huge composite of many departments (ditto)

    Indeed the ads did predict that Studio can convert uneditable PDFs into translatable text - it can, but not as well as dedicated OCR software, or in as many languages, or as some unfortunate users had hoped. I would imagine this functionality will continue to improve so that the occasions when a PDF is just too far removed from what Studio and the 3rd party OCR software it integrated is capable of handling become less frequent.

    Regarding depending on the honesty of marketing, I had a friend who was a programmer for an engineering control software house. His biggest complaint was that the sales people made promises of functionality before the programmers had even started work on it. He enjoyed complaining and was ignoring the fact of life that demand dictates progression. Here, of course, we have a more complex situation than that - 3rd party software integration. To program the interaction required between the two on top of everything else the amazing programmers of Studio have achieved, is a pretty tall order.

    All this being said, I am not arguing with you, just addressing the points you raised out of politeness and I totally understand your perspective and disappointment :)

    All the best,

    Ali