The PDF Assistant for Trados is an Add-In for Trados Studio that supports the conversion of a PDF to a DOCX so it can be successfully translated and delivered as a DOCX target file.
Table of Contents
Installation
The application is an sdlplugin and can be installed either by visiting the RWS AppStore, downloading, and then manually installing by double clicking the sdlplugin file in the usual way. Alternatively the plugin can be installed throgh the Integrated AppStore in Trados Studio. For this to work you must have Microsoft Office installed. The testing was carried out on computers using Office 365 and not on older versions.
Where is it installed?
The plugin is installed into the ribbon in the "Add-Ins" tab and into the "Toolbox" group:
Working with PDFs
The application is designed to support the conversion of PDF files into a DOCX so that you can improve the quality of the DOCX prior to translating it in Trados Studio. The reason we have taken this approach is because PDF to DOCX conversion without professional editing software can sometimes cause formatting issues, resulting in a document that looks different from the original PDF.
The more common problems that can occur during PDF to DOCX conversion would be things like:
-
Text and image placement: Sometimes, the text and image placement can become distorted during conversion, causing the final document to look different from the original PDF.
-
Formatting issues: PDFs often have complex formatting, such as columns, tables, and graphs. These elements can be difficult to convert to DOCX, leading to formatting issues in the final document.
-
Fonts: If the PDF contains fonts that are not installed on the computer doing the conversion, the text can appear differently in the final document.
-
Large files: PDF files can be very large, and converting them to DOCX can result in large files that take up a lot of storage space.
-
Security features: Some PDFs have security features that prevent copying and pasting, which can make it difficult to convert the document to DOCX.
-
OCR issues: If the PDF contains scanned images or text that was not originally digital, OCR (optical character recognition) software is needed to convert the text. However, OCR can sometimes produce errors or miss characters, leading to mistakes in the final document.
-
Unnecessary Tags: any of the above problems can lead to many unnecessary control tags being inserted into the DOCX that will become visible when working with a translation tool.
-
Poor Segmentation: similarly any of the above issues can lead to unnecessary hard returns being added into the DOCX and these will also make translation more difficult than is necessary.
-
Incorrect character display: If the character encoding is incorrect, it can cause characters to be displayed incorrectly in the final document. For example, some characters may appear as question marks or boxes especially with Asian character sets.
-
Missing characters: In some cases, incorrect encoding can cause certain characters to be missing from the final document. This can result in text that is difficult to read or understand.
-
Encoding conflicts: If different parts of the document are encoded in different ways, it can cause conflicts and errors during conversion. For example, some characters may be encoded in UTF-8 while others are encoded in ASCII, leading to errors when the document is converted to a PDF or other format.
It's important to note that the quality of the conversion largely depends on the quality of the original PDF and the conversion software used. Some conversion tools may produce better results than others. This "Add-In" initially makes use of the Microsoft Word desktop API providing simple text conversion and also some OCR capabilities. Whilst you could simply use Word and avoid the "Add-In" altogether it's worth noting that the plugin does provide more support than Microsoft makes available through Microsoft Word, in particular around OCR capability.
Using the "Add-in"
Adding your files
The PDF Assistant for Trados is started by clicking on the icon in the ribbon. This opens up a small wizard where you can add your files:
You can add as many files as you like, in as many languages as you like, but keep in mind the process could take a considerable amount of time and may even run out of memory if you ask for too much. How many files you can use really depends on the number of pages, number of images in the file, amount of OCR work required etc. Think about the work you are about to carry out and don't expect miracles!
The files or folders can be added via drag and drop, or by using the small icons in the wizard. In this example two PDF files have been added. An English language text containing two images, one that needs to be OCR'd and one that does not; and a Korean document that is non-readable, so the entire content is one big image in the PDF.
Selecting your Provider and OCR options
This screen allows you to do several things:
- select the PDF Assistant you wish to use. For now there is only Microsoft Word to select from.
- check the option to specify whether or not you wish to extract text from the images and if so (in the next screens) which ones you would like to be processed (OCR'd)
- keep in mid that when you OCR the images you will lose any background image that was there and will only have the text that the software was able to extract
You can cancel the process at any time if the file is too complex for the application to manage:
Image Selection
This part of the wizard will extract the images the software was able to identify and allow you to specify which of the images contain translatable text:
In this example I have only selected two images for OCR'ing... the table image in the English file and a small banner in the Korean. I can then click next to be presented with the Summary
Summary Stage
This screen in this stage of the wizard displays a summary of the options you have chosen for the conversion:
Preparation
The final stage provides an indication of the progress until the conversion has completed:
It is possible that some PDF files cannot be processed and the Word API could cause the conversion to get stuck like this:
Or return an error. If it gets stuck you'll probably need to crash out of Studio, and possibly Microsoft Word as well. Do this via the task manager (Ctrl+Shift+Escape) and "End task":
When this occurs you will probably find that Microsoft Word can open the file, but it will remain as an uneditable image. In this case you may be left with these options:
- you need a more sophisticated conversion software to be able to manage the file
- transcribe the file and recreate it, add any images in by scanning the document and cutting out the images you need
- try to get a copy of the original source file prior to being converted into a PDF (this would really be preferable to a PDF from the very start!)
PDF translation is not a fool proof business!!
DTP the converted files before Translation
Now you can open your converted PDF files as a DOCX in Microsoft Word and improve the quality of the file before you translate it. This way the target file will probably be ready to go, or at least require minimal editing to accommodate changes required as a result of text expansion/contraction in the target language.
A good tool for tidying up files resulting from a messy PDF conversion is TransTools available here - https://www.translatortools.net/products/transtools
In the example files, the English file contained two images, one that was OCR'd and the other treated as an image. The result isn't bad (PDF on the left, converted DOCX on the right) and if you were to open this PDF file in Microsoft Word both images would be handled as images, so the "Add-In" does provide considerable value here. The table needs tidying up but it is editable and could save time when more extensive text is involved:
On the Korean non-readable PDF. Some formatting would be required, but it's not too bad. The image is floating and can be positioned wherever I like, and all the text is available to me for translation. So some small amount of DTP work and I'll have a file that is easily translatable and the target file should be good with minimum work required: