The PDF Assistant for Trados is an Add-In for Trados Studio that supports the conversion of a PDF to a DOCX so it can be successfully translated and delivered as a DOCX target file.
Installation
The application is an sdlplugin. You can install it either by downloading it from the RWS AppStore and double-clicking the sdlplugin file in the usual way, or through the Integrated AppStore in Trados Studio.
Where is it installed?
The plugin is installed in the ribbon, in the "Toolbox" group of the "Add-Ins" tab:
Working with PDFs
The application is designed to support the conversion of PDF files into DOCX so that you can improve the quality of the DOCX prior to translating it in Trados Studio. We have taken this approach because PDF to DOCX conversion without professional editing software can cause formatting issues, resulting in a document that looks different from the original PDF.
The more common problems that can occur during PDF to DOCX conversion include:
- Text and image placement: Sometimes, the text and image placement can become distorted during conversion, causing the final document to look different from the original PDF.
- Formatting issues: PDFs often have complex formatting, such as columns, tables, and graphs. These elements can be difficult to convert to DOCX, leading to formatting issues in the final document.
- Fonts: If the PDF contains fonts that are not installed on the computer doing the conversion, the text can appear differently in the final document.
- Large files: PDF files can be very large, and converting them to DOCX can result in large files that take up a lot of storage space.
- Security features: Some PDFs have security features that prevent copying and pasting, which can make it difficult to convert the document to DOCX.
- OCR issues: If the PDF contains scanned images or text that was not originally digital, OCR (optical character recognition) software is needed to convert the text. However, OCR can sometimes produce errors or miss characters, leading to mistakes in the final document.
- Unnecessary tags: Any of the above problems can lead to many unnecessary control tags being inserted into the DOCX that will become visible when working with a translation tool.
- Poor segmentation: Similarly, any of the above issues can lead to unnecessary hard returns being added into the DOCX and these will also make translation more difficult than is necessary.
- Incorrect character display: If the character encoding is incorrect, it can cause characters to be displayed incorrectly in the final document. For example, some characters may appear as question marks or boxes, especially with Asian character sets.
- Missing characters: In some cases, incorrect encoding can cause certain characters to be missing from the final document. This can result in text that is difficult to read or understand.
- Encoding conflicts: If different parts of the document are encoded in different ways, it can cause conflicts and errors during conversion. For example, some characters may be encoded in UTF-8 while others are encoded in ASCII, leading to errors when the document is converted to DOCX or another format.
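The encoding problems described in the last three items can be reproduced in a few lines. The following is a minimal Python sketch and is not part of the plugin; the helper name `decode_report` and the sample text are illustrative assumptions. It decodes the same UTF-8 bytes once with the correct encoding and once as ASCII, and counts the replacement characters that appear when the encoding is wrong.

```python
from __future__ import annotations


def decode_report(raw: bytes, encoding: str) -> tuple[str, int]:
    """Decode raw bytes, substituting U+FFFD for anything undecodable.

    Returns the decoded text and the number of replacement characters,
    which is a rough indicator of an encoding mismatch.
    """
    text = raw.decode(encoding, errors="replace")
    return text, text.count("\ufffd")


# Hypothetical sample: the Korean word for "translation", stored as UTF-8.
sample = "번역"
raw = sample.encode("utf-8")

good_text, good_count = decode_report(raw, "utf-8")
bad_text, bad_count = decode_report(raw, "ascii")

print(f"utf-8 : {good_text!r} replacements={good_count}")
print(f"ascii : {bad_text!r} replacements={bad_count}")
```

Decoded as UTF-8, the text round-trips cleanly. Decoded as ASCII, the original characters are lost and only replacement characters remain, which is the kind of "question marks or boxes" output described above.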
The quality of the conversion depends largely on the quality of the original PDF and the conversion software used, and some conversion tools produce better results than others. For now, this "Add-In" makes use of the Microsoft Word desktop API, which provides simple text conversion and some OCR capabilities. Whilst you could simply use Word and avoid the "Add-In" altogether, the plugin does provide more support than Microsoft makes available through Microsoft Word, in particular around OCR capability.
Using the "Add-in"
Adding your files
The PDF Assistant for Trados is started by clicking on the icon in the ribbon. This opens up a small wizard where you can add your files:
You can add as many files as you like, in as many languages as you like, but keep in mind the process could take a considerable amount of time and may even run out of memory if you ask for too much. How many files you can use really depends on the number of pages, number of images in the file, amount of OCR work required etc. Think about the work you are about to carry out and don't expect miracles!
The files or folders can be added via drag and drop, or by using the small icons in the wizard. In this example two PDF files have been added: an English-language text containing two images, one that needs to be OCR'd and one that does not, and a Korean document that is non-readable, so the entire content is one big image in the PDF.
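Since processing time depends on the size and content of the files, it can help to check what is in a folder before dragging it into the wizard. The sketch below is an illustration in Python and not part of the plugin; the function name `list_pdfs_by_size` is hypothetical, and file size is only a rough stand-in for page count, image count, and the amount of OCR work required.

```python
from pathlib import Path


def list_pdfs_by_size(folder):
    """Return (file name, size in MiB) for each PDF in folder, largest first.

    Only the top level of the folder is inspected; subfolders are ignored.
    """
    pdfs = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    ]
    sized = [(p.name, p.stat().st_size / (1024 * 1024)) for p in pdfs]
    return sorted(sized, key=lambda item: item[1], reverse=True)


# Example: report the PDFs in the current working directory.
for name, mib in list_pdfs_by_size("."):
    print(f"{mib:8.2f} MiB  {name}")
```

If the largest files dominate the list, converting them in a separate run is one way to keep each batch manageable.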
Selecting your Provider and OCR options
This screen allows you to do several things:
- select the PDF Assistant you wish to use. For now there is only Microsoft Word to select from.
- check the options to specify whether or not you wish to extract text from the images and if so which ones you would like to be processed (OCR'd)
- keep in mind that when you OCR the images you will lose any background image that was there and will only have the text that the software was able to extract
Image Selection
This part of the wizard will extract the images the software was able to identify and allow you to specify which of the images contain translatable text:
In this example I have selected only two images for OCR: the table image in the English file and a small banner in the Korean file. I can then click Next to be presented with the Summary.
Summary Stage
This stage of the wizard displays a summary of the options you have chosen for the conversion:
Preparation
The final stage provides an indication of the progress until the conversion has completed:
It is possible that some files cannot be processed, in which case the Word API produces a result like this:
When this occurs you will probably find that Microsoft Word can open the file, but it will remain as an uneditable image. In this case you need more sophisticated conversion software to be able to manage the file.