CleanUp Tasks

So what does this tool do?

You can lock segments based on structure or content
You can remove unwanted tags in the source
You can modify the source or target text as you like and create “settings” files for easy reuse
You can create tags for embedded xml or html content
You can create placeholders for fixed words or phrases

Some of the above is possible already with other tools, but the best part is this is a Batch Task, so you can run it directly in Trados. If you think any of the above may be of interest, please read on.

New Batch Task Menu Items:

The tool adds 2 new items to your batch task menu:

Cleanup Source

When you click on Cleanup Source and then hit “Next”, you will be greeted with the following screen:

Locking segments

You can lock segments based on search expressions using the left-hand box (the Content Locker). In order to lock based on the document structure, use the right-hand box (the Structure Locker).

Content Locker Example

I mainly translate from Japanese to English, and I often get segments that contain no Japanese characters. It can be useful to lock these. The following regular expression checks for such segments: ^[^亜-熙ぁ-んァ-ヶ]+$

Make sure you turn on Regex for the above to work.
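To try the expression outside Trados, here is a minimal Python sketch. The function names and sample segments are made up for illustration and are not part of the plug-in. One caveat: Python compares character ranges by Unicode code point, so 亜-熙 covers only U+4E9C to U+7199 and misses some kanji (for example 上 and 記). The broader CJK range in the second pattern is my own assumption, not something the tool prescribes, and whether the plug-in's regex engine orders ranges the same way is also an assumption.

```python
import re

# Pattern quoted from the text above: the whole segment must consist of
# characters outside the listed kanji, hiragana and katakana ranges.
QUOTED = re.compile(r"^[^亜-熙ぁ-んァ-ヶ]+$")

# Hypothetical broader variant: U+4E00-U+9FFF is the CJK Unified Ideographs block.
BROADER = re.compile(r"^[^\u4E00-\u9FFFぁ-んァ-ヶ]+$")


def has_no_japanese(segment, pattern=QUOTED):
    """Return True when the segment would be a candidate for locking."""
    return pattern.match(segment) is not None


samples = ["ABC-1234", "Trados Studio 2021", "Version 2 を参照", "上記"]
for sample in samples:
    print(sample, has_no_japanese(sample), has_no_japanese(sample, BROADER))
```

With the quoted pattern, the kanji-only segment 上記 is reported as containing no Japanese, so it is worth testing the expression against your own files before locking anything.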

The headers in the above screenshot are abbreviated for space reasons, so they might be a little difficult to understand:

Regex: Regular expression matching
Case: Case-sensitive searching
Whole: Whole word matching

Structure Locker Example

This should be straightforward: the structure information is read from the sdlxliff files of the project. The example file I used happens to be an Excel file, which is why you see items like sdl:worksheet and sdl:textbox. In the following screenshot, I selected sdl:textbox to lock any text that appears in text boxes.

Removing tags

The plug-in divides tags into two categories, Formatting Tags and Placeholder Tags:

Formatting Tags: These always start with <cf>.

<cf> tags can contain a range of information such as font name, font size, italic, bold, etc. In Example 1 below, each tag contains the font name and size only, while Example 2 contains an italic="True" attribute.

Example 1 (Font Name and Size):

Example 2 (italic="True"):

In order to remove the tags in Example 1, you need to select Font Name and Font Size (see screenshot below), since the tag specifies both of these:

However, the tag in Example 2 will not be removed as it contains italic="True". To remove this tag, you also need to select Italic:
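The selection behaviour described above can be read as a subset check: a formatting tag is removed only when every attribute it carries has been selected. The sketch below is my reading of that behaviour, not the plug-in's actual code, and the attribute names and example tags are hypothetical.

```python
def is_removable(tag_attributes, selected):
    """A tag is removable only if all of its attributes are among the selected ones."""
    return set(tag_attributes) <= set(selected)


# Hypothetical stand-ins for the two examples above.
example_1 = {"font_name", "font_size"}
example_2 = {"font_name", "font_size", "italic"}

selected = {"font_name", "font_size"}
print(is_removable(example_1, selected))  # True
print(is_removable(example_2, selected))  # False

selected.add("italic")
print(is_removable(example_2, selected))  # True
```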

Placeholder Tags:

In short, these are the <ph> (Placeholder) tags in the sdlxliff file. Sometimes they contain inline formatting which may not be needed.

I would exercise caution when removing these tags, though, as they are often necessary!

In the following screenshot, the <br> tags are used for aligning text in text boxes in the original Excel file. They are probably required, but there might be times when you want to remove this type of formatting.

Currently, I do not permit removing tags other than the types above. Let me know if you have a use case for removing other types of tags.

Modifying text

Now to the main part of the plug-in. When you first start out, you will have an empty screen like below:

First, click on the New button to create a new “Conversion File”.

The following window should pop up and it will appear blank at first:

Click the “+” mark in the top right corner as shown and a new row will be added to the grid like so:

Now, I would like to demonstrate a few use cases to show how to use the tool.

Use Case: Converting wide characters to their narrow equivalent

In Japanese text, wide and narrow forms of characters are used:

Wide	Narrow
ＡＢＣＤ	ABCD
１２３４	1234
カタカナ	ｶﾀｶﾅ

One issue is that, depending on the client, they may use different forms in their documents. You may even find a mix of these forms in the same document. These mixed forms can also cause problems with your matching results, and your translation memories will be cluttered with them.

One solution is to unify these forms before translation:

In the above screenshot I have created 3 rules:

Wide to narrow: Alphabetic
- Ensure all alphabetic characters are narrow
Wide to narrow: Numbers
- Ensure all numbers are narrow
Narrow to wide: Katakana
- Ensure all Katakana characters are wide
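For readers who want to experiment with the same idea outside Trados, the three rules can be approximated in Python. This is a sketch under assumptions: it uses Unicode NFKC normalization on each match rather than whatever the plug-in does internally, and the rule list and function names are invented for the example.

```python
import re
import unicodedata


def normalize_match(match):
    """Replace the matched run with its NFKC-normalized form."""
    return unicodedata.normalize("NFKC", match.group())


# (title, search pattern) pairs mirroring the three rules described above.
RULES = [
    ("Wide to narrow: Alphabetic", re.compile(r"[Ａ-Ｚａ-ｚ]+")),
    ("Wide to narrow: Numbers", re.compile(r"[０-９]+")),
    # U+FF66-U+FF9F: halfwidth katakana letters and sound marks.
    ("Narrow to wide: Katakana", re.compile(r"[\uFF66-\uFF9F]+")),
]


def unify_widths(text):
    """Apply every rule in order and return the unified text."""
    for _title, pattern in RULES:
        text = pattern.sub(normalize_match, text)
    return text


print(unify_widths("ＡＢＣＤ １２３４ ｶﾀｶﾅ"))
```

Restricting NFKC to the matched runs keeps the rest of the segment untouched, which matters because NFKC applied to a whole string also rewrites other compatibility characters.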

To create a rule, you enter your information in the input area shown below:

Title: This field can be left blank. It just gives a description of the search item and makes it easier to find an item in the grid view.
Search: The text you want to search for. In the example, I use a regular expression to search for a single wide alphabetic character, though it would probably be more efficient to use [Ａ-Ｚ]+ to search for groups of characters.
Search Settings: The search settings explained from left to right are:
- Case Sensitive: Case sensitive searching
- Regex: Use regular expression matching
- Whole Word: Match whole words
- Tag Pair
- Embedded Tags
- StrConv

Embedded Tags

A common issue with translations is handling embedded tags.

For example:

The cleanup tasks tool provides a way to convert these into “real” tags.

You could use the following setting:

The above setting will detect the <b> tag in the example.

When you run the task on the example, it will be converted as shown below:

Important Note: In my example, I did not show a setting for converting the <span> tag. This is important: even though I only created a rule to detect the <b> tag, the plug-in will convert all tags it finds within the segment.

Now, when you generate the target translations, any converted tags will be restored to their former form:
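The round trip can be pictured with a small Python sketch. It is only an illustration of the behaviour described above: the tag pattern, the token format and the sample segment are all assumptions of mine, and the plug-in's real implementation may differ.

```python
import re

# Hypothetical pattern for anything that looks like an opening or closing tag.
TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def convert_embedded_tags(segment, rule_pattern):
    """Split a segment into ("text", ...) and ("tag", ...) tokens.

    If the rule does not match, the segment is returned as one text token.
    If it matches, every tag in the segment is converted, not only the
    tag that the rule detected.
    """
    if not re.search(rule_pattern, segment):
        return [("text", segment)]
    tokens, position = [], 0
    for match in TAG.finditer(segment):
        if match.start() > position:
            tokens.append(("text", segment[position:match.start()]))
        tokens.append(("tag", match.group()))
        position = match.end()
    if position < len(segment):
        tokens.append(("text", segment[position:]))
    return tokens


def restore(tokens):
    """Rebuild the original string, as happens when the target is generated."""
    return "".join(value for _kind, value in tokens)


segment = 'Press <b>Save</b> in the <span class="note">toolbar</span>.'
tokens = convert_embedded_tags(segment, r"<b>")
print([value for kind, value in tokens if kind == "tag"])
print(restore(tokens) == segment)
```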

StrConv

StrConv happens to be a handy function from Visual Basic. You can find it in a lot of Microsoft products, such as Office VBA.

The handy part is shown in the following screenshot (courtesy MSDN).

All the options above are available under the same names in the tool. For example, by selecting Narrow in the tool, I can convert any wide character to its narrow equivalent.
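To give a feel for what Narrow and Wide do, here is a rough Python approximation for the ASCII range only. It is not StrConv itself: the function names are mine, and katakana and other locale-dependent conversions that StrConv can perform are deliberately left out.

```python
# Fullwidth forms U+FF01-U+FF5E mirror ASCII U+0021-U+007E at a fixed offset.
OFFSET = 0xFEE0

TO_NARROW = {wide: wide - OFFSET for wide in range(0xFF01, 0xFF5F)}
TO_NARROW[0x3000] = 0x20  # ideographic space -> ASCII space
TO_WIDE = {narrow: wide for wide, narrow in TO_NARROW.items()}


def narrow(text):
    """Convert fullwidth ASCII characters to their halfwidth equivalents."""
    return text.translate(TO_NARROW)


def wide(text):
    """Convert halfwidth ASCII characters to their fullwidth equivalents."""
    return text.translate(TO_WIDE)


print(narrow("ＡＢＣ１２３"))
print(wide("ABC123"))
```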

When you turn on the StrConv option, the Replace window becomes greyed out.

Storing conversion files for reuse

One problem I have found with current solutions is that there is little ability for reuse. For example, SDLXLIFF Toolkit is a great tool, but you have to retype each item you need to search for. With this tool, click Save As in the bottom right corner to save your settings file for later use:

Once you have saved your file, it will appear in the following list.

Important Note: Order matters! Each file will be used for processing starting from top to bottom.
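A small Python sketch shows why the order makes a difference. The two "conversion files" here are plain lists of search and replace pairs invented for the example, and the plain-text replacement is a simplification of what the plug-in does.

```python
def apply_conversion_files(text, files):
    """Apply each file's rules in turn, from the top of the list to the bottom."""
    for rules in files:
        for search, replace in rules:
            text = text.replace(search, replace)
    return text


# Hypothetical conversion files.
widths = [("ＫＢ", "KB")]        # unify wide letters first
units = [("KB", " kilobytes")]  # then expand the abbreviation

print(apply_conversion_files("512ＫＢ", [widths, units]))  # 512 kilobytes
print(apply_conversion_files("512ＫＢ", [units, widths]))  # 512KB
```

In the second call the units file runs before the text has been unified, so its rule finds nothing to replace.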

I would recommend creating separate conversion files for each project, or dividing them into categories.

Tag Pair

I actually don’t know how useful this feature will be, but you can detect tag pairs in the source text and modify them.

For example, in the following screenshot, I look for a <cf highlight="yellow"> tag and replace the contents with some random text:

Another example is taking a tag pair and replacing it with a placeholder instead:

Say you had the following made-up <inline> tag pair in your XML file:

With the following rule (make sure placeholder is turned ON!):
