Difference between words (source) and characters (source) for Asian source languages

Hi all,

We are having issues with the word counts for Asian source languages, especially for Korean.

It looks like some of the instructions are a bit confusing or unclear:

https://docs.sdl.com/LiveContent/content/en-US/SDL%20WorldServer-v3/GUID-376E123B-1C7E-4D64-82B0-1D33F088ABD5

Here it says:

 

  • Each word is counted as one point to be added to the total scoped value.
  • For Chinese and Japanese, WorldServer has a special way to count words. Each character is considered a word. For these languages we are, effectively, counting characters. When a user sees "Words" in the WorldServer UI (for example, in scoping) for Chinese and Japanese source languages it actually means "Characters". If a content is a mixture of Chinese or Japanese and Latin-based languages, the appropriate word counting scheme is used for each language. For example, "WorldServer " is counted as 6 words.
  • In Korean, word counting is based on white spaces, not characters.
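
As a rough illustration of how the rules quoted above could be read (this is only a sketch, not WorldServer's actual implementation), the following Python snippet counts each Chinese or Japanese character as one word and counts everything else by white space. The function name, the Unicode ranges, and the sample strings are assumptions made for the example.

```python
import re

# Assumption: Han ideographs, Hiragana and Katakana each count as one "word".
# Hangul (Korean) is deliberately left out, so Korean falls back to white space.
CJ_CHAR = re.compile(r"[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff]")


def count_words(text: str) -> int:
    """Hypothetical helper mirroring the quoted counting rules."""
    # Every Chinese/Japanese character is one word.
    cj_count = len(CJ_CHAR.findall(text))
    # Blank out those characters, then count the remaining
    # white-space-delimited tokens (Latin-based and Korean text).
    rest = CJ_CHAR.sub(" ", text)
    return cj_count + len(rest.split())


print(count_words("WorldServer 是翻译管理"))  # 1 Latin word + 5 characters
print(count_words("안녕하세요 세계"))          # Korean: 2 white-space tokens
```

With this reading, a made-up mixed string of one Latin word followed by five Chinese characters comes out as 6 words, while a Korean string is counted purely by its spaces.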

 

 

If we use the new word-based tokenization, the Words column reports Asian-language words as identified by the new tokenization engine, and each Western-language word is also counted as one word. This typically results in a lower word count than the character-based count. Which one is correct, given that the character-based and word-based counts differ?


Also, for Korean, the documentation says that the word count is based on white spaces. This is very inaccurate, as Korean words are often not separated by spaces. It would make much more sense here to use the actual character count.
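
To show how far apart the two approaches can be for Korean, here is a small Python sketch comparing a white-space-based count with a character count. The sample sentence and the variable names are chosen only for illustration and say nothing about how WorldServer computes its figures internally.

```python
# Hypothetical sample: a short Korean sentence with a single space.
text = "한국어를 공부합니다"

# White-space-based count, as the documentation describes for Korean.
space_based = len(text.split())

# Character-based count: every non-white-space character counts once.
char_based = sum(1 for ch in text if not ch.isspace())

print(space_based, char_based)
```

For this sentence the white-space count is 2, while the character count is 9.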


Thanks for your insight!

Irene
