A Bug that needs to be addressed for the Tibetan script

I was recently in conversation with one of the moderators, Paul Filkin, about this, but I got an automated reply that he is currently not available, so I thought I would post this here.
I have been setting up SDL Trados Studio for about three weeks now, and I noticed that I was getting very poor leverage for Tibetan-English TMs and concordance search. After a lot of research, tests, and tutorials, I was able to identify the definite cause of the problem.
Let me explain first that in the Tibetan script there are no spaces between words. Instead there are little dots called tseks, e.g., ཐམས་ཅད་མཁྱེན་ཅིང་ཀུན་གཟིགས། is a phrase with six words and a dot between each word. Trados Studio is treating each of these dots as an ordinary character rather than a word delimiter, so it registers whole phrases as single words, and as a result it gets very poor leverage for both fuzzy matching and concordance search.
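To make the tokenization point concrete, here is a minimal Python sketch. It is only an illustration and makes no claim about how Trados Studio tokenizes internally; the function names are hypothetical. It assumes the dots are the standard tsek (U+0F0B) and that the closing stroke is the shad (U+0F0D).

```python
import re

TSEK = "\u0f0b"  # the dot between syllables
SHAD = "\u0f0d"  # the vertical stroke that closes the phrase

def split_on_spaces(text):
    """Tokenize on whitespace only (the tsek is treated as an ordinary character)."""
    return text.split()

def split_on_tsek(text):
    """Tokenize on whitespace, tsek, and shad (the tsek is treated as a delimiter)."""
    return [t for t in re.split(r"[\s\u0f0b\u0f0d]+", text) if t]

# The example phrase, built from its parts so the delimiters are explicit.
phrase = TSEK.join(["ཐམས", "ཅད", "མཁྱེན", "ཅིང", "ཀུན", "གཟིགས"]) + SHAD

print(len(split_on_spaces(phrase)))  # the whole phrase counts as one token
print(len(split_on_tsek(phrase)))    # one token per tsek-separated unit
```

With whitespace-only splitting the entire phrase comes back as a single token, which is the behavior I believe I am seeing.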
I was able to confirm this with a simple test. In the document called "Test C Tibetan Script" I created a TM from a simple eight lines of verse (one segment per line of verse). Then I ran this against a new source text in which I altered a word or two on each line, so that we would expect a fuzzy match for each line, though I left one line unchanged as a 100% control.
Then I ran the exact same test in another document called "Test C Tibetan Romanized Transliteration": the exact same TM and source, except that all the Tibetan was converted into the Wylie Roman transliteration, where all of the punctuation dots are replaced by spaces. The example phrase from above would thus be rendered "thams cad mkhyen cing kun gzigs/".
The results were dramatically different. In the first test, with Tibetan script, only the control line that was a 100% match hit; all the other lines failed to reach even the 50% threshold for fuzzy matching. However, in the second test, with Tibetan transliteration, all the lines registered as matches. The 100% line registered as such, and all the other lines registered in the 70-95% range that we would expect from a CAT tool like SDL Trados. I got similar results from the concordance searches.
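The pattern in these results is what one would expect from any word-level similarity measure. The sketch below is a toy model, not the Trados matching algorithm: it uses Python's `difflib.SequenceMatcher` over token lists as a stand-in scorer, and the helper names and the altered word are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

TSEK = "\u0f0b"  # the dot between syllables
SHAD = "\u0f0d"  # the stroke that closes the line

def tokens(text, tsek_is_delimiter):
    """Split into tokens, optionally treating tsek and shad as delimiters."""
    pattern = r"[\s\u0f0b\u0f0d]+" if tsek_is_delimiter else r"\s+"
    return [t for t in re.split(pattern, text) if t]

def word_level_score(a, b, tsek_is_delimiter):
    """Toy fuzzy score: similarity ratio between the two token sequences."""
    return SequenceMatcher(
        None, tokens(a, tsek_is_delimiter), tokens(b, tsek_is_delimiter)
    ).ratio()

# A stored TM segment and a new source line with the last word altered.
stored = TSEK.join(["ཐམས", "ཅད", "མཁྱེན", "ཅིང", "ཀུན", "གཟིགས"]) + SHAD
edited = TSEK.join(["ཐམས", "ཅད", "མཁྱེན", "ཅིང", "ཀུན", "ཡོད"]) + SHAD

print(word_level_score(stored, edited, tsek_is_delimiter=False))  # 0.0
print(word_level_score(stored, edited, tsek_is_delimiter=True))   # about 0.83
print(word_level_score(stored, stored, tsek_is_delimiter=False))  # 1.0
```

When the tsek is not a delimiter, each line is one giant "word", so any change at all makes the two lines completely different and only the untouched control line matches. When the tsek is a delimiter, five of the six words still match and the score lands in the normal fuzzy range.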
I have attached my tests to this post below for reference.
So basically, SDL Trados just needs a very small update to recognize the Tibetan punctuation properly. It should recognize all the tseks, the little punctuation dots (Unicode = U+0F0B), as word delimiters equivalent to spaces (Unicode = U+0020). This minor adjustment would make the platform as functional for Tibetan as it is for many other languages. I noticed, for instance, that another question in the community forum (the only other question tagged "Tibetan") was from a Tibetan-Chinese translator who was having a problem similar to what I am experiencing.
At this point I am 100% confident that this is the problem. Would it be possible to please inform someone in SDL Trados' development or IT to develop an update or patch to address this very simple fix? It would make a world of difference not just to me but anyone using your product for Tibetan translation. Otherwise the fuzzy match and concordance search function very poorly and one would get better results by just searching the TMs in a text editor.
After trying the platform for three weeks, I recently posted a review of SDL Trados Studio to a forum of Tibetan translators, and I had to give the platform a very poor review because of this issue. However, if this one issue were fixed, I would replace my poor review with a very positive one.
I note that it would be possible to convert all the Tibetan into Roman transliteration for use in SDL Trados, but this would be very impractical for myself and other translators: the Tibetan script is widely used, Roman transliteration is not very comfortable to read, and all of one's TMs in Tibetan script would need to be converted, which would be a complicated process. Also, any new users of SDL Trados would expect the Tibetan script to work and would encounter the same problem that I had.
The tests are attached here, each accompanied by the .sdltm that I ran it against. I am happy to communicate with developers or other staff if there are any questions about the Tibetan script.
Thank you very much for taking the time to look into this.
Best wishes,
-Celso
  • Hi

    Yes, thank you for persevering with these tests.  I played around with this today and can obviously reproduce this, although without your help I would not have been able to identify the likely problems.  I copied  as he may be interested to take a look at this problem too.

    Paul Filkin | RWS Group


  • Yes, thank you for being so responsive. I think it should be a really easy fix; it is just a matter of how the platform reads the Unicode for this one punctuation character. I was almost wondering if there was a fix via a preference or option within SDL Trados, but I was unable to find a setting that would change how the platform reads Unicode characters when performing fuzzy matching and concordance searches, so I suspect it is something that needs to be fixed in the code.

    I suspect there just haven't been a lot of Tibetan language translators using SDL yet to notice the problem.

    My trial has actually ended now, and I don't want to purchase the platform without knowing this is functioning, but I will stay in touch in the forum here. If this can be corrected, I will do another trial on a different computer to verify it, send word out to my translators group forum, and update my review.

    If there is any way I can help, such as explaining Tibetan language structure or further test examples just let me know.
