A Bug that needs to be addressed for the Tibetan script

I was recently in conversation with one of the moderators, Paul Filkin, about this, but I got an automated reply that he is currently not available, so I thought I would post it here.
I have been setting up SDL Trados Studio for about 3 weeks now, and I noticed that I was getting very poor leverage for Tibetan-English TMs and concordance search. After a lot of research, tests, and tutorials, I was able to identify the definite cause of the problem.
Let me explain first that in the Tibetan script there are no spaces between words; instead there are little dots called tseks. For example, ཐམས་ཅད་མཁྱེན་ཅིང་ཀུན་གཟིགས། is a phrase of six words with a dot between each word. Trados Studio is recognizing each of these dots as a character rather than a word delimiter, so it registers whole phrases as single words, and as a result I get very poor leverage for both fuzzy matching and concordance search.
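To make the point concrete, here is a minimal Python sketch of the difference. It is only an illustration of the idea, not how Trados Studio tokenizes internally; the function names are hypothetical, and I am assuming the closing stroke (the shad, U+0F0D) should also be treated as a delimiter.

```python
import re

TSEK = "\u0f0b"  # the tsek, the dot between Tibetan words
SHAD = "\u0f0d"  # the shad, the stroke that closes the phrase (assumed delimiter)

phrase = "ཐམས་ཅད་མཁྱེན་ཅིང་ཀུན་གཟིགས།"

def split_on_whitespace(text):
    """Split on whitespace only, so the tsek counts as an ordinary character."""
    return text.split()

def split_on_tsek(text):
    """Split on tsek, shad, and whitespace, dropping empty pieces."""
    return [t for t in re.split(f"[{TSEK}{SHAD}\\s]+", text) if t]

print(len(split_on_whitespace(phrase)))  # the whole phrase comes back as one "word"
print(split_on_tsek(phrase))             # the individual words
```

With whitespace-only splitting the whole phrase is a single token, which is what I believe is happening in my tests.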
I was able to confirm this by doing a simple test. In the document called "Test C Tibetan Script" I created a TM for a simple 8 lines of verse (one segment per line of verse). Then I ran this against a new source text where I altered a word or two on each line, so that we would expect a fuzzy match for each line, though I left one line unchanged as a 100% control.
Then I ran the exact same test in another document called "Test C Tibetan Romanized Transliteration", with the exact same TM and source, except that all the Tibetan was converted into Wylie Roman transliteration, where all of the punctuation dots are replaced by spaces. The example phrase from above would thus be rendered "thams cad mkhyen cing kun gzigs/".
The results were dramatically different. In the first test, with Tibetan script, only the control line that was a 100% match hit; all the other lines failed to make even the 50% threshold for fuzzy matching. However, in the second test, with Tibetan transliteration, all the lines registered as matches. The 100% line registered as such, and all the other lines registered in the 70-95% range that we would expect from a CAT tool like SDL Trados. I got similar results from the concordance searches.
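The pattern in these results is what one would expect from any word-level comparison. Here is a rough Python sketch using `difflib.SequenceMatcher` from the standard library as a stand-in scorer. I do not know Trados's actual fuzzy match algorithm, so the numbers are illustrative only, and the edited line below is a made-up example with one word swapped, not a line from my test files.

```python
import re
from difflib import SequenceMatcher

TSEK = "\u0f0b"  # the tsek, the dot between Tibetan words
SHAD = "\u0f0d"  # the shad, the stroke that closes the phrase

def tokens(text, tsek_is_delimiter):
    """Split into words; the tsek and shad only count as delimiters if asked."""
    pattern = f"[{TSEK}{SHAD}\\s]+" if tsek_is_delimiter else r"\s+"
    return [t for t in re.split(pattern, text) if t]

def similarity(a, b, tsek_is_delimiter):
    """Word-level similarity between 0.0 and 1.0 (stand-in for a fuzzy score)."""
    return SequenceMatcher(
        None, tokens(a, tsek_is_delimiter), tokens(b, tsek_is_delimiter)
    ).ratio()

# Stored TM line, and a hypothetical new line with one word changed.
stored = TSEK.join(["ཐམས", "ཅད", "མཁྱེན", "ཅིང", "ཀུན", "གཟིགས"]) + SHAD
edited = TSEK.join(["ཐམས", "ཅད", "མཁྱེན", "ཅིང", "ཡང", "གཟིགས"]) + SHAD

print(similarity(stored, edited, tsek_is_delimiter=False))  # tsek as a character
print(similarity(stored, edited, tsek_is_delimiter=True))   # tsek as a delimiter
print(similarity(stored, stored, tsek_is_delimiter=False))  # identical control line
```

When the tsek is an ordinary character, each line is one giant "word", so a line either matches exactly (the control) or not at all. When the tsek is a delimiter, five of the six words still match and the line scores as a normal fuzzy match.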
I have attached my tests to this post below for reference.
So basically, SDL Trados just needs a very small update to recognize the Tibetan punctuation properly. It should recognize all the tseks, the little punctuation dots (Unicode U+0F0B), as word delimiters equivalent to spaces (Unicode U+0020). This minor adjustment would make the platform as functional for Tibetan as it is for many other languages. I noticed, for instance, that another question in the community forum (the only other question tagged "Tibetan") was from a Tibetan-Chinese translator who was having a problem similar to what I am experiencing.
At this point I am 100% confident that this is the problem. Would it be possible to please inform someone in SDL Trados' development or IT to develop an update or patch to address this very simple fix? It would make a world of difference not just to me but anyone using your product for Tibetan translation. Otherwise the fuzzy match and concordance search function very poorly and one would get better results by just searching the TMs in a text editor.
After trying the platform for three weeks, I recently posted a review of SDL Trados Studio to a forum of Tibetan translators, and I had to give the platform a very poor review because of this issue. However, if this one issue were fixed, I would replace my poor review with a very positive one.
I note that it would be possible to convert all the Tibetan into Roman transliteration for use in SDL Trados, but this would be very impractical for myself and other translators: the Tibetan script is widely used, Roman transliteration is not very comfortable to read, and all of one's TMs in Tibetan script would need to be converted, which would be a complicated process. Also, any new users of SDL Trados would expect the Tibetan script to work and would encounter the same problem that I had.
The tests are attached here, each accompanied by the .sdltm that I ran it against. I am happy to communicate with developers or other staff if there are any questions about the Tibetan script.
Thank you very much for taking the time to look into this.
Best wishes,
-Celso
  • Hi

    Yes, thank you for persevering with these tests.  I played around with this today and can obviously reproduce this, although without your help I would not have been able to identify the likely problems.  I copied  as he may be interested to take a look at this problem too.

    Paul Filkin | RWS Group


  • Hmmm, lately I've been trying and reviewing some other CAT platforms. I noticed that on CafeTran I had this exact same problem with the Tibetan tsek punctuation, and CafeTran was getting no fuzzy match leverage. However, when I brought it up in their forum, they pointed out a configuration setting in the platform's preferences labeled "Additional space characters". After I added the Unicode for the Tibetan tsek punctuation, the fuzzy matches worked splendidly.

    I'm wondering, isn't there some SDL Trados configuration that can do this? I feel like if there was, it would be easy for me to have missed it while I was testing the platform. But it is a very simple coding correction, right? It just needs to recognize for Tibetan that (U+0F0B) = (U+0020). This configuration on CafeTran showed me that this was all that was needed.

    Please do let me know if there is any update on this, as I would like to give SDL another go. So far it is the only CAT platform I've found that has a good workaround for annotating translations from the editor, which makes it a great platform for academic/scholastic translation.
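
For anyone curious what a setting like this amounts to, here is a small Python sketch of the idea. It is my own guess at the behaviour, not CafeTran's or Trados's actual code; `make_space_normalizer` and its parameter are hypothetical names borrowed from the label of the setting.

```python
def make_space_normalizer(additional_space_chars):
    """Build a function that treats each listed character as a plain space."""
    table = {ord(ch): " " for ch in additional_space_chars}

    def normalize(text):
        # Map every configured character to U+0020, then collapse runs of spaces.
        return " ".join(text.translate(table).split())

    return normalize

# Configure the Tibetan tsek (U+0F0B) as an additional space character.
normalize = make_space_normalizer("\u0f0b")
print(normalize("ཐམས་ཅད་མཁྱེན་ཅིང་ཀུན་གཟིགས།"))
```

Keeping the extra characters in a user-editable list rather than hard-coding them would also let translators of other scripts add their own delimiters.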
