Term Recognition is not working for Tibetan language

I'm testing Trados for Tibetan translation and I'm trying to get a proof of concept for using a Tibetan termbase. It seems I cannot get any term recognition for a Tibetan source text. I created a simple bilingual termbase at the beginning of the project and then tested it by adding the simple term ཞི་མི་ = "cat". When I run a parallel test from English to French it works fine:

English to French:

Screenshot of Trados Studio showing English to French term recognition with 'cat' in English matched to 'chat' in French.

But, no results with Tibetan to English:

Screenshot of Trados Studio with a Tibetan to English termbase entry for 'cat' but no term recognition in the translation column.

I've tried a few different things here. Checking the "Use word-based tokenization for Asian source text" option does not seem to help.

I also thought it might be related to the editor not recognizing the Tibetan tsheg punctuation (the little dots between words) as word boundaries. This was previously a problem with TM matching, but it was fixed in the 2022 SR1 update. However, as in the example above, Roman transliteration with spaces ("zhi mi" = "cat") does not work either.

I know the terms are being added to the termbase because I see them in the termbase viewer and they come up in the termbase search, but they are still not being recognized.

It seems to only occur when the source is Tibetan. Does anyone have some insight into why this isn't working?



[edited by: Trados AI at 2:21 PM (GMT 0) on 5 Mar 2024]
  •   

    Asian languages can be tricky because there are no word boundaries, and I think this would almost certainly cause a problem here.  It's also very difficult for us to test properly because, as I have learned this morning with only 30 mins or so of investigation, Tibetan is a very complex (and interesting) language with its own unique grammatical rules and context-dependent meanings.  I had a go at testing this as follows.  Two files: one with no spaces, only the "tsek" as you mentioned (I think!), and the second with some forced spaces to test whether this made a difference... it didn't seem to:

    Screenshot of Trados Studio showing no translation results for Tibetan text with a red arrow pointing to the term recognition failure.

    1. the term recognition does not work
    2. termbase search seems to find everything... although I did set a very low fuzzy value to try and help, so it may have found far more than it should, given these characters are not in the Tibetan translation of "he was not there" as far as I can see.
    3. Find & replace can find these chars so they are definitely there

    Apologies for what are most likely completely nonsensical translations and terms, but I just wanted to have something to be able to replicate this problem for technical support to review.  If you have a better sample termbase and source text with translation, please send me the termbase and the sdlxliff so that I can use them to raise this problem.

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • Thank you for your responses and considering this issue. I see that you are recognizing the issue as I described. 

    Yes, Tibetan script is a bit unusual; however, there is a technical fix to the software that would resolve this: have all searches read the tsek character as equivalent to a space character in any Western language. In my second screenshot above, the three sentences are replicated in the most common Roman transliteration scheme (known as Wylie transliteration), in which each tsek character is read as a space.

    So for example, for the word ཞི་མི་ ("cat"), the Wylie equivalent would be "zhi mi". However, in my example above, the Wylie transliteration was not registered in the term recognition either. As a test, I reproduced the same scenario with a Tibetan source, but labeled the Tibetan source as French and set the termbase as French to English. In this case the term recognition did get a hit for the Wylie transliteration, although not for the Tibetan script:

    Screenshot of Trados Studio interface showing no translation results or automated translation servers. Termbase Viewer displays 'zhi mi' with 'Add Term' options for French and English.

    So it appears there are two issues here:

    (1) The tsek character (the little dot between Tibetan syllables) is not recognized as a word boundary.

    (2) If the source is labeled as Tibetan, then the spaces in Roman transliteration are not recognized either (or perhaps it's an issue of Roman characters)
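    To illustrate issue (1) outside of Trados, here is a minimal Python sketch. It assumes nothing about how Studio or MultiTerm actually tokenize; it only shows that a plain whitespace-based split keeps the Tibetan word as one unbroken token, while the Wylie form splits into two. The helper names `whitespace_tokens` and `describe` are made up for this example.

```python
import unicodedata

TSHEG = "\u0f0b"  # the dot between Tibetan syllables

# The term from this thread, written with escapes so the code points are explicit:
# ZHA + vowel I + tsheg + MA + vowel I + tsheg  ("zhi mi" in Wylie)
term = "\u0f5e\u0f72\u0f0b\u0f58\u0f72\u0f0b"

def whitespace_tokens(text):
    """Split on whitespace only, as a naive Western-language tokenizer would."""
    return text.split()

def describe(text):
    """List each character with its code point and Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

print(whitespace_tokens("zhi mi"))   # two tokens
print(whitespace_tokens(term))       # one token: the tsheg is not whitespace
print(TSHEG.isspace())               # False
for line in describe(term):
    print(line)
```

    The Unicode name printed for U+0F0B is the same "TIBETAN MARK INTERSYLLABIC TSHEG" quoted in the release notes, so this is the character any fix would need to treat as a delimiter.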

    In 2019 I posted about this, but at the time I raised it with regard to TM fuzzy matching. According to the Trados Studio 2022 SR1 release notes, that part was corrected:

    "*Fixed an issue where unicode "TIBETAN MARK INTERSYLLABIC TSHEG" character would not be recognized as a word delimiter. (CRQ-15202)"

    And indeed I confirmed that the TM fuzzy matching now works very well with ver. 17.1+.

    However, as this case shows, recognizing the tsek as a word boundary in term recognition still seems to be an issue.

    I would be happy to send any samples to your tech support team if that would be helpful. Or answer any questions about the Tibetan script.

    In general I think term recognition could be 95% resolved, and in my opinion made pragmatically useful and functional, if it simply recognized the tsek as a word boundary equivalent to a space. It should perhaps also be noted that the occasional vertical line punctuation "།", called a "shad", should be recognized as equivalent to a Western period or comma. It's certainly possible to go beyond this, and I could point out Git repositories with Tibetan word tokenizers, but a simple fix to the way these two characters are processed would make term recognition pragmatically functional for Tibetan translation.  

    For the project I'm involved with I believe this term recognition would be key, as we organize a few dozen translators and are looking into ways of making our terminology more consistent. We have a compiled glossary in a spreadsheet and various other formats, so a termbase that could functionally highlight instances in a Tibetan Unicode source would be a useful resource for our project.
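    As a rough sketch of the simple fix I have in mind (illustrative Python only, not Trados or MultiTerm code): treat the shad as a clause break, treat the tsek like a space within each clause, and then match termbase entries as runs of syllables. The function names, the dictionary-style termbase and the sample sentence are all assumptions made up for this example, and the sample sentence is just a test string rather than vetted Tibetan.

```python
import re

TSHEG = "\u0f0b"  # syllable delimiter, treated here like a space
SHAD = "\u0f0d"   # vertical stroke, treated here like a period or comma

# Within a clause, split on tsheg and on ordinary whitespace (so Wylie works too).
_DELIMS = re.compile(f"[{TSHEG}\\s]+")

def syllables(text):
    """Return the non-empty syllables of one clause."""
    return [tok for tok in _DELIMS.split(text) if tok]

def find_terms(source, termbase):
    """Return (clause_index, syllable_index, term, target) for every hit.

    Terms are matched as consecutive syllables and never across a shad.
    """
    hits = []
    for c, clause in enumerate(source.split(SHAD)):
        src = syllables(clause)
        for term, target in termbase.items():
            key = syllables(term)
            n = len(key)
            if n == 0:
                continue
            for i in range(len(src) - n + 1):
                if src[i:i + n] == key:
                    hits.append((c, i, term, target))
    return sorted(hits)

term = "\u0f5e\u0f72\u0f0b\u0f58\u0f72\u0f0b"  # the "zhi mi" entry from this thread
# Made-up test string: "nga la zhi mi yod" followed by a shad.
source = "\u0f44\u0f0b\u0f63\u0f0b" + term + "\u0f61\u0f7c\u0f51" + SHAD

print(find_terms(source, {term: "cat"}))
print(find_terms("nga la zhi mi yod", {"zhi mi": "cat"}))  # [(0, 2, 'zhi mi', 'cat')]
```

    In this sketch the term is found in the middle of the clause, in both Tibetan script and Wylie, which is the behaviour I would hope for from term recognition. A real word tokenizer would do better, since Tibetan words can span several syllables, but syllable-run matching against termbase entries already covers the common case.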

    Anyways, thank you again for your attention and prompt reply. If there is anything I can send for report or investigation to your tech support please let me know.  

  •  

    If there is anything I can send for report or investigation to your tech support please let me know.  

    What would be really helpful is the following:

    1. a short translated SDLXLIFF... Tibetan to any European language
    2. a small termbase that should, in theory, pick up terms in Tibetan

    It's very hard to find any resource that can translate into Tibetan for us to test.  You can send this to pfilkin at sdl dotcom

    Also, you might be interested to have a play with this plugin:

    https://appstore.rws.com/Plugin/59

    This does a better job than MultiTerm for term recognition in Tibetan... although it is also not perfect and definitely needs work.  But we might be able to do some work on this plugin to better support this language... it's also open source, in case you have access to developers who could do this too?

    So for example:

    Screenshot showing Tibetan getting picked up with the Term Excelerator plugin.

    1. Cat is recognised, but only when it's at the start of the sentence.  But even this is an improvement over MultiTerm.
    2. Cat being picked up in the term recognition window
    3. The Term Excelerator Termbase Viewer (this is also editable so quite neat...)

    It also gets me this when I add spaces to where I think the words end:

    Screenshot showing two terms recognised

    Here I get two terms recognised.  So this makes me think that we might be able to do some work on this plugin to recognise the "tsek" and any other important markers and improve the ability to use Tibetan.  This could then serve as a useful proof of concept that "might" transpose to MultiTerm, or at least provide the dev team with a solution.

    Paul Filkin | RWS Group
