Improving fuzzy matches

I'm wondering if I could address the community and staff about some issues I am having with fuzzy matching in the editor.
I'm actually migrating here from smartCAT, a CAT platform whose rigid word processing ultimately became a deal breaker for the kind of academic work I want to do. I am still on the TRADOS 30-day trial and seeing how it compares. So far I am very happy with the additional flexibility of TRADOS, and it is fulfilling all my word processing needs (making academic annotations was key here, which I was able to finagle by exporting comments in the .docx and then converting them into endnotes with a macro).
However, now I am comparing the fuzzy matching capabilities of smartCAT and TRADOS, and I'm not really impressed with the initial statistics I'm getting. Pasted below are two screenshots of the statistics reports from smartCAT and TRADOS. Both were run with the exact same file, segmentation, and TM bank (I currently have .tmx files from 32 projects), and in both projects I set the minimum fuzzy match value to 50%:
smartCAT (freelance account 5/4/2019) :
Screenshot of Trados Studio showing Total Statistics with categories like New, 50%-74%, 75%-84%, and 100% matches, along with numbers of characters, segments, and pages.
SDL Trados 2019 SDR1 - 15.1.2.48878
Screenshot of Trados Studio settings and totals for fuzzy matching, including settings like Minimum Match Value and Fragment Matching Options, followed by a table with types like PerfectMatch, Context Match, and Repetitions.
As you can see, TRADOS is registering 50%+ matches, but not nearly as many as smartCAT. This isn't looking very good for TRADOS: for me, the fuzzy matching capabilities are the essential value of using any CAT tool in the first place. But I'm new to the platform, and I know there are a lot more options and variables in TRADOS that I'm not familiar with, so I'm wondering if there is a way to boost these stats. Specifically:
(1) Is there a way I can adjust my settings and variables to get more matches? (Both are set to 50%, so I know it's not that.)

(2) Could there be a particular issue with the Tibetan script that is giving me a low match rate? Tibetan is an uncommon digital language, so I'm used to seeing a lack of support for the script across various platforms.

(3) Are there any updates or add-ins that could augment these fuzzy match capabilities, particularly if this is an issue specific to the Tibetan script? I tried updating to 2019 SDR1 - 15.1.2.48878 and updating all the TMs, but I'm getting the same stats as before.
Here is one example of why I'm not getting as good matches in TRADOS as in smartCAT:
This is a phrase in my test document:
བཅོམ་ལྡན་འདས་རྒྱལ་པོའི་ཁབ་ན་བྱ་རྒོད་ཀྱི་ཕུང་པོའི་རི་ལ་དགེ་སློང་སྟོང་ཉིས་བརྒྱ་ལྔ་བཅུའི་དགེ་སློང་གི་དགེ་འདུན་ཆེན་པོ་དང་ཐབས་ཅིག་ཏུ་བཞུགས་ཏེ།
This is a very close phrase in one of my TMs:
བཅོམ་ལྡན་འདས་རྒྱལ་པོའི་ཁབ་ན་བྱ་རྒོད་ཕུང་པོའི་རི་ལ་དགེ་སློང་སྟོང་ཉིས་བརྒྱ་ལྔ་བཅུའི་དགེ་སློང་གི་དགེ་འདུན་ཆེན་པོ་དང་། བྱང་ཆུབ་ སེམས་དཔའ་ཁྲི་དང་ཐབས་ཅིག་ཏུ་བཞུགས་ཏེ།
In smartCAT this gets an 84% hit, but it is completely missed by TRADOS and doesn't even register as a 50% match. Looking through all my TM matches in TRADOS, it only seems to hit when the match is a continuous string with no breaks in the middle; strings with more than one variable substring, like the one above, are not registering. From examples I've seen, it seems TRADOS would catch strings with more than one variable substring if the text were Spanish, but perhaps because Tibetan doesn't put spaces between words (it only places dots between syllables) the results aren't so great. Is there any way to improve this?
Thanks so far to all the staff and community, who have been very patient in answering my questions. I would love to make this work, as everything else in TRADOS is really excellent. I would just like to mention that I'm also reviewing the platform for an academic organization I work for that focuses on Tib-Eng translation. If I can confirm a good setup here, I'll be recommending it to several dozen other translators, but this low match rate could really be a deal breaker for me.
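To make the spacing hypothesis concrete, here is a minimal Python sketch. It is not how TRADOS or smartCAT actually compute match scores (their algorithms are not described in this thread); it only assumes a generic token-based similarity, using the standard library's `difflib`, and compares what happens when the same two segments are tokenized on whitespace versus on the Tibetan syllable dot (tsheg, U+0F0B). The function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def space_tokens(text):
    """Tokenize on whitespace only, as one might for Spanish or English."""
    return text.split()


def syllable_tokens(text):
    """Tokenize on the Tibetan tsheg (U+0F0B), shad (U+0F0D), and whitespace."""
    return [t for t in re.split(r"[\u0f0b\u0f0d\s]+", text) if t]


def similarity(a_tokens, b_tokens):
    """Generic in-order token overlap from 0.0 to 1.0 (difflib's ratio).

    Assumption: this stands in for a fuzzy-match score; it is not the
    scoring used by any particular CAT tool.
    """
    return SequenceMatcher(None, a_tokens, b_tokens, autojunk=False).ratio()


# The two segments quoted above: the test phrase and the close TM phrase.
source = "བཅོམ་ལྡན་འདས་རྒྱལ་པོའི་ཁབ་ན་བྱ་རྒོད་ཀྱི་ཕུང་པོའི་རི་ལ་དགེ་སློང་སྟོང་ཉིས་བརྒྱ་ལྔ་བཅུའི་དགེ་སློང་གི་དགེ་འདུན་ཆེན་པོ་དང་ཐབས་ཅིག་ཏུ་བཞུགས་ཏེ།"
tm_segment = "བཅོམ་ལྡན་འདས་རྒྱལ་པོའི་ཁབ་ན་བྱ་རྒོད་ཕུང་པོའི་རི་ལ་དགེ་སློང་སྟོང་ཉིས་བརྒྱ་ལྔ་བཅུའི་དགེ་སློང་གི་དགེ་འདུན་ཆེན་པོ་དང་། བྱང་ཆུབ་ སེམས་དཔའ་ཁྲི་དང་ཐབས་ཅིག་ཏུ་བཞུགས་ཏེ།"

space_score = similarity(space_tokens(source), space_tokens(tm_segment))
syllable_score = similarity(syllable_tokens(source), syllable_tokens(tm_segment))

print(f"whitespace tokens: {space_score:.2f}")
print(f"syllable tokens:   {syllable_score:.2f}")
```

Under these assumptions the whitespace-based score should come out far lower than the syllable-based one, because a Tibetan segment with few or no spaces collapses into one or two very long "words" that almost never match exactly. If something similar is happening inside a tool, a tokenizer that treats the tsheg as a word boundary would be the thing to look for.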
Best wishes,
-Celso


[edited by: Trados AI at 4:26 PM (GMT 0) on 28 Feb 2024]
  • Hi

    I am not one of the experts in this forum, nor do I know SmartCAT, but here are some thoughts:

    1) Your TM hits will depend greatly on your source text segmentation.

    2) I found working with the Concordance very important. Phrases like the one you quote above will be found in the Concordance search. What makes this feature even more useful is that you can just mark some source text and run a concordance search with F3. So if you have a text chunk that you think may have been translated before, you can mark it and Studio will pull anything resembling it from the TMs you defined for the project (and tell you where it's from). I found this a powerful feature, especially when working with more complex phrases. I sometimes deal with English translations of Middle High German or Early New High German sources in my source text where another English translation of the same old German source is in the TM, and I found that the Concordance in Studio is able to find that original source text.

    3) When I was researching CAT tools for my organization I came to a similar point where I really liked Trados Studio but there was one thing that looked like the end of the road. I am very glad that I did not let that deter me. Studio is a complex and powerful tool, and it took a while to learn how to handle it in order to get out of it what I want. A while ago I reached the point where everything is pretty good and I know the tool will allow me to improve quite a bit still. (Can't see the ceiling yet.)

    Daniel

  • Thanks, Daniel, for your reply. Yes, I have taken a close look at my segmentation: all my TMs and files to be translated are segmented in a similar way, using Tibetan segmentation at an average rate of 10-20 words, and I have set it up using tab returns to segment the text, which works well with Trados' segmentation rules settings.

    I have tested the concordance search as well. While it does seem to function, I am still getting very limited results. It makes me wonder whether Trados' algorithms are simply not well suited to the Tibetan language for one reason or another (I've noticed this to be a common problem across platforms due to Tibetan's irregular script pattern, its low digital presence, and the consequent lack of attention from programmers).

    To give one example, I know for certain that a very common phrase in my genre, "བཅོམ་ལྡན་འདས་རྒྱལ་པོའི་ཁབ་ན་" ("The Bhagavan was in Rajagriha"), is in 6 of my TMs as that exact string. Yet I get weak results in the concordance search:

    If I search "བཅོམ་ལྡན་འདས་རྒྱལ་པོའི་ཁབ་ན་" I get a hit from only one of the six TMs, so I know it is working to some degree.

    But if I take out the last character, "བཅོམ་ལྡན་འདས་རྒྱལ་པོའི་ཁབ་", the hits drop to zero, even though I am certain there are at least 6 exact string matches in my TM bank.

    In contrast, the smartCAT concordance search hits 5 out of 6 in the first example and all 6 in the second example.

    Anyway, thanks for your comments and suggestions. I haven't given up yet, and I will try some more experimenting to see if there is a way to improve the results.