What is the technology behind upLIFT?

I thought it would be helpful to kick off this thread, as we are seeing a few questions in different places about upLIFT and whether it is the same as Lift, which was the basis of this technology when it was first introduced.

So my first question would be: how many TUs are needed in your Translation Memory to be able to upgrade it for full upLIFT capability, with fragment matching and fuzzy match repair? I read that Lift could do this from a very small number of TUs, yet upLIFT does seem to require a bigger starting point.

What questions do you have?

Regards

Paul

Paul Filkin | RWS Group

________________________
Design your own training!

You've done the courses and still need to go a little further, or still not clear? 
Tell us what you need in our Community Solutions Hub

Parents
  • It's true that the differences between Lift and upLIFT merit some discussion. First of all, it's helpful to distinguish between the two types of fragment match that Studio exposes settings for: 'Whole TU' fragment match, and 'TU fragment' fragment match. The first kind is illustrated by Emma's "Electroforesis capilar" example at signsandsymptomsoftranslation.com/.../ (you have "Electroforesis capilar" in the TM as a single segment, and get a translation for it later as part of a larger segment), while the second kind is illustrated by her "acerca de los riesgos y beneficios" example (you have that in the TM embedded in a longer segment, but still get a translation for it later as part of another longer segment). I've previously described these as TM-TDB and DTA subsegment matches, respectively (e.g. www.kftrans.co.uk/.../FillingInTheGaps.pdf).
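    The distinction between the two match types can be sketched in a few lines of Python. This is a toy illustration of the concepts only, not SDL's implementation, and the TM content is invented:

```python
# Hypothetical EN->ES TM with two TUs.
tm = [
    ("Capillary electrophoresis", "Electroforesis capilar"),
    ("Ask your doctor about the risks and benefits of the treatment",
     "Pregunte a su médico acerca de los riesgos y beneficios del tratamiento"),
]

def whole_tu_matches(segment, tm):
    """'Whole TU' match: an entire TU's source text appears inside the new segment."""
    return [(src, tgt) for src, tgt in tm if src.lower() in segment.lower()]

print(whole_tu_matches("Capillary electrophoresis was used in the assay.", tm))
# -> [('Capillary electrophoresis', 'Electroforesis capilar')]

# A 'TU fragment' match would instead pull "acerca de los riesgos y
# beneficios" out of the *interior* of the second TU -- which requires
# fine-grained alignment to know which target words translate which
# source words, rather than a simple substring test.
```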

    With upLIFT, 'Whole TU' fragment matches will work regardless of the size of your TM. For 'TU fragment' matches, the translations are retrieved using fine-grained alignment of the TU content. The Lift prototype did perform fine-grained alignment of even a tiny TM (e.g. 1 TU), essentially by using external bilingual electronic dictionaries and lemmatizers for each language. That worked pretty well, but to get that kind of functionality into SDL software, we’ve built a better approach. It turns out alignment results are better if you build translation models from big parallel corpora, then make an aligner use those instead of electronic dictionaries. We’ve done that, and we’ve got an aligner that can work like Lift did, only for more language pairs (all the pairs for which we offer MT). To begin with, we plan to provide that alignment functionality as a service. If you connect a TM to it, you get fine-grained alignment from the very first TU you add. It’s likely we’ll have that for cloud- or server-based TMs first. More information in the coming months ...
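    The idea of building a translation model from parallel data and letting an aligner use it can be sketched with a toy co-occurrence model. This is nothing like the real upLIFT aligner, which is far more sophisticated; the corpus and scoring are invented for illustration:

```python
from collections import Counter

# Tiny invented EN->FR parallel corpus.
corpus = [
    ("the black cat", "le chat noir"),
    ("the black dog", "le chien noir"),
    ("the white cat", "le chat blanc"),
]

# Count how often each source word co-occurs with each target word.
cooc = Counter()
tgt_count = Counter()
for src, tgt in corpus:
    for t in tgt.split():
        tgt_count[t] += 1
        for s in src.split():
            cooc[(s, t)] += 1

def align(src_sentence, tgt_sentence):
    """Greedily link each source word to the target word it most strongly
    co-occurs with (a very crude stand-in for a real translation model)."""
    links = {}
    for s in src_sentence.split():
        links[s] = max(tgt_sentence.split(),
                       key=lambda t: cooc[(s, t)] / tgt_count[t])
    return links

print(align("the black cat", "le chat noir"))
# -> {'the': 'le', 'black': 'noir', 'cat': 'chat'}
```

    Once you have word-level links like these for every TU, you can recover a translation for a fragment embedded inside a longer TU, which is what 'TU fragment' recall needs.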

    So, why release upLIFT in its current form, using an aligner that builds a local translation model for alignment and so needs a TM of a certain size? Mainly because it's still a great leap forward (fast, in-context fragment recall from your 'live' TM) and provides the new functionality now, regardless of language pair (though we have improved Chinese and Japanese support to be released soon). Also, fine-grained alignment is only part of the story; fragment recall also requires considerable TM engineering, which is included in this release and paves the way for future features.

    For now, then, you get full upLIFT capability if you have enough data in the TM to build the translation model (recommended 5,000 TU minimum, though you can try it with as few as 1,000 TUs). I'm hoping Studio users will be pleased with the progress, and will also keep telling us what could be better ...

Children
  • Hi Kevin,

    Thanks for the explanation, and I'm glad to see the mention of "Chinese and Japanese support to be released soon". I can't wait to try it out with my language pair!

    I had one question about the numbers:

    recommended 5,000 TU minimum, though you can try it with as few as 1,000 TUs

    - I assume the 1,000-TU minimum is hard-coded in the system?

    - Is 5000 just a rough estimate? I was wondering why it was chosen.

    Thanks again,

    Jesse

  • Hi, Jesse - yes, the 1,000-TU minimum is currently a hard-coded figure. Neither it nor the 5,000-TU recommendation is totally arbitrary: you can arrive at numbers like these by starting with a big TM, building a model for it, aligning the TM, and then doing the same for successively smaller subsets of that TM and comparing the alignment results (which you can largely automate, though it needs some manual inspection too). If you do that for several language pairs and draw a graph of the alignment accuracy, it varies between language pairs, but generally speaking accuracy drops off rapidly under 1,000 TUs, can be acceptable between 1,000 and 5,000, and tends to begin flattening out around 5,000. Of course, it's also very dependent on the nature of the content (word distributions, degree of repetition, etc.).
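    The experiment described above can be sketched as a loop over progressively smaller TM subsets. Here `build_model`, `align_tm` and `score` are hypothetical placeholders for the actual tooling, not real SDL APIs:

```python
import random

def subset_accuracy(tm_units, sizes, build_model, align_tm, score):
    """Align successively smaller random subsets of a TM and record how
    alignment quality changes with TM size."""
    random.seed(0)                       # reproducible subsets
    results = {}
    for n in sorted(sizes, reverse=True):
        subset = random.sample(tm_units, n)
        model = build_model(subset)      # translation model from this subset only
        links = align_tm(subset, model)  # fine-grained alignment of each TU
        results[n] = score(links)        # compare against a reference alignment
    return results

# e.g. subset_accuracy(tm, [20000, 5000, 1000, 500], ...) would typically
# show accuracy flattening out above ~5,000 TUs and dropping sharply
# below ~1,000, with the exact curve varying by language pair and content.
```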

    Kevin
  • Now that we know upLIFT differs from the original Lift, I'd like to add a suggestion for a future release that would sit somewhere between the present statistical extraction of fragment matches and the oh-so-much-desired linguistic one that produces results even with just one TU in your TM.
    Could we have an option to also retrieve fragment matches WITHOUT their translations in new TMs? That way we would be shown that such fragments exist in our TM, and we could find the translation ourselves. Yes, I know this looks similar to auto-concordance, but the thing is that Studio's concordance in its present incarnation produces too much noise.
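    For what it's worth, the source-only signalling described here could work along these lines. This is a toy n-gram sketch, not how Studio's concordance or upLIFT actually behave, and the example strings are invented:

```python
def ngrams(words, n):
    """All contiguous n-word phrases in a word list."""
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def known_fragments(segment, tm_sources, min_words=3, max_words=6):
    """Return fragments of `segment` (3-6 words) seen verbatim in the TM
    source text -- no translation offered, just a signal to the translator."""
    seg_words = segment.lower().split()
    hits = set()
    for n in range(min_words, max_words + 1):
        for frag in ngrams(seg_words, n):
            if any(frag in src.lower() for src in tm_sources):
                hits.add(frag)
    return hits

tm_sources = ["The party shall give written notice without undue delay."]
seg = "Each supplier must give written notice before termination."
print(known_fragments(seg, tm_sources))
# -> {'give written notice'}
```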
  • Hi Wojciech,

    I believe your suggestion is an enhanced version of the following concordance search option.
    However, instead of a "fuzzy" search, you would use the fragment aligner to first identify fragments and then look up the source segments that contain those fragments to display in the search results window.

    However, I think the whole point of fragment recall is to eliminate having to look up the translation by hand, so I'm not sure why you would prefer doing that?

  • Hi Jesse,

    As I explained, Studio's concordance produces way too much noise and is simply useless on most occasions.
    Yes, I know the main purpose of fragment recall, but while it's only useful with TMs of at least 1,000 TUs, it could also prove useful on smaller TMs by signalling to the translator that the fragment they are translating exists in the TM, just without providing a translation.
    Concordance often fails here. I happen to miss such fragments, especially in larger TUs. Concordance is often more likely to assemble a 65% hit out of prepositions and articles than to find a useful three- or four-word phrase embedded in a large TU (which it calculates to be, e.g., only 30%). Where is the sense in that?

  • Hi Wojciech,

    I agree that the current concordance search is not that useful and your idea would help when you have empty/small TMs.

    However, Kevin Flanagan already mentioned they are coming out with a solution to use fragment recall starting from one TU:

    To begin with, we plan to provide that alignment functionality as a service. If you connect a TM to it, then you get fine-grained alignment from the very first TU you add

    Also, I wonder how often translators work with empty/small TMs. I would think that the majority of the time you are working with TMs that already have the minimum required number of TUs.

  • Well, I suppose it all comes down to how you prepare your project. I always have one "big mama" TM that doesn't get updated on the fly but serves as reference. For each project I then create an empty TM that does get updated. After finishing the project, I export the TM to the excellent Heartsome TMX Editor to get rid of repetitions etc. Then I export it into both of my "big mama" TMs: ENG-POL and POL-ENG.
    I believe that is the most reasonable way of working if you don't want to clutter your large TM and want to keep it in perfect condition.

    Now, since I use a new TM for each new project, this means that all the useful phrases I may use SPECIFICALLY in that project won't be found by fragment recall, because the TM is too small. Imagine a translation of a specific contract where one phrase appears frequently but is non-existent in my big mama. Fragment recall won't help me in this case. That's why I suggested the above. I've seen this working in memoQ, and it really worked.

  • I'm still not entirely clear on this - which may be due to the limits of my own understanding of these processes, but I'll ask my question anyway.

    What does upgrading the TM do? It would seem to do more than add new segments based on found fragments, because if that were all, then... well, if you have a TM with 3,000 TUs, upgrade it, and then add another 2,000 TUs, presumably an upgrade of the now-5,000-TU TM would give a better result. If upgrading the 3,000-TU TM and then adding 2,000 TUs gives the same result as upgrading a 5,000-TU TM, then there's something going on beyond the analysis of the TUs in the TM at the time of upgrade.

    I am presuming that adding TUs to an already-upgraded TM is as good as upgrading the whole, now-larger TM; otherwise, you'd give an option to repeatedly upgrade the TM again and again. And if that presumption is correct, then - how is it working? How are the TUs added post-upgrade retrospectively incorporated into the analysis that provides (non-whole-TU) fragment results?

  • otherwise, you'd give an option to repeatedly upgrade the TM again and again. And if that presumption is correct, then - how is it working?

    You do have an option to do this. If you only ever work interactively you don't need to upgrade again and again, as the upLIFT process works automatically as you keep adding TUs. But if you import another 2,000 TUs you do need to run the upgrade process again:

    You can see here that I don't need to run it again.  But if I were to import a TMX to this TM then it would tell me the number of unaligned TUs and I would run it again.
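    My reading of this behaviour, sketched in Python purely as an illustration of the workflow (not SDL code): interactively added TUs are aligned straight away, while bulk imports just raise an 'unaligned' count until the upgrade is run again:

```python
class UpliftTM:
    """Toy model of a TM with upLIFT-style alignment bookkeeping."""

    def __init__(self):
        self.units = []      # (source, target, aligned?) triples
        self.unaligned = 0   # TUs waiting for fine-grained alignment

    def add_interactive(self, src, tgt):
        # Added while translating: aligned on the fly with the current model.
        self.units.append((src, tgt, True))

    def import_tmx(self, pairs):
        # Bulk import: stored, but fine-grained alignment is deferred.
        self.units.extend((s, t, False) for s, t in pairs)
        self.unaligned += len(pairs)

    def run_upgrade(self):
        # Rebuild the translation model from all TUs, then align the backlog.
        self.units = [(s, t, True) for s, t, _ in self.units]
        self.unaligned = 0

tm = UpliftTM()
tm.add_interactive("good morning", "bonjour")
tm.import_tmx([("thank you", "merci"), ("goodbye", "au revoir")])
print(tm.unaligned)   # -> 2  (the imported TUs need the upgrade re-run)
tm.run_upgrade()
print(tm.unaligned)   # -> 0
```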

    Paul Filkin | RWS Group


  • That makes sense.

    The part I think I need to understand better is how "the upLIFT process works automatically." I have an upgraded TM. If I add new segments to it as I go, wouldn't it have to rework the whole thing with each newly added segment? I thought adding new segments to a TM was a pretty cut-and-dried thing - add a TU, and you get... one new TU. Is upLIFT a process running in the background whenever Trados is running?

    At least I'm pretty sure of the practical side now - when you can/should re-upgrade, when you don't have to...

    UPD: or is it that adding new TUs is just adding more TUs, and they're not being "upLIFTed" "in real time," so to speak... and the program sees that you need a critical mass of TUs to make upgrading again worth the effort, so it waits for the TUs to build up?