What is the technology behind upLIFT?

I thought it would be helpful to kick off this thread, as we are seeing a few questions in different places about upLIFT and whether this is the same as Lift, which was the basis of this technology when it was first introduced.

So my first question would be: how many TUs are needed in your Translation Memory to be able to upgrade it for full upLIFT capability with fragment matching and fuzzy match repair? I read that Lift could do this from a very small number of TUs, yet upLIFT does seem to require a bigger starting point.

What questions do you have?

Regards

Paul

Paul Filkin | RWS Group


  • Thank you, Paul, for asking the question. I was wondering whether anything has changed here, because from what I understood, Lift was able to find subsegment matches via linguistic analysis, which rested heavily on the use of bilingual dictionaries.
    upLIFT, however, seems to use a statistical approach, i.e. there must be a considerable number of TUs to produce reasonable results.

    As I understand it, each sentence in which a phrase appears serves as a sort of coordinate, helping to locate the appropriate translation in the TM. Hence, the more "coordinates", the more precise the results.
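
    A toy sketch of that intuition (purely illustrative Python, not upLIFT's actual algorithm): every TU containing a source phrase is one more "coordinate", and the target phrase that keeps co-occurring with it stands out.

        from collections import Counter

        tus = [
            ("high voltage cable", "cable de alta tensión"),
            ("danger: high voltage", "peligro: alta tensión"),
            ("a high voltage test", "una prueba de alta tensión"),
        ]

        phrase = "high voltage"
        cooc = Counter()
        for src, tgt in tus:
            if phrase in src:                    # each TU containing the phrase
                words = tgt.split()              # is one more "coordinate"
                for i in range(len(words) - 1):
                    cooc[" ".join(words[i:i + 2])] += 1   # count target bigrams

        print(cooc.most_common(2))   # "alta tensión" co-occurs in all three TUs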

    With the linguistic method the "coordinates" are different. They are mainly bilingual dictionaries and other corpora that tell the software that "this phrase seems to be the most probable translation candidate". Do I understand it correctly?

    It's only my guess, but it looks like upLIFT is, in this way, an upgraded version of AutoSuggest Dictionaries, because now the phrases are added on the fly and you have a separate window where you can see the search results, which helps a lot. However, it's not the same Lift as the one from the YouTube presentation and the one described by Kevin Flanagan on the ProZ forum, is it?

  • It's true that the differences between Lift and upLIFT merit some discussion. First of all, it's helpful to distinguish between the two types of fragment match that Studio exposes settings for: 'Whole TU' fragment match and 'TU fragment' fragment match. The first kind is illustrated by Emma's "Electroforesis capilar" example at signsandsymptomsoftranslation.com/.../ (you have "Electroforesis capilar" in the TM as a single segment, and get a translation for it later as part of a larger segment), while the second kind is illustrated by her "acerca de los riesgos y beneficios" example (you have that in the TM embedded in a longer segment, but still get a translation for it later as part of another longer segment). I've previously described these as TM-TDB and DTA subsegment matches, respectively (e.g. www.kftrans.co.uk/.../FillingInTheGaps.pdf).
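
    A toy sketch of the distinction (invented data and structures, purely illustrative, not Studio's internals):

        # Each TU: (source segment, target segment, fine-grained alignment of
        # source spans to target spans - hand-written here for illustration).
        tm = [
            ("Capillary electrophoresis",
             "Electroforesis capilar",
             {}),
            ("Tell patients about the risks and benefits of treatment",
             "Informe a los pacientes acerca de los riesgos y beneficios del tratamiento",
             {"about the risks and benefits": "acerca de los riesgos y beneficios"}),
        ]

        def fragment_matches(new_segment):
            for source, target, alignment in tm:
                if source in new_segment:            # 'Whole TU' fragment match
                    yield ("Whole TU", source, target)
                for src_frag, tgt_frag in alignment.items():
                    if src_frag in new_segment:      # 'TU fragment' match via alignment
                        yield ("TU fragment", src_frag, tgt_frag)

        segment = "Capillary electrophoresis tells us little about the risks and benefits"
        for kind, src, tgt in fragment_matches(segment):
            print(f"{kind}: '{src}' -> '{tgt}'")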

    With upLIFT, 'Whole TU' fragment matches will work regardless of the size of your TM. For 'TU fragment' matches, the translations are retrieved using fine-grained alignment of the TU content. The Lift prototype did perform fine-grained alignment of even a tiny TM (e.g. 1 TU), essentially by using external bilingual electronic dictionaries, and lemmatizers for each language. That worked pretty well, but to get that kind of functionality into SDL software, we've built a better approach. It turns out alignment results are better if you build translation models from big parallel corpora, then make an aligner use those instead of electronic dictionaries. We've done that, and we've got an aligner that can work like Lift did, only for more language pairs (all the pairs for which we offer MT). To begin with, we plan to provide that alignment functionality as a service. If you connect a TM to it, then you get fine-grained alignment from the very first TU you add. It's likely we'll have that for cloud- or server-based TMs first. More information in the coming months ...
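
    A minimal sketch of the general technique (classic IBM Model 1 estimation in Python; purely illustrative, not SDL's actual models or aligner): build a word-translation table from a parallel corpus with EM, then use it to align the words of a TU.

        from collections import defaultdict

        corpus = [
            ("capillary electrophoresis", "electroforesis capilar"),
            ("capillary tube", "tubo capilar"),
            ("electrophoresis gel", "gel de electroforesis"),
        ]
        pairs = [(s.split(), t.split()) for s, t in corpus]

        t = defaultdict(lambda: 1.0)      # t(target word | source word), uniform start
        for _ in range(10):               # EM iterations
            count = defaultdict(float)
            total = defaultdict(float)
            for src, tgt in pairs:
                for tw in tgt:
                    norm = sum(t[(sw, tw)] for sw in src)
                    for sw in src:
                        c = t[(sw, tw)] / norm       # expected co-occurrence count
                        count[(sw, tw)] += c
                        total[sw] += c
            for key, c in count.items():
                t[key] = c / total[key[0]]

        # Align each source word of the first TU to its likeliest target word.
        src, tgt = pairs[0]
        for sw in src:
            print(sw, "->", max(tgt, key=lambda tw: t[(sw, tw)]))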

    So, why release upLIFT in its current form, using an aligner that builds a local translation model for alignment, and so needs a TM of a certain size? Mainly because it's still a great leap forward (fast, in-context fragment recall from your 'live' TM) and provides the new functionality now, regardless of language pair (though we have improved Chinese and Japanese support to be released soon). Also, fine-grained alignment is only part of the story; fragment recall also requires considerable TM engineering, which is included in this release and paves the way for future features.

    For now, then, you get full upLIFT capability if you have enough data in the TM to build the translation model (recommended 5,000 TU minimum, though you can try it with as few as 1,000 TUs). I'm hoping Studio users will be pleased with the progress, and will also keep telling us what could be better ...

  • Hi Kevin,

    Thanks for the explanation, and I'm glad to see the mention of "Chinese and Japanese support to be released soon". I can't wait to try it out with my language pair!

    I had one question about the numbers:

    "recommended 5,000 TU minimum, though you can try it with as few as 1,000 TUs"

    - I assume 1,000 TUs is hard-coded in the system?

    - Is 5,000 just a rough estimate? I was wondering why it was chosen.

    Thanks again,

    Jesse

  • Hi SDL,

    Do you know if the GroupShare API will be extended to perform the three steps to prepare a server TM for fragment analysis? As you may well understand, “setting aside a late night or weekend” just won't do the trick when you literally have several thousand TMs. This needs to be done programmatically, of course.

    Andreas
  • Hi Andreas,

    For GroupShare, the tasks to prepare a TM will be automated using Background Tasks. Note: upLIFT for server-based TMs will not be in the initial GS 2017 release but will follow in a Service Release.

    Thanks,
    Luis


  • Hi, Jesse - yes, the 1,000-TU minimum is currently a hard-coded figure. Neither it nor the 5,000-TU recommendation is totally arbitrary: you can arrive at numbers like these by starting with a big TM, building a model for it, aligning the TM, and then doing the same for successively smaller subsets of that TM and comparing the alignment results (which you can largely automate, though it needs some manual inspection too). If you do that for several language pairs and draw a graph of the alignment accuracy, it varies between language pairs, but generally speaking accuracy drops off rapidly under 1,000 TUs, can be acceptable between 1,000 and 5,000, and tends to begin flattening out around 5,000. Of course, it's also very dependent on the nature of the content (word distributions, degree of repetition, etc.).
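
    A hypothetical sketch of that experiment loop (the three helpers are stand-ins for internal tooling, stubbed out here only so the sketch runs; they are not a real SDL API):

        def build_model(tus):
            return {"trained_on": len(tus)}        # stand-in: train a local translation model

        def align_tm(tus, model):
            return [(tu, []) for tu in tus]        # stand-in: fine-grained alignment links

        def accuracy(alignments, gold):
            return 0.0                             # stand-in: compare links against gold

        tm = [f"TU {i}" for i in range(20_000)]    # one big TM with known-good alignments
        gold = {}                                  # gold-standard alignments for it

        for n in (500, 1_000, 2_000, 5_000, 10_000, 20_000):
            subset = tm[:n]                        # successively smaller subsets of the TM
            acc = accuracy(align_tm(subset, build_model(subset)), gold)
            print(f"{n:>6} TUs -> alignment accuracy {acc:.2%}")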

    Kevin
  • Now that we know upLIFT differs from the original Lift, I'd like to add a suggestion for a future release that would stand somewhere between the present statistical extraction of fragment matches and the oh-so-much-desired linguistic one that produces results even with just one TU in your TM.
    Could we have an option to also retrieve fragment matches WITHOUT their translations in new TMs? In this way we would be shown that there are such fragments in our TM and could find the translations ourselves. Yes, I know that this looks similar to auto-concordance, but the thing is that Studio's concordance in its present incarnation produces too much noise.
  • Hi Wojciech,

    I believe your suggestion is an enhanced version of Studio's existing concordance search option.
    However, instead of a "fuzzy" search, you would use the fragment aligner first to identify fragments, and then look up the source segments that contain those fragments to display in the search results window.
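
    A rough sketch of what that could look like (hypothetical code; identify_fragments stands in for the fragment aligner, and the trigram generator below is just a toy stand-in so the demo runs):

        def fragment_concordance(new_segment, tm_sources, identify_fragments):
            hits = []
            for frag in identify_fragments(new_segment):   # aligner-proposed fragments
                for seg in tm_sources:
                    if frag in seg:                        # fragment found in a TM segment
                        hits.append((frag, seg))           # show the segment, no translation
            return hits

        def trigrams(segment):                             # toy stand-in "aligner"
            words = segment.split()
            return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

        tm_sources = ["Tell patients about the risks and benefits of treatment"]
        print(fragment_concordance("Ask about the risks and benefits first",
                                   tm_sources, trigrams))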

    That said, the whole point of fragment recall, I think, is to eliminate having to look up the translation by hand, so I'm not sure why you would prefer doing that?

  • Hi Jesse,

    As I explained, Studio's concordance produces way too much noise and is simply useless on most occasions.
    Yes, I know the main purpose of fragment recall, but since the full feature works only with TMs of at least 1,000 TUs, a version without translations could still prove useful on smaller TMs, by signalling to the translator that the fragment he is translating exists in the TM, just without providing the translation.
    Concordance often fails here, and I happen to miss such fragments, especially in larger TUs. Instead, concordance is often more likely to assemble a 65% hit out of prepositions and articles than to find a useful three- or four-word phrase embedded in a large TU (which it scores at only 30%, say). Where is the sense in that?