Is it possible to use and train NMT with only your own TMs locally?

Hi,

I know and use Language Weaver, but is it possible to get some NMT engine trained with your own TMs which is NOT hosted in a cloud but which you can install and use locally or on your own servers and which does not share data with some remote servers/company?

Does anyone have experience with such services? What are the costs for this?

Best regards,

Pascal

  • Hi  

    Yes, it's called Language Weaver Edge.

    While it is not for the everyday sort of user, more information about it can be found here: https://www.rws.com/language-weaver/edge/

    Have a good day

    Lyds

  • Hi , well, it would certainly help if LWE support got back to people when they ask for further information. Right now, due to the lack of information and feedback on my request from support, I still can't tell whether it is what I'm looking for or not.

  •  

    I also got a private message (I wonder why this information was not posted here) suggesting I have a look at Opus-CAT MT.

    Indeed... that is a great tool. I almost mentioned it here, but thought you wanted a Language Weaver solution for this. If by "cloud" you mean anything not on your desktop, then Opus may be the best solution for you anyway. Language Weaver Edge doesn't use the cloud, but it is still server-based, even if you run it on-premise for your own use.

  •  ,

    I didn't ask for Language Weaver explicitly; I only said that I know and use LW, but I was looking for an NMT engine that can be trained with my own TMs AND is not hosted in a cloud, but on my own or a local server. I guess you misread that part. ;)

    I have my own server on which I can install whatever I want, so LWE might still be an option, but in order to decide, I would still need quite a lot of information from support or from someone using it. I'm also open to any other tool that offers NMT and can be customized and trained with my own TMs without using any external online resources other than my own server.

    Regarding LWE, it might also be interesting in another way: if, like LW, it can be paired with a related language for languages with smaller TMs, using customizable "replace x by y" rules (if that feature is even available).

  • Hi  ,

    Language Weaver Edge is our Machine Translation solution that can be deployed on-premise. It supports all features currently available in the Cloud, including the ability to train language pairs with your own translation memories. Language pairs that can be trained are called Adaptable Language Pairs: you can train them manually, i.e. control when the training takes place, upload your training data as *.tmx, define your own test data set for evaluation, and fully control deployment. This currently requires a GPU and Linux.

    In our next release (8.6, to be released in October 2022), we will support:

    • Adaptation on CPU (a GPU will no longer be a requirement; training will take longer, but will be possible). Training on CPU will be possible on both Windows and Linux.
    • Auto-adaptive Language Pairs: with auto-adaptive language pairs, the models are constantly trained with the available data, including the uploaded translation memories, the dictionaries, and the user feedback provided (collected through the LW Edge UI). The training happens in the background, and there is no need to perform manual operations such as deployment of trained models. In Studio, you only reference the auto-adaptive model once in the Language Pair Mapping and you are good to go.

    Feel free to join the Language Weaver Edge section of the community to get more information. We have a few videos highlighting the user feedback process, and there is also a recording of a session from Connect 2019 (a bit old, but still relevant) on how to best adapt NMT models using our on-premise solution.

    I hope this is helpful and please let me know if you need any more information! 

  • Hi ,

    do you know if Opus-CAT MT can be installed on a server or does it need to be installed on the same PC where Trados is installed? I’m using Trados from various locations (Professional network license) and would need to be able to connect to the same dataset from everywhere, so installing it on my own server would be my first choice.

    Br,

    Pascal

  •  

    do you know if Opus-CAT MT can be installed on a server

    I don't know.  I run it locally on my own laptop.  The best person to answer this question would be  .

  • Hi Pascal,

    I'm the developer of OPUS-CAT, so I can fill in some details. First of all, OPUS-CAT can be used to fine-tune base OPUS models (which are available for most language pairs) with your own TMs; this corresponds to LW's Adaptation on CPU option that Arnaud mentions in his reply. It's also possible to train models from scratch, but that would require a GPU and setting up the OPUS-MT training pipeline.

    You can set up OPUS-CAT on a server (although just copying the fine-tuned model to any computer you run Trados on is simple); you just need to configure it to allow incoming connections (by changing a setting), and of course you'll need to make the required exceptions in your firewall etc. However, I have only tested this over the local network. Once OPUS-CAT is running on the server, you can access it from your local Trados by changing the IP address in the Trados plugin settings.
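    Since the Trados plugin just points at an IP address, a quick way to verify a server setup like the one described above is a plain TCP reachability check. A minimal sketch in Python; the host and port in the usage comment are placeholders (use whatever port your OPUS-CAT engine is actually configured to listen on):

```python
import socket

def mt_server_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the MT engine port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, host unreachable, and timeouts.
        return False

# Hypothetical usage -- substitute your server's address and configured port:
# mt_server_reachable("192.168.1.50", 8500)
```

    If this returns False from the machine running Trados, the problem is firewall or network configuration rather than OPUS-CAT itself.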

    In any case, I suggest you give the fine-tuning functionality in OPUS-CAT a try with your data (the setup is very simple, and you can ask for support here or at github.com/.../issues), and see how the output looks. At the very least, it will give you a point of comparison when evaluating similar paid services.

    -Tommi

  • Hi ,

    Ok, I’ll have to check that.

    One of the target languages I would like to use it for is very rare (Luxembourgish), so I guess I'll have to train it myself. Even if a pretrained model already exists, I fear that it relies on the old grammar and spelling from before the language reform and on the only available dictionary, which does not differentiate between slang/colloquial and official speech (as seen in many other MT plugins on the market, e.g. Google, Memsource), so it would not be of much use at all, as it would require too much rewriting and replacing. Most available Luxembourgish solutions are poor: they only seem to do word-by-word translation, use too much French vocabulary instead of the real Luxembourgish terms, and mix in German terms if a word is unknown.

    Training on the server could be a problem, as it does not have a GPU installed, but I'm checking with my hosting provider whether one can be added or whether a migration to another server would be needed. In the worst case, I guess I could do the training on my PC and then upload everything to the server.

    Luxembourgish grammar and terminology are quite similar to German, so I even have some ideas for getting better results, but I don't think any NMT handles that right now.

    I’ll need some more information about terminology and glossary management, though. I didn’t find much about this on the Opus-CAT MT page yet.

    BTW: what are the minimum requirements for the GPU?

    Pascal

  • Hi  ,

    just a quick question regarding the Auto Adaptive Language Pairs in release 8.6:

    Will those who use it in Trados Studio via plugin also be able to provide feedback for the automatic retraining of the models?

    Thanks, 

    Natalie

  • Hi,

    There are OPUS models for eng-ltz and ltz-eng in the OPUS-CAT model repository (they can be installed via the OPUS-CAT UI), but they seem to be mostly trained on crawled data, so the quality is probably fairly bad. I doubt anyone else will have much more ltz data either, though, so unless you have a very big corpus yourself (at least a million segments), good-quality MT is probably not going to be possible.

    Since German is quite similar, you might get the best results by fine-tuning a German model with the Luxembourgish data you have. It might also help to translate monolingual Luxembourgish data with a German model and then use that synthetic data to supplement your training data (using the correct Luxembourgish as target text and the potentially faulty MT output as source text; this method is called backtranslation).
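    The backtranslation idea above can be sketched in a few lines of Python. `translate_to_source` here is a placeholder for whatever MT system produces the synthetic source side (e.g. a ltz→deu model), not a real API:

```python
def build_backtranslation_pairs(monolingual_target, translate_to_source):
    """Build synthetic parallel data from monolingual target-language text.

    The (possibly faulty) MT output becomes the *source* side and the
    human-written sentence the *target* side, so a model trained on these
    pairs learns to produce clean target-language text.
    """
    return [(translate_to_source(sentence), sentence)
            for sentence in monolingual_target]

# Hypothetical usage with a dummy stand-in "translator":
pairs = build_backtranslation_pairs(
    ["Moien Welt", "Äddi"],
    lambda s: "<de> " + s,  # stand-in for a real ltz->deu MT call
)
```

    The resulting pairs are then simply appended to the genuine parallel corpus before fine-tuning.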

    I'm currently working on adding terminology management to OPUS-CAT, although it is already possible to use edit rules to perform string replacements on the machine translation output. Termbase support should be released this year.
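    As a rough illustration of the kind of edit rule mentioned above (this is the general idea only, not OPUS-CAT's actual rule engine): ordered, whole-word term replacements applied to the MT output. The German→Luxembourgish rules in the example are hypothetical.

```python
import re

def apply_edit_rules(mt_output, rules):
    """Apply ordered (term, replacement) rules to MT output.

    \\b anchors make each rule match whole words only, so a term does not
    fire inside a longer word.
    """
    for term, replacement in rules:
        mt_output = re.sub(rf"\b{re.escape(term)}\b", replacement, mt_output)
    return mt_output

# Hypothetical rules adapting German MT output toward Luxembourgish terms:
result = apply_edit_rules("Bitte die Datei speichern",
                          [("Datei", "Fichier"), ("speichern", "späicheren")])
# result == "Bitte die Fichier späicheren"
```

    Rule order matters: later rules see the output of earlier ones, so more specific multi-word terms should come first.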

    As for the minimum requirements for the GPU: since there's not enough data, there probably isn't much point in a GPU, but you would need a relatively high-end card from NVIDIA (I can't recommend a specific one offhand, since I train models on computing clusters rather than locally).

    -Tommi

  • Hi Tommi,

    if they are crawled, then please dump them; they would unfortunately do more harm than good. You can count 10 errors or more per sentence of 15 words, and that's still a generous estimate. I've seen professional translators with even more errors in their translations, and they didn't use MT at that time; I have QAed most of the available Luxembourgish translators so far. The worst I've seen was 15 errors in 10 words by a so-called professional translator, and most crawled texts are not even written by linguists. Of the roughly 40 available LB translators, only 3 or 4 offer quite good quality, and even they are still above the international standard of 3 errors per 1k words. The worst I've seen on the net so far is people not even knowing how to spell their own language correctly, and by this I don't mean one or two typos in a word, but a really awkward spelling that is not even close to the correct word.

    I’ve got a "pretty big" TM for a Luxembourgish translator, but as I translate from 6 source languages and also into my other mother tongues, German and French, my biggest TM for Luxembourgish after 18 years as a translator is still no bigger than 180k segments for ENUS (the next one is at 150k for ENUK). Still, I guess it's one of the biggest available with correct grammar, spelling and terminology right now, as I'm one of the leading official linguists who also worked on the language reform, and I'm programming an official professional grammar and spell checker for Luxembourgish with the help of some of my official linguist colleagues.
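    To put a number like "180k segments" on a TM, you can count the translation units in a TMX export. A minimal sketch (real-world TMX files may carry attributes and metadata this ignores, but `<tu>` elements are the translation units in the TMX format):

```python
import xml.etree.ElementTree as ET

def count_tmx_segments(tmx_text):
    """Count <tu> (translation unit) elements in TMX content."""
    root = ET.fromstring(tmx_text)
    return len(root.findall(".//tu"))

# Tiny illustrative TMX document with two translation units:
sample = """<tmx version="1.4">
  <header/>
  <body>
    <tu><tuv xml:lang="en"><seg>Hello</seg></tuv>
        <tuv xml:lang="lb"><seg>Moien</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>Goodbye</seg></tuv>
        <tuv xml:lang="lb"><seg>Äddi</seg></tuv></tu>
  </body>
</tmx>"""
# count_tmx_segments(sample) -> 2
```

    For files too large to hold in memory, `ET.iterparse` over the file would do the same count incrementally.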

    Well, similar does not mean like EN_US vs EN_UK: the spelling is quite different, but a lot of Luxembourgish terminology is based on German, about 80% of German grammar is used in Luxembourgish, and style and syntax are at about 90% similarity. So one could work with some kind of word/term replacement model to get quite decent results.

    OK, so by high-end cards you mean the top RTX-range cards that cost from 1,000 EUR up to 2,000+ EUR?

    Pascal

  • Edit: or even those graphics cards specifically used for crypto mining (pure data processing), as no graphics rendering should be needed?

  • I'm a bit out of the loop about current graphics card models (I haven't built a computer in ten years), but in the research literature people seem to be using RTX 2080/3090 etc. Mid-range cards might be sufficient. However, building a model from scratch with the relatively small amount of data you have would not be straightforward. It really belongs to the subfield of MT research called low-resource MT, where different kinds of tricks are used to compensate for the lack of parallel data. Here's a recent article on the kind of work that goes on in low-resource MT: https://www.statmt.org/wmt21/pdf/2021.wmt-1.44.pdf (this is for North Germanic languages, which I guess is somewhat applicable to your situation, since it involves a high-resource closely related language).

    However, if you are interested in MT for strictly practical reasons, e.g. to gain translation efficiency, you probably will not get sufficiently good output with the amount of data available. In that case I would simply suggest that you test the easy-to-implement alternatives, such as fine-tuning an OPUS-CAT English-German model with your data, just in case it turns out useful. If you're interested in this academically, you could probably find an MT researcher interested in building a model with your data and co-authoring an article on it, since there's not much other parallel data available for this language. Or, if the data (or parts of it) can be shared publicly, you can submit it to the OPUS corpus (https://opus.nlpl.eu/), and it will eventually be used to train new models.

  • Unfortunately, sharing these data is not possible, as the translations within the TMs are covered by NDAs. But thanks for the feedback in the PM section. I'll have to check a few more things, but I guess the last option (using German and replacing words) will probably be my choice, for the time being.

  •  

    I am using Opus-CAT MT for EN-DE, and while it struggles with grammatically complex sentences (DeepL is far superior) it is very useful for simple segments with custom terminology, like product names. I love it, and it speeds up my work a lot.

    I just wanted to point out that the two solutions mentioned here are the opposite ends of a spectrum: RWS's customizable on-premise solution is basically enterprise-level, and priced accordingly. If I were Raytheon and did not want to send all my data across the internet, I'd sure go for that. OPUS-CAT MT is open-source and free. There are many customizable solutions in between.

    There is ModernMT, which you can train with your TMs for free, but the service itself is paid. I don't know if they support Luxembourgish.

    AFAIK Amazon's MT is now customizable, as are Microsoft's, IBM's and Google's. You would have to find a provider that supports Luxembourgish, which might be a bit of a challenge. The basic translation rates are very low with these services, and training a custom model might cost a few hundred dollars, just to give you an idea.

    Don't know whether this list is accurate or up to date, but they claim there are 9 MT providers who support Luxembourgish: https://machinetranslate.org/luxembourgish

    For what it's worth.

    Daniel

  • Hi ,

    well, I need a tool that I can use offline with my own TMs; otherwise I'd breach quite a few NDAs and translation and PO contracts. ;) Furthermore, all the providers you listed, even Opus-CAT, rely on highly erroneous Luxembourgish references to build up their training data, so the quality is really far below average. This has to do with the fact that about 95% of the existing texts on the net are full of grammar, spelling and terminology errors and/or rely on the old spelling and grammar from before the language reform, and even most translators don't really know their basics, considering that I find about 1 to 10 errors per sentence (the worst was 11 errors in a sentence of 9 words) in texts from about 99% of the translators.

    OPUS is somewhat helpful to me, as I can use the German models and adapt the output to Luxembourgish with replacement options, but the results will remain quite bad until I'm able to create my own terminology rules and replacements from existing termbases via some automated process.

    I just need some support with more automatic output in order to translate even faster than now, until I manage to refine my own MT programming a bit more (but I lack the time to program it :( ). That should then settle many problems in the MT industry for all languages anyway (even those of NMT) and be far more flexible with new texts and languages with fewer translation references, even for rare languages like Luxembourgish, Icelandic and the like.

    Br,

    Pascal

  •  

    It might be that the enterprise solutions provided by RWS are indeed what you are looking for, although, as I said, OPUS-CAT MT and those are at opposite ends of a wide range of solutions, so it sounds funny to hear them compared.

    Looking at it positively, you seem to enjoy the privilege of working in a field with very little competition from MT.

    Thinking about OPUS-CAT MT: there are ways of "producing" training sentences. I never really looked into this more closely, because for EN-DE I enjoy very well-trained MT systems like DeepL, but I know it is done. So this might be a way forward for you.

    Daniel

  •  

    I just need some support with more automatic output in order to translate even faster than now, until I manage to refine my own MT programming a bit more (but I lack the time to program it :( ). That should then settle many problems in the MT industry for all languages anyway (even those of NMT) and be far more flexible with new texts and languages with fewer translation references, even for rare languages like Luxembourgish, Icelandic and the like.

    I'm looking forward to this revolutionary development.  It sounds as though you should make time to finish it!

  • I don’t say they are bad for all languages, but for Luxembourgish they definitely are. ;) I quite like Language Weaver for German as a reference for translations, as the results have not been that bad since the switch to NMT.

    The problem with training is that I would need a fresh model for that language, and according to Tommi Nieminen this is not really possible, so I have to work with German as the target language. With this restriction, I don't know how much sense training with Luxembourgish TMs, and thus mixing languages, makes.

    Br,

    Pascal

  • The problem is that I lack funding and time right now. Funding is already more or less on its way with a planned project in collaboration with the University of Vienna, but for this I need to set up some basic programming (well, quite complex, as some things need to be invented, or brought to code from inside my head) to pass a proof of concept, for which I lack time, as I need to concentrate on translating to make my living and feed my family. :(

    According to the researchers from the university, even if my presented model only worked to 80%, it would still lead to new breakthroughs in AI in fields other than linguistics.

    Br,

    Pascal
