Is it possible to use and train NMT with only your own TMs locally?

Hi,

I know and use Language Weaver, but is it possible to get an NMT engine trained on your own TMs that is NOT hosted in the cloud, that you can install and use locally or on your own servers, and that does not share data with remote servers or another company?

Does anyone have experience with such services? What are the costs for this?

Best regards,

Pascal

  • Hi  

    Yes, it's called Language Weaver Edge.

    While it is not aimed at the everyday user, more information about it can be found here: https://www.rws.com/language-weaver/edge/

    Have a good day

    Lyds

  • Hi, well, it would certainly help if LWE support got back to people who ask for further information. Right now, for lack of information and of any feedback on my request to support, I still can't tell whether it is what I'm looking for or not.

  • Hi ,

    Do you know if Opus-CAT MT can be installed on a server, or does it need to be installed on the same PC where Trados is installed? I use Trados from various locations (Professional network license) and need to be able to connect to the same dataset from everywhere, so installing it on my own server would be my first choice.

    Br,

    Pascal

  •  

    do you know if Opus-CAT MT can be installed on a server

    I don't know.  I run it locally on my own laptop.  The best person to answer this question would be  .

  • Hi Pascal,

    I'm the developer of OPUS-CAT, so I can fill in some details. First of all, OPUS-CAT can be used to fine-tune base OPUS models (which are available for most language pairs) with your own TMs; this corresponds to LW's Adaption on CPU option that Arnaud mentions in his reply. It's also possible to train models from scratch, but that would require a GPU and also setting up the OPUS-MT training pipeline.

    You can set up OPUS-CAT on a server (although just copying the fine-tuned model to any computer you run Trados on is also simple). You just need to configure it to allow incoming connections (by changing a setting), and of course you'll need to add the required exceptions to your firewall etc. However, I have only tested this over the local network. Once OPUS-CAT is running on the server, you can access it from your local Trados by changing the IP address in the Trados plugin settings.
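Before changing the address in the Trados plugin settings, it can help to confirm that the server is reachable at all from the client machine. The following is only a minimal sketch, assuming the engine listens on a TCP port; the address and port in the usage comment are hypothetical placeholders, not OPUS-CAT defaults.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


# Hypothetical usage: replace the address and port with the values
# configured on your own server.
# print(is_reachable("192.168.1.50", 8500))
```

If this returns False from the client but True on the server itself, the firewall exception or the "allow incoming connections" setting is the likely culprit.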

    In any case, I suggest you give the fine-tuning functionality in OPUS-CAT a try with your data (the setup is very simple, and you can ask for support here or at github.com/.../issues), and see how the output looks. At the very least, it will give you a point of comparison when evaluating similar paid services.

    -Tommi

  • Hi ,

    Ok, I’ll have to check that.

    One of the target languages I would like to use it for is very rare (Luxembourgish), so I guess I'll have to train it myself. Even if a trained model already exists, I fear that it relies on the old grammar and spelling from before the language reform and on the only available dictionary, which does not differentiate between slang/colloquial and official speech (as seen in many other MT plugins on the market, e.g. Google, MemSource …). It would then not be of much use, as it would require too much rewriting and replacing. Most available Luxembourgish solutions are crap: they only seem to do word-by-word translation, use too much French vocabulary instead of the real Luxembourgish terms, and mix in German terms if a word is unknown.

    Training on the server could be a problem as it does not have a GPU installed, but I'm checking with my hosting provider whether one can be added or whether migration to another server would be needed. In the worst case, I guess I could do the training on my PC and then upload everything to the server.

    Luxembourgish grammar and terminology are quite similar to German, so I even have an idea for getting better results, but I don't think any NMT handles that right now.

    I'll need some more information about terminology and glossary management, though. I haven't found much about this on the OPUS-CAT MT page yet.

    BTW: what are the minimum requirements for the GPU?

    Pascal

  • Hi  ,

    just a quick question regarding the Auto Adaptive Language Pairs in release 8.6:

    Will those who use it in Trados Studio via plugin also be able to provide feedback for the automatic retraining of the models?

    Thanks, 

    Natalie

  • Hi,

    There are OPUS models for eng-ltz and ltz-eng in the OPUS-CAT model repository (they can be installed via the OPUS-CAT UI), but they seem to be mostly trained with crawled data, so the quality is probably fairly bad. I doubt anyone else will have much more ltz data either, though, so unless you have a very big corpus yourself (at least a million segments), good quality MT is probably not going to be possible.
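To see how far a TM is from that order of magnitude, one option is to count the usable segment pairs in a TMX export. This is a rough sketch using only the Python standard library; it assumes a standard TMX layout (`tu` / `tuv` / `seg` elements), and the function name and language codes are illustrative choices.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_segments(tmx_path, source_lang, target_lang):
    """Count translation units with non-empty text in both languages."""
    source_lang = source_lang.lower()
    target_lang = target_lang.lower()
    count = 0
    # iterparse streams the file, so large TMX exports need not fit in memory.
    for _event, elem in ET.iterparse(tmx_path, events=("end",)):
        if elem.tag != "tu":
            continue
        langs = set()
        for tuv in elem.findall("tuv"):
            # TMX 1.4 uses xml:lang; older exports may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and "".join(seg.itertext()).strip():
                langs.add(lang)
        if source_lang in langs and target_lang in langs:
            count += 1
        elem.clear()  # release the processed unit
    return count
```

For example, `count_segments("my_tm.tmx", "en-US", "lb-LU")` would return the number of units that have both an English and a Luxembourgish segment, provided the export uses exactly those language codes.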

    If German is quite similar, you might get the best results by fine-tuning a German model using the Luxembourgish data you have. It might also help to translate monolingual Luxembourgish data with a German model and then use that synthetic data to supplement your training data (using the correct Luxembourgish as target text and the potentially faulty MT as source text; this method is called backtranslation).
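The data preparation step of backtranslation can be sketched as follows. This is an illustration only: `translate` is a hypothetical placeholder for whatever MT call is actually used (it is not an OPUS-CAT API), and the function name is made up for this example.

```python
from typing import Callable, Iterable, List, Tuple


def build_backtranslated_pairs(
    target_sentences: Iterable[str],
    translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair machine-translated source text with genuine target sentences.

    target_sentences: correct monolingual text in the target language.
    translate: hypothetical callable that machine-translates one sentence
               into the source language (output may be faulty).
    """
    pairs = []
    for target in target_sentences:
        target = target.strip()
        if not target:
            continue  # skip blank lines
        synthetic_source = translate(target).strip()
        if synthetic_source:
            # The synthetic text goes on the source side, the human-written
            # text on the target side, so the model learns to produce the
            # correct language.
            pairs.append((synthetic_source, target))
    return pairs
```

The resulting pairs would then be added to the real TM data before fine-tuning.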

    I'm currently working on adding terminology management to OPUS-CAT, although there is already a possibility to use edit rules to perform string replacements with the machine translation output. Termbase support should be released this year.

    As for the minimum requirements for GPU, since there's not enough data there probably isn't a point to the GPU, but you would need a relatively high-end card from NVIDIA (I can't recommend a specific one offhand, since I train models on computing clusters instead of locally).

    -Tommi

  • Hi Tommi,

    If they are crawled, then please dump them; unfortunately they would do more harm than good. You can count 10 errors or more per sentence of 15 words, and that is still a kind estimate. I've seen professional translators with even more errors in their translations, and they didn't use MT at the time; I have QAed most of the available Luxembourgish translators so far. The worst I've seen was 15 errors in 10 words by a so-called professional translator, and most crawled texts are not even written by linguists. Of the roughly 40 available LB translators, only 3 or 4 others offer quite good quality, but even they are still above the international standard of 3 errors per 1k words. The worst I've seen on the net so far is people who do not even know how to spell their own language correctly, and by this I don't mean one or two typos in a word but really awkward spelling that is not even close to the correct word.

    I've got a "pretty big" TM for a Luxembourgish translator, but as I translate from 6 source languages and also translate into my other mother tongues, German and French, my biggest TM for Luxembourgish after 18 years as a translator is still no bigger than 180k for ENUS (the next one is at 150k for ENUK). Still, I guess it's one of the biggest available with correct grammar, spelling and terminology right now, as I'm one of the leading official linguists who also worked on the language reform, and I'm programming an official professional grammar and spell checker for Luxembourgish with the help of some of my official linguist colleagues.

    Well, similar does not mean like EN_US vs EN_UK: the spelling is quite different, but a lot of Luxembourgish terminology is based on German terminology, about 80% of German grammar is used in Luxembourgish, and style and syntax are about 90% similar. So one could work with some kind of word/term replacement model to get quite decent results.

    OK, so by high-end cards you mean roughly the top 5 RTX-range cards that cost from 1,000 EUR up to 2k+?

    Pascal

  • Edit: or even those graphics cards used specifically for crypto mining (data processing), as no graphics rendering should be needed?

  • I'm a bit out of the loop about current graphics card models (I haven't built a computer in ten years), but in the research literature people seem to be using RTX 2080/3090 etc. Mid-range cards might be sufficient. However, building a model from scratch with the relatively small amount of data you have would not be straightforward. It really belongs to the subfield of MT research called low-resource MT, where different kinds of tricks are used to compensate for the lack of parallel data. Here's a recent article on the kind of work that goes on in low-resource MT: https://www.statmt.org/wmt21/pdf/2021.wmt-1.44.pdf (this is for North Germanic languages, which I guess is somewhat applicable to your situation, since it involves a high-resource closely related language).

    However, if you are interested in MT for strictly practical reasons, e.g. to get more translation efficiency, you probably will not get sufficient quality output with the amount of data available. In that case I would simply suggest that you test out the easy-to-implement alternatives, such as fine-tuning an OPUS-CAT English-German model with your data, just in case it comes out useful. If you're interested in this academically, you could probably find some MT researcher interested in building a model with your data and co-authoring an article on it, since there's not much other parallel data available for this language. Or if the data (or parts of it) can be shared publicly, you can submit it to the OPUS corpus (https://opus.nlpl.eu/), and it will eventually be used to train new models.

  • Unfortunately, sharing this data is not possible, as the translations within the TMs are covered by NDAs. But thanks for the feedback in the PM section. I'll have to check a few more things, but I guess the last option (using German and replacing words) will probably be my choice, for the time being.

  •  

    I am using Opus-CAT MT for EN-DE, and while it struggles with grammatically complex sentences (DeepL is far superior) it is very useful for simple segments with custom terminology, like product names. I love it, and it speeds up my work a lot.

    I just wanted to point out that the two solutions mentioned here are at opposite ends of a spectrum: RWS's customizable on-premise solution is basically enterprise-level, and priced accordingly. If I were Raytheon and did not want to send all my data across the internet, I'd sure go for that. OPUS-CAT MT is open-source and free. There are many customizable solutions in between.

    There is ModernMT, which you can train from your TMs for free, but the service is paid. Don't know if they support Luxembourgish.

    AFAIK Amazon's MT is now customizable, so is Microsoft's, IBM's and Google's. You would have to find a provider that supports Luxembourgish, that might be a bit of a challenge. The basic translation rates are very low with these services, and training a custom model might be a few hundred dollars, just to give you an idea.

    Don't know whether this list is accurate or up-to-date, but they claim that there are 9 MT providers who support Lux: https://machinetranslate.org/luxembourgish

    For what it's worth.

    Daniel

  • Hi ,

    Well, I need a tool that I can use offline with my own TMs, otherwise I'd breach quite a few NDAs, translation and PO contracts. ;) Furthermore, all the providers you listed, even OPUS-CAT, rely on highly erroneous Luxembourgish references to build up their TM, so quality is really way below average. The reason is that about 95% of the existing texts on the net are full of grammar, spelling and terminology errors and/or rely on the old spelling and grammar from before the language reform. Even most translators don't really know their basics: I find about 1 to 10 errors per sentence (the worst was 11 errors in a sentence of 9 words) in texts from about 99% of the translators.

    OPUS is somewhat helpful to me, as I can use the German models and adapt the output to Luxembourgish with replacement options, but the results will stay quite bad until I am able to create my own terminology rules and replacements from existing termbases in some automated way.
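Generating replacement rules from a termbase and applying them to MT output could look roughly like this. It is a minimal sketch, assuming the termbase has been exported as tab-separated "German term, Luxembourgish term" lines; the function names are made up for this example, the term pairs in the usage note are illustrative only, and plain whole-word replacement does not handle inflection or word order.

```python
import re
from typing import Dict, Iterable


def load_term_pairs(lines: Iterable[str]) -> Dict[str, str]:
    """Read 'source<TAB>target' lines into a dict, skipping malformed rows."""
    pairs = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and parts[0].strip() and parts[1].strip():
            pairs[parts[0].strip()] = parts[1].strip()
    return pairs


def apply_rules(text: str, term_pairs: Dict[str, str]) -> str:
    """Replace whole-word occurrences of each source term (case-sensitive)."""
    if not term_pairs:
        return text
    # Longest terms first, so multi-word terms win over their parts.
    terms = sorted(term_pairs, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in terms) + r")(?!\w)"
    )
    return pattern.sub(lambda m: term_pairs[m.group(0)], text)
```

With illustrative pairs such as `Sprache` to `Sprooch` and `Straße` to `Strooss`, `apply_rules("Die Sprache der Straße.", pairs)` replaces only the two listed nouns and leaves a longer word such as `Sprachen` untouched.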

    I just need to get some support with more automatic output in order to translate even faster than now, until I manage to refine my own MT programming a bit more (but I lack the time to program it :( ). That should then settle many problems in the MT industry for all languages anyway (even those of NMT) and be far more flexible with new texts and with languages that have fewer translation references, even rare languages like Luxembourgish, Icelandic and the like.

    Br,

    Pascal

  •  

    It might be that the enterprise solutions provided by RWS are indeed what you are looking for, although, as I said, OPUS-CAT MT and those are at the opposite points of a big range of solutions, so it sounds funny to hear them compared.

    Looking at it positively, you seem to enjoy the privilege of working in a field with very little competition from MT.

    Thinking about OPUS-CAT MT: There are ways of "producing" training sentences. I never really looked into this more closely because with EN-DE I enjoy very well-trained MT systems like DeepL, but I know it is done. So this might be a way for you to go forward.

    Daniel

  •  

    I just need to get some support with more automatic output in order to translate even faster than now, until I manage to refine my own MT programming a bit more (but I lack the time to program it :( ). That should then settle many problems in the MT industry for all languages anyway (even those of NMT) and be far more flexible with new texts and with languages that have fewer translation references, even rare languages like Luxembourgish, Icelandic and the like.

    I'm looking forward to this revolutionary development.  It sounds as though you should make time to finish it!

  • I'm not saying they are bad for all languages, but for Luxembourgish they definitely are. ;) I quite like Language Weaver for German as a reference for translations, as the results have not been that bad since NMT.

    The problem with training is that I would need a fresh model for that language, and according to Tommi Nieminen this is not really possible, so I have to work with German as the target language. With this restriction, I don't know how much sense it makes to train with Luxembourgish TMs and thus mix languages.

    Br,

    Pascal

  • The problem is that I lack funding and time right now. Funding is already more or less on its way with a planned project in collaboration with the University of Vienna, but for this I need to set up some basic programming to pass a proof of concept (well, quite complex programming, as some things need to be invented, or brought into code from inside my head). I lack the time for that, as I need to concentrate on translating to make my living and feed my family. :(

    According to the researchers from the university, even if the model I presented only worked to 80%, it would still lead to some new breakthroughs with AI in fields other than linguistics.

    Br,

    Pascal
