Opus CAT MT fine-tuning graph

Hi

Thank you for writing the OPUS-CAT MT plugin! It is great.

I have a question regarding the graph that shows fine-tuning progress... how do I read it? Sometimes I get lots of "validation" data points; sometimes, like in the model below, there are just a few. Why? And what do "in domain" and "out of domain" mean?

As for EN-DE translations, after training the model on my existing TMs, the results are great. The one drawback is that the system does not handle tags at all, but the translation quality is very good as long as the text is not too complex (for very complex text, DeepL still produces better results). I am impressed by how it picks up the custom terminology from the TMs used for training.

So thanks a lot. If you intend to have a German version of your web pages or UI, I would be happy to help.

Daniel

  • Hi Daniel,

    Thanks for the feedback, it's great that you find the fine-tuning feature useful. There's a bit of info on the fine-tuning progress on this page: https://helsinki-nlp.github.io/OPUS-CAT/finetuneprogress. I wrote that page a while ago, and now that I read it again, it does not actually explain the concepts. I'll try to cover the concepts below, and I'll add them to the help page later.

    In NMT (and most machine learning), there is a concept called the validation set. It's basically a pair of text files: the first file contains a set of source language sentences, and the second file contains the translations of those sentences. The validation set is used to monitor the progress of the fine-tuning (or of other, more extensive training of NMT models). This is done by periodically translating the source sentences of the validation set with the model being trained, and then comparing the produced machine translations with the translations in the second file of the validation set (which should contain good human translations). The machine translations and the human translations are compared using automatic metrics. In OPUS-CAT MT Engine we use just the BLEU metric, which is a standard metric in MT. It has some well-known flaws, but it works well enough for tracking relative progress during training.
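
    To make the validation step concrete, here is a minimal sketch of a single BLEU measurement using the sacreBLEU Python library (the file names are hypothetical placeholders, and OPUS-CAT itself relies on Marian's built-in validation rather than this exact code):

    ```python
    # A minimal sketch of one validation event: score the model's current
    # translations of the validation source against the human references.
    # File names are hypothetical placeholders.
    import sacrebleu

    with open("validation.mt-output.txt", encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]  # machine translations
    with open("validation.target.txt", encoding="utf-8") as f:
        references = [line.strip() for line in f]  # human translations

    # corpus_bleu takes the hypotheses and a list of reference streams
    # (here a single stream, i.e. one reference per sentence).
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU: {bleu.score:.1f}")  # one data point on the progress graph
    ```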

    The data points you see in the progress graphs are instances of these periodic validations with the BLEU metric. So on the Y axis you have BLEU, and on the X axis the validation events in chronological order. Conventionally, the validation graph is used to decide when to stop fine-tuning: once the graph stops showing improvements in the BLEU score, you assume that further fine-tuning will not bring more improvements. OPUS-CAT MT Engine works a bit differently: the fine-tuning stops after all the fine-tuning material has been processed once. At that point the graph usually still shows upward progress, but my own testing has indicated that the benefits of continuing fine-tuning beyond that point are probably not worth it. The fine-tuning configuration can be altered in the OPUS-CAT MT Engine settings, but I advise against it, since it's very difficult to see what real effect the changes have.
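
    For reference, the conventional stop-when-BLEU-plateaus check can be sketched as below (this illustrates the general convention only; as noted, OPUS-CAT simply stops after one pass over the material):

    ```python
    def should_stop(bleu_history, patience=3):
        """Conventional early stopping: stop once no new best BLEU score
        has been reached in the last `patience` validation events.
        (Illustrative only; not what OPUS-CAT MT Engine does.)"""
        if len(bleu_history) <= patience:
            return False
        # Stop if the best score so far lies outside the recent window.
        return max(bleu_history) not in bleu_history[-patience:]

    print(should_stop([20.1, 22.5, 23.0, 22.8, 22.9, 22.7]))  # True
    ```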

    By default, validation is performed whenever approximately 40,000 words (or word fragments) of fine-tuning material have been processed. This means the number of validation events on the X axis depends on the number of words in the fine-tuning material. So since your image shows four validation events, the fine-tuning material probably contained 80,000 to 120,000 words of text (validation is also performed at the start and end of fine-tuning, so the graph does not show the exact amount). With more fine-tuning material, you will see more validation events.
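
    That back-of-the-envelope estimate can be written out as follows (assuming one validation at the start, one per full 40,000-word interval, and one at the end; the exact scheduling inside OPUS-CAT may differ slightly):

    ```python
    def estimated_word_range(validation_events, interval=40_000):
        """Rough range of fine-tuning words implied by the number of
        validation events, under the scheduling assumption above."""
        full_intervals = validation_events - 2  # minus the start and end events
        return full_intervals * interval, (full_intervals + 1) * interval

    print(estimated_word_range(4))  # (80000, 120000), as in the graph above
    ```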

    In OPUS-CAT MT Engine you also have two validation sets: in-domain and out-of-domain. The in-domain set is split off from the fine-tuning material, and its purpose is to show how the fine-tuning affects translation performance on text that is similar to the fine-tuning material. The whole point of fine-tuning is to improve performance on this in-domain data, i.e. the data specific to a client or a text genre. The out-of-domain validation set consists of generic sentences that do not belong to any specific field of expertise. It is extracted from the Tatoeba corpus, which contains very simple sentences, like the ones you would find in an old-fashioned language textbook (e.g. "Tom enjoys reading books in French"). The purpose of the out-of-domain validation set is to indicate whether the model's performance on generic texts is being adversely affected by the fine-tuning. Fine-tuning should not affect the out-of-domain performance too much; a significant drop generally indicates problems.
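
    To illustrate where the in-domain set comes from, splitting it off from the fine-tuning material might look something like this (a random hold-out split with a hypothetical set size; the exact method and size OPUS-CAT uses may differ):

    ```python
    import random

    def split_in_domain_validation(segments, validation_size=200, seed=42):
        """Hold out a random sample of (source, target) segment pairs as
        the in-domain validation set; the rest is used for fine-tuning.
        The size and method are illustrative assumptions."""
        rng = random.Random(seed)
        shuffled = list(segments)
        rng.shuffle(shuffled)
        return shuffled[validation_size:], shuffled[:validation_size]

    # fine_tuning_data, in_domain_validation = split_in_domain_validation(tm_pairs)
    ```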

    That's a lot of explanation, but the main take-away is that the progress graph can be used to check for problems in the fine-tuning process. For instance, if the in-domain performance is not improving much, the fine-tuning material is probably too generic and heterogeneous. If the out-of-domain performance drops significantly, the fine-tuning material is probably very different from the usual texts (this might happen, for instance, if you accidentally fine-tune an English to German model with German to English data). Since OPUS-CAT MT Engine stops the fine-tuning after processing the material once, you will generally see a graph like the one in your image: the in-domain line trending gently upwards, while the out-of-domain line trends gently downwards (and usually you would expect the in-domain improvement to be slightly greater than the out-of-domain decrease).
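
    Those two checks can be written down as a simple rule of thumb (the thresholds here are made-up illustrative values, not anything OPUS-CAT uses internally):

    ```python
    def diagnose(in_domain_bleu, out_of_domain_bleu):
        """Illustrative sanity checks on the two validation curves.
        The thresholds are invented for the example."""
        warnings = []
        if in_domain_bleu[-1] - in_domain_bleu[0] < 1.0:
            warnings.append("Little in-domain gain: material may be too generic.")
        if out_of_domain_bleu[0] - out_of_domain_bleu[-1] > 5.0:
            warnings.append("Sharp out-of-domain drop: check the material "
                            "(e.g. language direction).")
        return warnings or ["Curves look normal for a single fine-tuning pass."]
    ```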

    -Tommi

  • You answered all my questions. Thank you very, very much!

    Daniel