Is the Future of BLEU Getting Paler?

There is little debate: the machine translation research and practitioner communities are in a funk about BLEU. From recent webinars to professional interviews and scholarly publications, BLEU is being called on the carpet for its technical shortcomings in the face of a rapidly developing field, as well as for the lack of insight it provides to different consumers such as purchasers of MT services or systems.

BLEU itself is used widely, especially in the MT research community, as an outcome measure for evaluating MT. Yet even in that setting, there is considerable rethinking and re-evaluation of the metric, and BLEU has been an active topic of critical discussion and research for some years. And the issue is not limited, of course, to machine translation: the metric is also a topic in NLP and natural language generation discussions generally.

BLEU’s strengths and shortcomings are well known. At its core, BLEU is a string-matching algorithm for use in evaluating MT output and is not per se a measure of translation quality. That said, there is no doubt that automated or calculated metrics are of great value, as total global MT output approaches levels of one trillion words per day.
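To make the "string matching" point concrete, here is a minimal sketch of a sentence-level BLEU calculation. It assumes a single reference, whitespace tokenization, and no smoothing, so it is an illustration of the idea rather than a replacement for a standard scoring tool; the function names are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis, reference, n):
    """Return (matched, total) n-gram counts, clipping matches by the reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())


def simple_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed, single-reference BLEU on whitespace tokens, in [0.0, 1.0]."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matched, total = modified_precision(hyp, ref, n)
        if matched == 0:
            # Without smoothing, one empty n-gram order zeroes the whole score.
            return 0.0
        log_sum += math.log(matched / total) / max_n
    # The brevity penalty discourages outputs shorter than the reference.
    brevity_penalty = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(log_sum)


reference = "seriously this is not ok"
print(simple_bleu("seriously this is not ok", reference))       # exact copy
print(simple_bleu("seriously this is not good", reference))     # one word differs
print(simple_bleu("honestly this is unacceptable", reference))  # paraphrase
```

Under these assumptions the paraphrase scores zero because it shares no three-word sequence with the reference, even though a human reader might accept it as a translation. That is the sense in which the score measures overlap with a reference string rather than quality itself.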

And few would dispute that, in producing and evaluating MT or translation in general, context matters. A general-purpose, public-facing MT engine designed for broad coverage among users and use cases is just that: general-purpose, and likely more challenged by perennial source-language difficulties such as domain-specific style and terminology, informal language usage, regional language variations, and other issues.

It is no secret that many MT products are trained (at least initially) on publicly available research data and that there are, overall, real thematic biases in those datasets. News, current events, governmental, and parliamentary datasets are available across a wide array of language pairs, as are smaller amounts of data from domains such as legal, entertainment, and lecture source materials such as TED Talks. Increasingly, datasets are available in the IT and technical domains, but there are few public bilingual datasets suitable for major business applications of MT technology such as e-commerce, communication and collaboration, or customer service.

Researchers and applied practitioners have all benefited from these publicly available resources; there is no doubt about that! But the case for clarity is perhaps most evident in the MT practitioner community.

For example, enterprise customers hoping to purchase machine translation services face a dilemma: how might the enterprise evaluate an MT product or service for its particular domain, with more nuance and depth than simply relying on marketing materials boasting scores or gains in BLEU or LEPOR?

This business case is key: imagine yourself as a potential enterprise customer, hoping to utilize MT for communications within the company—in particular, customer care or “customer journey” interchanges. How might you evaluate major vendors of MT services specific to your use case and needs? 

And how do general-purpose engines perform in enterprise cases such as e-commerce product listings, technical support knowledgebase content, social media analysis (Twitter, FB, Instagram), and user feedback/reviews? In particular, “utterances” from customers and customer support personnel in these settings are authentic language, with all of its “messiness.” 

The UTA research group has recently been exploring MT engine performance on customer support content, building a specialized test set compiled from source corpora including email and customer communications, communications via social media, and online customer support. The strings from the starting test set were translated into seven languages (French, German, Hindi, Korean, Portuguese, Russian, Spanish) by professional translators. The translated sentences from the test set were then used as translation prompts in seven language pairs (English-French, English-German, English-Hindi, English-Korean, English-Portuguese, English-Russian, English-Spanish) for four major, publicly available MT engines, accessed via API or web interface. At both the corpus level and the individual string level, BLEU, METEOR, and TER scores were generated for each major engine and language pair (not all of the seven languages were represented in all engine products).

Our overall question was: does BLEU (or any of the other automated scores) support, say, the choice of engine A over engine B for enterprise purchase when the use case centers on customer-facing and customer-generated communications? To be sure, the output scoring presented a muddled picture. The composite scores of the general-purpose engines clustered within approximately 5-8 BLEU points of each other in most languages. And although we used a domain-specific test set, little in the results would have provided the enterprise-level customer with a clear path forward. As Kirti Vashee has pointed out recently, in responding effectively to the realities of the digital world, “5 BLEU points this way or that is negligible in most high-value business use cases.”

What are some of the challenges of authentic customer language? Two known challenges to MT are the informality of utterances and their emotive content. The one-two punch of informal, emotion-laden customer utterances poses a particularly challenging case!

As we reviewed in a recent webinar, customer-generated strings in support conversations or online interactions present a translator with a variety of expressions of emotion, tone, humor, and sarcasm, all embedded within a more informal and Internet-influenced style of language. Some examples included:

Support…I f***ing hate you all. [Not redacted in the original.]
Those late in the day deliveries go “missing” a lot.
Nope didn’t turn up…just as expected…now what dude?
I feel you man, have a good rest of your day!
Seriously, this is not OK.
A bunch of robots who repeat the same thing over & over.

Here one can readily see how an engine trained primarily on formal governmental or newspaper sources would be quickly challenged.

One emerging practice in the field is to combine an automated metric such as BLEU with human evaluation on a smaller data set, to confirm that the automated metrics are useful and provide critical insight, especially if the evaluation is used to compare MT systems. Kirti Vashee, Alon Lavie, and Daniel Marcu have all pointed this out recently.
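One simple way to carry out that kind of check is to correlate per-string automated scores with human ratings of the same strings. The sketch below is only an illustration under stated assumptions: the scores and ratings are invented placeholder values, not results from our study, and the helper names are hypothetical. It computes Pearson and Spearman correlations in plain Python.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical placeholder data: one automated score and one human rating
# (1 = unusable, 5 = excellent) for each of six translated strings.
metric_scores = [0.12, 0.35, 0.40, 0.58, 0.71, 0.22]
human_ratings = [2, 3, 2, 4, 5, 3]

print("Pearson: ", round(pearson(metric_scores, human_ratings), 3))
print("Spearman:", round(spearman(metric_scores, human_ratings), 3))
```

A weak or unstable correlation on a sample like this would be a signal to lean more heavily on the human ratings before using the automated scores to rank engines. A real evaluation would also need a larger sample and more than one rater.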

One developing, more nuanced understanding of the value of BLEU may be this: automated scores are most useful during MT research and system development, where they are far and away the most widely cited standard. The recent Machine Translation Summit XVII in Dublin, for example, had almost 500 mentions or references to BLEU in the research proceedings alone.

But this measure may be less accurate or insightful when broadly comparing different MT systems within the practitioner world, and perhaps more insightful again to both researcher and practitioner when paired with human or other ratings. As one early MT researcher has noted, “BLEU is easy to criticize, but hard to get away from!”

Discussions at the recent TAUS Global Content Conference 2019 further developed the ideas of MT engine specialization in the context of the modern enterprise content workflow. Presenters such as SDL and others offered future visions of content development, personalization, and use in a multilingual world. These future workflows may contain hundreds or thousands of specialized, specially trained, and uniquely maintained automated translation engines and other linguistic algorithms, as the content is created, managed, evaluated, and disseminated globally.

There is little doubt that the automated evaluation of translation will play a key role in this emerging vision. However, better understanding of the field’s de facto metrics and the broader MT evaluation process in this context is clearly imperative. 

The UTA research group is also interested in MT business cases specific to education and higher education. For example, millions of users daily make use of learning materials such as MOOCs, educational content that attracts users across borders, languages, and cultures. A significant portion of international learners come to, and potentially struggle with, English-language content in edX or other MOOC courses, and thousands of MOOC offerings exist in the world’s languages, untranslated for English speakers. What role might machine translation play in this educational endeavor?

To hear more about this topic, be sure to register for SDL Connect 2019. Dr. Pete Smith, along with Kirti Vashee, will be presenting “Quality that Matters: Best Practices for Assessing Machine Translation Quality.”