How to evaluate the quality of automatic translation

Krzysztof Jassem July 14, 2020

How can we determine whether an automatic translation system is doing its job – that is, translating texts correctly and preserving the original meaning? How should we compare the quality of two translation systems so as to choose the one that best meets our needs? I will try to answer these questions in this post.

Human evaluation

A translation may be evaluated by a human. Here, some predefined quality scale is used – usually a five-point scale, where a score of 5 denotes the highest quality. The translation of each sentence is scored separately, and finally the arithmetic mean of the scores for all sentences is calculated. Often, two components of quality are distinguished: the faithfulness of the translation to the original, and the correctness and fluency of the translated text.
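
As a minimal sketch of this averaging step, the Python snippet below computes the mean score for each of the two quality components. The five sentence scores and the variable names are invented purely for illustration; they do not come from any real evaluation.

```python
from statistics import mean

# Hypothetical scores (1-5) given by an evaluator to five translated sentences.
# These numbers are made up for illustration only.
faithfulness_scores = [5, 4, 4, 3, 5]
correctness_scores = [4, 4, 3, 3, 5]

# The overall score for each component is the arithmetic mean over all sentences.
faithfulness = mean(faithfulness_scores)
correctness = mean(correctness_scores)

print(f"faithfulness: {faithfulness:.2f}")
print(f"correctness:  {correctness:.2f}")
```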

Automatic evaluation using WER

Human evaluation is, however, expensive and time-consuming, and it is also subjective in nature. A much cheaper solution, and one that is independent of human moods and biases, is automatic evaluation. In this case the translation is compared with a “gold standard” – an ideal translation produced by specialists. In the 20th century a popular metric used for this kind of evaluation was the Word Error Rate (WER). It is computed from the minimum number of changes – insertion, deletion or substitution of a word – that would need to be made to the sentence proposed by the system in order to obtain the “gold standard” version. This number is then divided by the number of words in the gold standard translation.

Let’s see how this method works by looking at a concrete example:

Sentence to be translated: Prawo zaskarżania nie przysługuje byłym członkom zarządu spółki.

“Gold standard” translation: The right to appeal shall not be granted to former members of the management board.

Translation proposed by the system: The right of appeal is not available to former members of the management board.

To go from the machine translation to the gold standard, we need to make three substitutions of words (of → to, is → shall, available → granted) and to insert one additional word (be). The length of the gold standard version is 15 words. Hence the WER for this proposed translation is 4/15, or about 0.27. Clearly, the higher the value of the WER, the lower the quality of the translation.
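
The calculation above can be sketched in Python as a word-level edit distance divided by the length of the gold standard. This is a minimal illustration, not the implementation of any particular evaluation tool: the function name `word_error_rate` is hypothetical, and tokenization is assumed to be a plain split on whitespace.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Minimum number of word edits turning hypothesis into reference,
    divided by the number of words in the reference."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the hypothesis words seen so far
    # (initially none) into the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]  # turning i hypothesis words into nothing takes i deletions
        for j, ref_word in enumerate(ref, start=1):
            substitution_cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,                      # delete a hypothesis word
                curr[j - 1] + 1,                  # insert a reference word
                prev[j - 1] + substitution_cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = ("The right to appeal shall not be granted "
             "to former members of the management board.")
hypothesis = ("The right of appeal is not available "
              "to former members of the management board.")

wer = word_error_rate(hypothesis, reference)
print(f"{wer:.3f}")  # prints 0.267, i.e. 4/15
```

Dynamic programming is used here because the number of edits has to be the smallest possible one; counting differences position by position would overestimate the error whenever a word is inserted or deleted.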

Automatic evaluation using BLEU

Today, the metric most commonly used in automatic evaluation is BLEU (Bilingual Evaluation Understudy), proposed by researchers at IBM in 2002. Unlike WER, a higher BLEU value means a better translation: roughly speaking, it indicates what proportion of the machine translation corresponds to the gold standard. For instance, in the example given above, the fragments that correspond are “The right” and “to former members of the management board”; the remaining elements of the translation deviate from the standard. The value of the BLEU metric always lies in the interval from 0 to 1, and is often stated as a percentage.
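
To make the idea of “corresponding fragments” more concrete, here is a simplified, sentence-level sketch in Python. It follows the usual definition of BLEU (clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty), but it is only an illustration under several assumptions: a single reference, whitespace tokenization and no smoothing. Published BLEU scores are normally computed over a whole test set with standard tooling, so the value returned here is not comparable with the figures in the tables. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu_sketch(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified single-reference, sentence-level BLEU without smoothing."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clipping: an n-gram is credited at most as often as it occurs
        # in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one zero precision gives BLEU 0
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: translations shorter than the reference are penalised.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = ("The right to appeal shall not be granted "
             "to former members of the management board.")
hypothesis = ("The right of appeal is not available "
              "to former members of the management board.")

score = sentence_bleu_sketch(hypothesis, reference)
print(f"{100 * score:.1f}%")
```

The long matching fragment at the end of the sentence is what contributes the matching 3-grams and 4-grams; a translation that shared only isolated words with the gold standard would score 0 in this unsmoothed sketch.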

Translation quality achieved by the world’s leading systems

The tables below contain the results of competitions in translating press reports, held at the Conference on Machine Translation (WMT) in 2017 and 2018. The comparison shows that there was a significant improvement in quality over a single year.

name                  BLEU
uedin-nmt             37.00
KIT                   36.48
RWTH-nmt-ensemble     35.09
online-A              34.97
SYSTRAN               34.88
online-B              34.37
LIUM-NMT              31.75
C-3MA                 30.64
online-G              30.09
TALP-UPC              29.95
online-F              19.49
Table 1: Competition results, WMT 2017

name                  BLEU
RWTH                  50.17
UCAM                  49.88
NTT                   48.71
JHU                   47.57
MLLP-UPV              47.51
uedin                 45.87
Ubiqus-NMT            45.57
online-B              45.47
online-A              43.34
LMU-nmt               43.17
online-Y              41.69
NJUNMT-private        39.72
online-G              36.39
online-F              23.86
RWTH-UNSUPER          20.35
LMU-unsup             19.12
Table 2: Competition results, WMT 2018

Translating Polish

In 2018, a group of researchers at Adam Mickiewicz University in Poznań, in collaboration with the company POLENG (now PWN AI), carried out two experiments to evaluate the quality of translations of texts in a particular field, from and into Polish.

Specialist translation – from a broad field

In the first experiment the subject field was defined in broad terms, and the number of training texts supplied by the client was relatively small. PWN AI engineers independently collected a sufficient number of texts to enable the system to be trained.

The final training set consisted of:

  • 60,000 sentence pairs supplied by the client;
  • 7.2 million sentence pairs collected by PWN AI engineers.

The system was trained for translation from Polish to English and from English to Polish. The results of the experiments, in the form of values of the BLEU metric expressed as percentages, were as follows:

Polish–English translation    English–Polish translation
35.80                         39.90
Table 3: Automatic evaluation of specialist translation from and into Polish

The translation results were also subjected to human evaluation, where approximately 500 sentences were assessed on a scale of 1 to 5, considering two aspects of quality: the faithfulness of the translation and its correctness. The following results were obtained:

aspect          Polish–English translation    English–Polish translation
faithfulness    4.23                          3.90
correctness     3.94                          3.74
Table 4: Human evaluation of specialist translation from and into Polish

It is interesting to note that according to the automatic BLEU metric the translations from English to Polish were assessed as being of better quality, while in human evaluation the translations from Polish to English scored higher. This may be because the evaluators were Polish, and took a more critical view of translations written in their native language.

Highly specialised translation – from a narrow field

The second experiment used a training set containing 1.2 million sentences, all supplied by one client. This time a comparison was made between two translation systems: one statistical, and one based on neural networks. Only translation from English to Polish was tested. A similar evaluation was also made for the Google Translate system, which is intended to handle general texts. The aim was to determine which of the translation methods produced better results when only a relatively small set of training texts was available.

The following results were obtained:

system              BLEU (%)
statistical         55.23
neural network      51.66
Google Translate    21.37
Table 5: Comparison of quality of specialist translation into Polish

Both systems that had been trained on specialist texts produced results more than twice as high as that of the system designed for general translation. The results obtained based on a small specialist corpus were also better than those from the previous experiment, where the system was trained using a larger training set of texts from a more broadly defined field.

Surprisingly, the statistical system returned better results than the neural network system. Given this fact, it was decided to carry out an additional human evaluation. Here, two independent verifiers compared the translations supplied by the two systems, without being told which came from which type of system. For each of 4,000 translation pairs, the verifier indicated which of the translations was superior, or declared a tie. The results were as follows:

winner                        number of sentences    percentage
statistical translation       829                    20.73%
neural network translation    1248                   31.20%
tie                           1923                   48.08%
Table 6: Human comparison of the quality of statistical and neural network translation systems
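
The percentages in Table 6 follow directly from the sentence counts. The short Python sketch below recomputes them and, as an additional reading of the same data, shows how often the neural network system was preferred among the pairs where the verifiers did not declare a tie. The helper name `percentage` is hypothetical; exact decimal arithmetic is used so that values such as 20.725 are rounded half up, as in the table.

```python
from decimal import Decimal, ROUND_HALF_UP

# Sentence counts taken from Table 6.
counts = {"statistical": 829, "neural network": 1248, "tie": 1923}
total = sum(counts.values())


def percentage(part: int, whole: int) -> Decimal:
    """Share of part in whole as a percentage, rounded half up to two decimals."""
    value = Decimal(part) * 100 / Decimal(whole)
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


shares = {name: percentage(n, total) for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share}%")

# Among the pairs with a clear winner, how often did the neural system win?
decided = counts["statistical"] + counts["neural network"]
neural_share_of_decided = percentage(counts["neural network"], decided)
print(f"neural network, ties excluded: {neural_share_of_decided}%")
```

Leaving the ties aside, the neural network translation was chosen in roughly 60% of the decided pairs.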

Which is better – statistical or neural network translation?

When scored by humans, the neural network method clearly outperformed the statistical method. This implies that the neural network produces better results according to human evaluation than would be indicated by the automatic BLEU metric. This fact, which was known previously, is explained by the specific construction of the BLEU metric, which favours “locally correct” translations. Neural network translation is oriented more towards analysing the relationships between words that are more distant from each other.

What comes next?

The quality of machine translation is constantly improving. It can therefore be expected to win an increasing share of the market. Automatic translation will be used primarily for technical and specialist texts, while humans will remain indispensable for the translation of general texts or those of mixed type. In the case of specialist translation, humans will work mainly on the post-editing of texts proposed by computers.

Neural network translation will remain the dominant technology, at least for the next few years, and constant progress will be achieved through the continued development of neural network architecture.
