In the context of this article, a metric will be understood as a measurement. A metric for the evaluation of machine translation output is a measurement of the quality of the output. The quality of a translation is inherently subjective; there is no objective or quantifiable “good”. Therefore, the task for any metric is to assign scores of quality in such a way that they correlate with human judgement of quality. That is, a metric should give high scores to those translations which humans score highly, and low scores to those which humans score poorly. Human judgement is used as the benchmark for assessing automatic metrics because humans are the end users of any translation output.
Metrics are themselves evaluated by their correlation with human judgement. This is generally done at two levels: at the sentence level, where the metric calculates scores for a set of translated sentences and these scores are correlated against human judgements of the same sentences; and at the corpus level, where the scores over the sentences are aggregated for both human judgements and metric judgements, and these aggregate scores are then correlated. Figures for correlation at the sentence level are rarely reported, although Banerjee et al. (2005) do give correlation figures which show that, at least for their metric, sentence-level correlation is substantially worse than corpus-level correlation.
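As a concrete illustration of the two levels, the sketch below computes Pearson correlation between human judgements and metric scores, first over individual sentences and then over per-system aggregates. The system names, the scores, and the use of a simple mean for aggregation are hypothetical assumptions for illustration, not a prescribed evaluation procedure.

```python
# A minimal sketch of sentence-level vs. corpus-level correlation.
# The systems and scores below are hypothetical placeholders.
from statistics import mean
from scipy.stats import pearsonr

# Hypothetical human adequacy judgements (1-5) and automatic metric scores (0-1)
# for three MT systems over the same five source sentences.
human_scores = {
    "sysA": [4, 3, 5, 2, 4],
    "sysB": [2, 3, 3, 1, 2],
    "sysC": [5, 4, 4, 3, 5],
}
metric_scores = {
    "sysA": [0.61, 0.55, 0.72, 0.40, 0.63],
    "sysB": [0.35, 0.48, 0.44, 0.28, 0.37],
    "sysC": [0.74, 0.66, 0.69, 0.52, 0.71],
}

# Sentence level: pair every per-sentence metric score with the human
# judgement for the same sentence and correlate directly.
h_flat = [s for sys in human_scores for s in human_scores[sys]]
m_flat = [s for sys in metric_scores for s in metric_scores[sys]]
r_sentence, _ = pearsonr(h_flat, m_flat)

# Corpus level: aggregate the per-sentence scores for each system first
# (here by a simple mean), then correlate the aggregate scores.
systems = list(human_scores)
h_agg = [mean(human_scores[sys]) for sys in systems]
m_agg = [mean(metric_scores[sys]) for sys in systems]
r_corpus, _ = pearsonr(h_agg, m_agg)

print(f"sentence-level Pearson r = {r_sentence:.2f}")
print(f"corpus-level Pearson r   = {r_corpus:.2f}")
```

Because aggregation averages out much of the per-sentence noise, corpus-level correlations generally come out higher than sentence-level ones, which is consistent with the observation above.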
While not widely reported, it has been noted that the genre, or domain, of a text has an effect on the correlation obtained when using metrics. Coughlin (2003) reports that comparing the candidate text against a single reference translation does not adversely affect the correlation of metrics when working with restricted-domain text.
Even if a metric is shown to correlate well with human judgement in one study, on one corpus, it does not follow that this correlation will carry over to another corpus. Good performance across text types or domains is important for the reusability of a metric. A metric that works only for text in a specific domain is useful, but less useful than one that works across many domains, because having to create a new metric for every new evaluation or domain is undesirable.
Another important factor in the usefulness of an evaluation metric is good correlation even when working with small amounts of data, that is, small numbers of candidate sentences and reference translations. Turian et al. (2003) point out that “Any MT evaluation measure is less reliable on shorter translations”, and show that increasing the amount of data improves the reliability of a metric. However, they add that “… reliability on shorter texts, as short as one sentence or even one phrase, is highly desirable because a reliable MT evaluation measure can greatly accelerate exploratory data analysis”.[6]
Banerjee et al. (2005) highlight five attributes that a good automatic metric must possess: correlation, sensitivity, consistency, reliability and generality. Any good metric must correlate highly with human judgement. It must be consistent, giving similar results for the same MT system on similar text. It must be sensitive to differences between MT systems, and reliable in that MT systems that score similarly should be expected to perform similarly. Finally, the metric must be general, that is, it should work with different text domains and in a wide range of scenarios and MT tasks.
The aim of this subsection is to give an overview of the state of the art in automatic metrics for evaluating machine translation.[7]