The METEOR metric is designed to address some of the deficiencies inherent in the BLEU metric. It is based on the weighted harmonic mean of unigram precision and unigram recall. The metric was designed after research by Lavie (2004) into the significance of recall in evaluation metrics. That research showed that metrics based on recall consistently achieved higher correlation with human judgments than metrics based on precision alone, such as BLEU and NIST.
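As a rough illustration of the scoring idea, the following Python sketch computes unigram precision, unigram recall, and their weighted harmonic mean. The function names, the whitespace tokenisation, and the default recall weight `alpha=0.9` are assumptions made for this example; the text above does not specify the weighting, and the actual metric counts matches through a word alignment rather than the simple clipped counts used here.

```python
from collections import Counter


def unigram_precision_recall(candidate, reference):
    """Return (precision, recall) over whitespace-separated unigrams.

    Matches are counted with clipped counts, a simplification of the
    alignment step used by the real metric.
    """
    cand = candidate.split()
    ref = reference.split()
    # Multiset intersection: each reference word can be matched at most once.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall


def weighted_harmonic_mean(precision, recall, alpha=0.9):
    """Weighted harmonic mean with weight alpha on recall (assumed value).

    1/F = alpha/R + (1 - alpha)/P, i.e. F = P*R / (alpha*P + (1 - alpha)*R).
    """
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


p, r = unigram_precision_recall("the cat sat", "the cat sat on the mat")
f_mean = weighted_harmonic_mean(p, r)
```

With `alpha` close to 1 the score is dominated by recall, so a translation that omits reference words is penalised more heavily than one that adds extra words; `alpha=0.5` reduces to the ordinary balanced F-measure.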
METEOR also includes features not found in other metrics, such as synonymy matching: instead of matching only on the exact word form, the metric also matches on synonyms. For example, if the word “good” appears in the reference and the word “well” appears in the translation, this is counted as a match. The metric also includes a stemmer, which reduces words to their stems and matches on the stemmed forms. The implementation is modular: the algorithms that match words are implemented as modules, and new modules that implement different matching strategies can easily be added.
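The modular matching design can be sketched as a list of interchangeable matcher functions applied in order. Everything in this Python example is hypothetical and for illustration only: the function names, the toy suffix-stripping stemmer, the one-entry synonym table, and the greedy left-to-right pairing are all assumptions, not the metric's actual modules or alignment procedure.

```python
def exact_match(a, b):
    """Match only identical word forms."""
    return a == b


def toy_stem(word):
    """Toy stemmer (assumption): strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_match(a, b):
    """Match words whose toy stems are identical."""
    return toy_stem(a) == toy_stem(b)


# Toy synonym table (assumption); a real module would consult a thesaurus.
TOY_SYNONYMS = {frozenset({"good", "well"})}


def synonym_match(a, b):
    """Match words listed as synonyms in the toy table."""
    return frozenset({a, b}) in TOY_SYNONYMS


# Adding a new matching strategy means appending another function here.
MATCHERS = [exact_match, stem_match, synonym_match]


def align(candidate, reference, matchers=MATCHERS):
    """Greedily pair candidate and reference words, one matcher at a time.

    Returns a list of (candidate_index, reference_index, matcher_name).
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    matched_cand, matched_ref, pairs = set(), set(), []
    for matcher in matchers:
        for i, c in enumerate(cand):
            if i in matched_cand:
                continue
            for j, r in enumerate(ref):
                if j not in matched_ref and matcher(c, r):
                    pairs.append((i, j, matcher.__name__))
                    matched_cand.add(i)
                    matched_ref.add(j)
                    break
    return pairs


pairs = align("he plays well", "he played good")
```

Because each matcher has the same two-word signature, a stricter or looser strategy can be swapped in without touching `align`; words left unmatched by an earlier module are simply passed on to the next one.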
- Lavie (2004)