One part of the ALPAC report was a study comparing different levels of human translation with machine translation (MT) output, using human subjects as judges. The human judges were specially trained for the purpose. The evaluation study compared an MT system translating from Russian into English with human translators, on two variables.
The variables studied were “intelligibility” and “fidelity”. Intelligibility was a measure of how “understandable” the sentence was, and was measured on a scale of 1 to 9. Fidelity was a measure of how much information the translated sentence retained compared to the original, and was measured on a scale of 0 to 9. Each point on the scale was associated with a textual description. For example, 3 on the intelligibility scale was described as “Generally unintelligible; it tends to read like nonsense but, with a considerable amount of reflection and study, one can at least hypothesize the idea intended by the sentence”.
Intelligibility was measured without reference to the original, while fidelity was measured indirectly. The judge was first shown the translated sentence and, after reading it and absorbing its content, was shown the original sentence. The judge was then asked to rate the original sentence on informativeness, that is, on how much it added to what the translation had already conveyed. So, the more informative the original sentence, the lower the quality of the translation.
The study showed that the two variables were highly correlated when the human judgements were averaged per sentence. The variation among raters was small, but the researchers recommended that at least three or four raters be used. The evaluation methodology easily separated translations by humans from translations by machines.
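The per-sentence averaging and correlation described above can be illustrated with a minimal Python sketch. It is not the ALPAC procedure itself: the ratings, the sentence identifiers and the function names are invented for illustration, and the Pearson coefficient is assumed as the correlation measure. Because the second rating is the informativeness of the original, which runs opposite to translation quality, a close relationship between the two variables shows up here as a strong negative correlation.

```python
from statistics import mean


def per_sentence_means(ratings):
    """Average each sentence's scores across raters.

    `ratings` maps a sentence id to the list of scores from each rater.
    """
    return {sid: mean(scores) for sid, scores in ratings.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (var_x * var_y) ** 0.5


# Invented ratings from three hypothetical raters (not ALPAC data).
# Intelligibility of the translation, scale 1 to 9.
intelligibility = {
    "s1": [8, 7, 8],
    "s2": [3, 4, 3],
    "s3": [6, 5, 6],
    "s4": [2, 2, 3],
}
# Informativeness of the original after reading the translation, scale 0 to 9.
informativeness = {
    "s1": [1, 2, 1],
    "s2": [7, 6, 7],
    "s3": [3, 4, 3],
    "s4": [8, 8, 7],
}

intel_means = per_sentence_means(intelligibility)
info_means = per_sentence_means(informativeness)

# Pair the two averaged variables sentence by sentence.
sentence_ids = sorted(intel_means)
r = pearson(
    [intel_means[s] for s in sentence_ids],
    [info_means[s] for s in sentence_ids],
)
print(f"correlation between averaged ratings: {r:.3f}")
```

Averaging over raters before correlating mirrors the per-sentence averaging reported in the study; correlating the raw individual ratings instead would also reflect disagreement among raters.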
The study concluded that “highly reliable assessments can be made of the quality of human and machine translations”.