Language identification

Language identification is the process of determining the natural language in which a given piece of content is written. Traditionally, identification of written language (as practiced, for instance, in library science) has relied on manually spotting frequent words and letters known to be characteristic of particular languages. More recently, computational approaches have been applied to the problem by treating language identification as a kind of text categorization, a natural language processing approach that relies on statistical methods.
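To make the text-categorization view concrete, here is a minimal sketch of one possible statistical approach: a Naive Bayes-style classifier over character trigrams with add-one smoothing and a uniform prior over languages. The training sentences, the function names (char_ngrams, train, identify) and the choice of trigrams are all assumptions made for illustration; they do not come from any particular tool, and a practical system would be trained on far larger corpora.

```python
import math
from collections import Counter

# Tiny, made-up training samples (an assumption for illustration only);
# a real system would train on large corpora for each language.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "is in the house with the children",
    "french": "le renard brun saute par dessus le chien paresseux et le chat "
              "est dans la maison avec les enfants",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "die katze ist im haus mit den kindern",
}


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Count n-grams per language and record the size of the shared vocabulary."""
    counts = {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}
    vocab = set().union(*counts.values())
    return {"n": n, "counts": counts, "vocab_size": len(vocab)}


def identify(text, model):
    """Return the language with the highest add-one-smoothed log-likelihood."""
    scores = {}
    for lang, counts in model["counts"].items():
        # Add-one smoothing: unseen n-grams get a small non-zero probability.
        denom = sum(counts.values()) + model["vocab_size"]
        scores[lang] = sum(
            math.log((counts[gram] + 1) / denom)
            for gram in char_ngrams(text, model["n"])
        )
    return max(scores, key=scores.get)


model = train(SAMPLES)
for sentence in ("the dog is in the house",
                 "le chat est dans la maison",
                 "die katze ist im haus"):
    print(sentence, "->", identify(sentence, model))
```

Character n-grams are used here rather than whole words because they capture the characteristic letter sequences of a language even in short inputs; with training data this small, the sketch is only reliable on text that resembles the samples.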




  • PetaMem Language Identification: ngram, nvect and smart methods
  • Links to LID tools by Gertjan van Noord
  • Implementation of an n-gram based LID tool in Python and Scheme by Damir Cavar
  • Xerox Language Identifier
  • What Language Is This? Online language identification tool written in JavaScript

This guide is licensed under the GNU Free Documentation License. It uses material from Wikipedia.
