May 26, 2018

Trained module to guess a document’s language

Text::Language::Guess guesses a document’s language. Its implementation is simple: using “Text::ExtractWords” and “Lingua::StopWords” from CPAN, it counts how many known stopwords the document contains for each language that “Lingua::StopWords” supports.

Each word in the document that is recognized as a stopword of a particular language scores one point for that language.

The “language_guess” function takes a document as a parameter and returns the abbreviation of the language that it is most likely written in.
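
The scoring scheme described above can be sketched in a few lines. The following is a minimal illustration in Python, not the module's own Perl code. The tiny stopword lists, the helper name language_scores, the choice to accept a text string, and the tie-breaking behavior are all assumptions made for this example. The real module takes its stopwords from Lingua::StopWords.

```python
import re
from collections import Counter

# Hypothetical, heavily truncated stopword lists, for illustration only.
# The module described above gets its lists from Lingua::StopWords instead.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "it", "that"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "les", "et", "est", "un", "une", "que"},
}


def language_scores(text):
    """Return a Counter mapping language code to stopword hits (assumed helper)."""
    # Split the lowercased text into runs of letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = Counter()
    for word in words:
        for lang, stopwords in STOPWORDS.items():
            if word in stopwords:
                # One point per recognized stopword occurrence.
                scores[lang] += 1
    return scores


def language_guess(text):
    """Return the code of the highest-scoring language, or None if nothing matched."""
    scores = language_scores(text)
    if not scores:
        return None
    # On a tie, the language that scored first wins (an arbitrary choice here).
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(language_guess("The cat is in the garden and it sleeps"))
    print(language_guess("Der Hund ist nicht in dem Haus und die Katze ist zu klein"))
```

Because a word such as “in” may be a stopword in more than one language, a single word can score for several languages at once. The guess relies on the totals over the whole document, so very short texts are the least reliable input.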

