Libtextcat

Jul 20, 2023

Language guessing by N-Gram-Based Text Categorization

Libtextcat is a library that implements the classification technique described in Cavnar & Trenkle, “N-Gram-Based Text Categorization” [1]. It was developed primarily for language guessing, a task on which the technique is known to perform with near-perfect accuracy.

The central idea of the Cavnar & Trenkle technique is to calculate a “fingerprint” of a document whose category is unknown and compare it with the fingerprints of a number of documents whose categories are known. The categories of the closest matches are output as the classification. A fingerprint is a list of the most frequent n-grams occurring in a document, ordered by frequency. Fingerprints are compared with a simple out-of-place metric: for each n-gram, the difference between its rank in one fingerprint and its rank in the other.
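
The idea described above can be sketched in a few lines of Python. This is a minimal illustration, not libtextcat's actual C API: the function names `fingerprint`, `out_of_place` and `classify` are hypothetical, and the choices of n-gram lengths 1 to 5, a 300-entry fingerprint, and a penalty equal to the category fingerprint length for missing n-grams are assumptions made for the example.

```python
import re
from collections import Counter


def fingerprint(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first.

    max_n and top_k are assumed defaults for this sketch.
    """
    counts = Counter()
    # Split into runs of letters and pad each word with "_" markers.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_fp, cat_fp):
    """Sum of rank differences between two fingerprints.

    An n-gram missing from cat_fp gets a fixed penalty equal to
    len(cat_fp) (an assumption of this sketch).
    """
    rank = {gram: i for i, gram in enumerate(cat_fp)}
    max_penalty = len(cat_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )


def classify(text, category_fps):
    """Return the category whose fingerprint is closest to the text's."""
    doc_fp = fingerprint(text)
    return min(category_fps, key=lambda name: out_of_place(doc_fp, category_fps[name]))
```

In use, one would build `category_fps` as a dictionary mapping each language name to the fingerprint of a sample text in that language, then call `classify(unknown_text, category_fps)`. Longer sample texts generally give more reliable fingerprints.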

[1] The document that started it all: William B. Cavnar & John M. Trenkle (1994), “N-Gram-Based Text Categorization”, <http://citeseer.ist.psu.edu/68861.html>.

Check out these related ports:

Zxing-cpp - ZXing C++ Library for QR code recognition
Zu-hunspell - Zulu hunspell dictionaries
Zu-aspell - Aspell Zulu dictionary
Zq - Easier and faster alternative to jq
Zorba - General purpose C++ XQuery processor
Zenxml - Simple C++ XML Processing
Zed - Command-line tool to manage and query Zed data lakes
Yq - Command-line YAML and XML processor, jq wrapper for YAML/XML documents
Yould - Pronounceable word generator
Yodl - Easy to use but powerful document formatting/preparation language
Yi-hunspell - Yiddish hunspell dictionaries
Yi-aspell - Aspell Yiddish dictionary
Yelp-xsl - DocBook XSLT stylesheets for yelp
Yelp-tools - Utilities to help manage documentation for Yelp and the web
Ydiff - Diff readability enhancer for color terminals
