Suffix Arrays

I developed two software packages that use a suffix array data structure to identify n-grams in a corpus.

The Suffix Array Package is a suite of programs for analyzing n-grams in large text files. The programs are based on the suffix array implementation laid out by Church and Yamamoto in "Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus", Computational Linguistics, 2001.

Array::Suffix is a suite of Perl modules that retrieves variable-length n-grams from large datasets using the suffix array algorithm described by Church and Yamamoto.
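
To illustrate the general idea of suffix-array-based n-gram counting, here is a minimal Python sketch. It is not the code of either package, and it does not reproduce the efficient construction from the paper: the function names build_suffix_array and count_ngrams are hypothetical, and the array is built with a naive sort of token suffixes. It shows only that, once suffixes are sorted, identical n-grams sit next to each other and can be counted in one pass.

```python
def build_suffix_array(tokens):
    """Return the start positions of all token suffixes in sorted order.

    Naive construction (sorts whole suffixes); fine for a small
    illustration, too slow for a large corpus.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngrams(tokens, n):
    """Count n-grams by scanning the suffix array once.

    Suffixes that share their first n tokens are adjacent in the sorted
    order, so each distinct n-gram forms one contiguous run.
    """
    runs = []  # list of [ngram, count], in sorted n-gram order
    for start in build_suffix_array(tokens):
        gram = tuple(tokens[start:start + n])
        if len(gram) < n:
            continue  # suffix too short to contain an n-gram
        if runs and runs[-1][0] == gram:
            runs[-1][1] += 1  # same n-gram as the previous suffix
        else:
            runs.append([gram, 1])  # a new run starts here
    return [(gram, count) for gram, count in runs]


if __name__ == "__main__":
    corpus = "to be or not to be".split()
    for gram, count in count_ngrams(corpus, 2):
        print(" ".join(gram), count)
```

Because the same sorted array serves every value of n, variable-length n-grams can be read off by calling count_ngrams with different lengths without re-sorting the corpus in a real implementation; this sketch re-sorts on each call only to stay short.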


Text::Positional::Ngram is a Perl implementation of the algorithm described in "Using Masks, Suffix Array-based Data Structures and Multidimensional Arrays to Compute Positional Ngram Statistics from Corpora" by Gil and Dias. The mask-based implementation retrieves all contiguous and non-contiguous (positional) n-grams in a corpus. This module is also available on CPAN.
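
As a rough illustration of what a mask does, the Python sketch below enumerates positional n-grams by brute force. It is a simplification under stated assumptions, not the Gil and Dias data structure and not the Text::Positional::Ngram interface: the names masks and positional_ngrams, the max_span parameter, and the "_" gap marker are all hypothetical. A mask is a tuple of 1s (keep the word) and 0s (leave a gap) whose first and last positions are always kept; an all-ones mask yields an ordinary contiguous n-gram.

```python
from collections import Counter
from itertools import product


def masks(max_span):
    """Yield every keep/gap mask spanning 2..max_span tokens.

    The first and last slots are always kept, so no two masks describe
    the same pattern at different widths.
    """
    for span in range(2, max_span + 1):
        for inner in product((1, 0), repeat=span - 2):
            yield (1, *inner, 1)


def positional_ngrams(tokens, max_span):
    """Count contiguous and gapped n-grams within max_span tokens.

    Gaps are written as "_" (assumes "_" is not itself a corpus token).
    """
    counts = Counter()
    for mask in masks(max_span):
        span = len(mask)
        for start in range(len(tokens) - span + 1):
            window = tokens[start:start + span]
            gram = tuple(w if keep else "_" for w, keep in zip(window, mask))
            counts[gram] += 1
    return counts


if __name__ == "__main__":
    corpus = "a b c a x c".split()
    for gram, count in sorted(positional_ngrams(corpus, 3).items()):
        print(" ".join(gram), count)
```

In this toy corpus the gapped pattern "a _ c" occurs twice even though no contiguous trigram repeats, which is the kind of regularity positional n-grams are meant to expose.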