Suffix Arrays

I have implemented the suffix array data structure to identify ngrams in a corpus. The first package is written in C++ and is a modification of the code described (and provided) by Yamamoto and Church in "Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus".

Suffix Array Package

The second package is a Perl implementation of suffix arrays that differs substantially from the first. It is available on CPAN.

Array::Suffix
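
To illustrate the general idea behind both packages, here is a minimal Python sketch of counting ngrams with a token-level suffix array. It is not the code of either package: the function names are hypothetical, and the suffix array is built with a naive sort rather than the more efficient construction a real implementation would use. It relies only on the fact that, once the suffixes are sorted, all occurrences of the same ngram sit next to each other.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes of `tokens`,
    sorted lexicographically (naive construction, for illustration only)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def ngram_counts(tokens, n):
    """Count every ngram of length `n` by scanning the sorted suffixes.

    Suffixes that share their first `n` tokens form one contiguous run
    in the suffix array, so a single pass is enough to count them.
    """
    counts = []
    previous = None
    for start in build_suffix_array(tokens):
        gram = tuple(tokens[start:start + n])
        if len(gram) < n:
            continue  # suffix too short to contain an ngram of length n
        if gram == previous:
            counts[-1][1] += 1  # same run as the previous suffix
        else:
            counts.append([gram, 1])  # a new run starts here
            previous = gram
    return [(gram, freq) for gram, freq in counts]


if __name__ == "__main__":
    corpus = "to be or not to be".split()
    for gram, freq in ngram_counts(corpus, 2):
        print(" ".join(gram), freq)
```

On the six-token example above, the bigram "to be" is reported with frequency 2 and every other bigram with frequency 1.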


Masks


Masks is a Perl implementation of the algorithm described by Gil and Dias in "Using Masks, Suffix Array-based Data Structures and Multidimensional Arrays to Compute Positional Ngram Statistics from Corpora". It retrieves all contiguous and non-contiguous (positional) ngrams in a corpus. This module is also available on CPAN.

Text::Positional::Ngram
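
As a rough illustration of what a positional ngram is, the following Python sketch enumerates them with binary masks over a sliding window. It is a simplified brute-force version, not the module's code and not the suffix array-based procedure of the paper. The names `GAP`, `masks` and `positional_ngrams` are hypothetical, and the requirement that every mask keeps its first and last position is an assumption made here so that each pattern is generated only once.

```python
from collections import Counter
from itertools import product

GAP = "_"  # hypothetical placeholder for a skipped position


def masks(span):
    """All 0/1 masks of length `span` that keep the first and last position."""
    if span == 1:
        return [(1,)]
    return [(1, *middle, 1) for middle in product((0, 1), repeat=span - 2)]


def positional_ngrams(tokens, max_span):
    """Count contiguous and non-contiguous ngrams spanning up to `max_span` tokens.

    A mask of all ones yields an ordinary contiguous ngram; a zero in the
    mask replaces the token at that position with GAP.
    """
    counts = Counter()
    for span in range(1, max_span + 1):
        for mask in masks(span):
            for start in range(len(tokens) - span + 1):
                window = tokens[start:start + span]
                gram = tuple(tok if keep else GAP
                             for tok, keep in zip(window, mask))
                counts[gram] += 1
    return counts


if __name__ == "__main__":
    corpus = "to be or not to be".split()
    counts = positional_ngrams(corpus, 3)
    print(counts[("to", "be")])       # contiguous bigram
    print(counts[("to", GAP, "or")])  # non-contiguous: "to _ or"
```

With a maximum span of 3 on this example, the contiguous bigram "to be" occurs twice and the positional ngram "to _ or" occurs once.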