Suffix Arrays

I have written two packages that use the suffix array data structure to identify ngrams in a corpus. The first is written in C++ and is a modification of the code described (and provided) by Yamamoto and Church in "Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus".

Suffix Array Package

The second is a Perl implementation of suffix arrays that differs substantially from the C++ package. It is available on CPAN.

Array::Suffix
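For readers unfamiliar with the idea, here is a minimal Python sketch of how a suffix array can be used to count ngrams. It is not the code of either package above: the function names `build_suffix_array` and `count_ngrams` are made up for this illustration, and the construction is the naive sort-based one rather than the faster methods a real implementation would use. The point it shows is that once the suffixes are sorted, all occurrences of the same ngram sit next to each other, so one pass over the array is enough to count them.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes of `tokens`, in sorted order.

    Naive construction: each comparison may scan a whole suffix, so this is
    only suitable for small inputs.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngrams(tokens, n):
    """Count every ngram of length `n` by walking the sorted suffix array.

    Suffixes that share their first `n` tokens are adjacent in the array,
    so each distinct ngram forms one contiguous run.
    """
    counts = []   # list of [ngram, frequency], in sorted ngram order
    previous = None
    for start in build_suffix_array(tokens):
        ngram = tuple(tokens[start:start + n])
        if len(ngram) < n:
            continue  # this suffix is too short to contain an ngram of length n
        if ngram == previous:
            counts[-1][1] += 1
        else:
            counts.append([ngram, 1])
            previous = ngram
    return [(ngram, freq) for ngram, freq in counts]
```

For the token list `"to be or not to be".split()`, `count_ngrams(tokens, 2)` reports the bigram `("to", "be")` twice and every other bigram once.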


Masks


Masks is a Perl implementation of the algorithm described in "Using Masks, Suffix Array-based Data Structures and Multidimensional Arrays to Compute Positional Ngram Statistics from Corpora" by Gil and Dias. It retrieves all contiguous and non-contiguous (positional) ngrams in a corpus. This module is also available on CPAN.

Text::Positional::Ngram
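To make the notion of a positional ngram concrete, here is a small Python sketch. It is a simplified illustration, not the Gil and Dias algorithm and not the interface of the module above: it enumerates ngrams by brute force instead of using suffix arrays, and the names `masks` and `positional_ngrams`, the `_` gap marker, and the rule that every mask starts and ends with a kept word are all assumptions made for this example. A mask is a sequence of 0s and 1s laid over a window of text, where 1 keeps the word and 0 replaces it with a gap.

```python
from collections import Counter
from itertools import product


def masks(window):
    """Yield every 0/1 mask of length `window` that starts and ends with 1.

    Fixing both ends (an assumption of this sketch) avoids counting the
    same ngram again with extra leading or trailing gaps.
    """
    if window == 1:
        yield (1,)
        return
    for middle in product((0, 1), repeat=window - 2):
        yield (1, *middle, 1)


def positional_ngrams(tokens, max_window, gap="_"):
    """Count contiguous and non-contiguous ngrams over windows up to `max_window`.

    Assumes the `gap` marker does not itself occur as a token.
    """
    counts = Counter()
    for window in range(1, max_window + 1):
        for mask in masks(window):
            for start in range(len(tokens) - window + 1):
                span = tokens[start:start + window]
                ngram = tuple(tok if keep else gap for tok, keep in zip(span, mask))
                counts[ngram] += 1
    return counts
```

With the tokens `"a b c a x c".split()` and `max_window=3`, the non-contiguous ngram `("a", "_", "c")` is counted twice, while the contiguous trigram `("a", "b", "c")` is counted once.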

The views and opinions expressed in this page are strictly those of the page author.
The contents of this page have not been reviewed or approved by the University of Minnesota.