I have implemented the suffix array data structure to identify ngrams
in a corpus. The first package is written in C++ and is a modification
of the code described (and provided) by Yamamoto and Church in "Using
Suffix Arrays to Compute Term Frequency and Document Frequency for All
Substrings in a Corpus".
Suffix Array Package
The second is a Perl implementation of suffix arrays that differs
substantially from the C++ package. It is available on CPAN.
Array::Suffix
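To illustrate the general idea behind both packages, here is a minimal Python sketch of how a suffix array can be used to count ngrams. It is not taken from either package: the function names are hypothetical, the corpus is assumed to be already tokenized into a list of words, and the construction is a naive sort rather than an efficient suffix sorting algorithm. The point it shows is that once the suffixes are sorted, all occurrences of the same ngram sit next to each other, so frequencies can be read off by counting runs.

```python
from itertools import groupby


def build_suffix_array(tokens):
    """Return suffix start positions, ordered by the suffix they begin.

    Naive construction for illustration only: each comparison may scan
    a whole suffix, so this does not scale to large corpora.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def ngram_frequencies(tokens, n):
    """Count every ngram of length n by grouping adjacent sorted suffixes."""
    suffix_array = build_suffix_array(tokens)
    # Keep only suffixes long enough to contain an ngram of length n,
    # and truncate each one to its first n tokens.
    prefixes = (
        tuple(tokens[i:i + n]) for i in suffix_array if i + n <= len(tokens)
    )
    # Suffixes sharing the same first n tokens are contiguous in sorted
    # order, so each run of equal prefixes is one ngram type.
    return {gram: sum(1 for _ in run) for gram, run in groupby(prefixes)}


if __name__ == "__main__":
    corpus = "to be or not to be".split()
    for gram, count in ngram_frequencies(corpus, 2).items():
        print(" ".join(gram), count)
```

The same sorted order serves every value of n, which is why a single suffix array is enough to retrieve ngrams of all lengths.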
Masks
Masks is a Perl implementation based on the algorithm described in "Using
Masks, Suffix Array-based Data Structures and Multidimensional Arrays to
Compute Positional Ngram Statistics from Corpora" by Gil and Dias. It
retrieves all contiguous and non-contiguous (positional) ngrams in a
corpus. This module is also available on CPAN.
Text::Positional::Ngram
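As a rough illustration of what contiguous and non-contiguous (positional) ngrams look like, the Python sketch below enumerates them with bit masks. It is a simplified brute-force version under stated assumptions, not the Gil and Dias algorithm and not the module's interface: the function names, the `GAP` placeholder, and the choice to restrict masks to those that start and end with a selected word are all hypothetical. A mask such as (1, 0, 1) keeps the first and third word of a three-word window and records a gap in between.

```python
from collections import Counter
from itertools import product

GAP = "_"  # hypothetical placeholder marking a skipped position


def masks(window):
    """Yield every bit mask of length 2..window that starts and ends with 1.

    A mask made only of 1s describes a contiguous ngram; any 0 marks a gap.
    """
    for length in range(2, window + 1):
        for middle in product((0, 1), repeat=length - 2):
            yield (1, *middle, 1)


def positional_ngrams(tokens, window):
    """Count all masked ngrams found by sliding each mask over the tokens."""
    counts = Counter()
    for mask in masks(window):
        for start in range(len(tokens) - len(mask) + 1):
            gram = tuple(
                tokens[start + offset] if bit else GAP
                for offset, bit in enumerate(mask)
            )
            counts[gram] += 1
    return counts


if __name__ == "__main__":
    corpus = ["a", "b", "c", "a", "x", "c"]
    for gram, count in positional_ngrams(corpus, 3).most_common(3):
        print(" ".join(gram), count)
```

This direct enumeration revisits the corpus once per mask, which is the kind of cost that the suffix-array-based structures named in the paper title are meant to avoid.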