I have implemented two packages that use the suffix array data structure
to identify ngrams in a corpus. The first is written in C++ and is a
modification of the code described (and provided) by Yamamoto and Church in
"Using Suffix Arrays to Compute Term Frequency and Document Frequency for
All Substrings in a Corpus."
Suffix Array Package
The second is a Perl implementation of suffix arrays that is quite
different from the C++ package. This package is available on CPAN.
Array::Suffix
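As a rough illustration of why a suffix array helps with ngram counting, the sketch below builds a word-level suffix array and counts ngrams by scanning it. It is not taken from either package above: the function names (tokenize, build_suffix_array, count_ngrams) are hypothetical, and the simple comparison sort stands in for the more efficient construction a real implementation would use. The point it shows is that, once suffixes are sorted, all occurrences of the same ngram sit next to each other.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split a text on whitespace into word tokens.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string w; in >> w;) tokens.push_back(w);
    return tokens;
}

// Word-level suffix array: token positions ordered by the suffix that
// starts at each position (naive sort, for illustration only).
std::vector<std::size_t> build_suffix_array(const std::vector<std::string>& tokens) {
    std::vector<std::size_t> sa(tokens.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return std::lexicographical_compare(tokens.begin() + a, tokens.end(),
                                            tokens.begin() + b, tokens.end());
    });
    return sa;
}

// Count every ngram of length n. Suffixes sharing their first n tokens are
// adjacent in the suffix array, so one linear pass groups them.
std::vector<std::pair<std::string, int>> count_ngrams(
        const std::vector<std::string>& tokens, std::size_t n) {
    assert(n > 0);
    std::vector<std::pair<std::string, int>> counts;
    for (std::size_t start : build_suffix_array(tokens)) {
        if (start + n > tokens.size()) continue;  // suffix shorter than n
        std::string key = tokens[start];
        for (std::size_t i = 1; i < n; ++i) key += " " + tokens[start + i];
        if (!counts.empty() && counts.back().first == key) {
            ++counts.back().second;               // same ngram as previous suffix
        } else {
            counts.emplace_back(key, 1);          // first occurrence of a new ngram
        }
    }
    return counts;
}
```

Calling count_ngrams(tokenize("to be or not to be"), 2) under these assumptions yields the bigrams in sorted order, with "to be" counted twice and every other bigram once.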
Masks
Masks is a Perl implementation based on the algorithm described in "Using
Masks, Suffix Array-based Data Structures and Multidimensional Arrays to
Compute Positional Ngram Statistics from Corpora" by Gil and Dias. The
masks implementation retrieves all contiguous and non-contiguous (positional)
ngrams in a corpus. This module is also available on CPAN.
Text::Positional::Ngram
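To make "contiguous and non-contiguous (positional) ngrams" concrete, here is a small sketch that applies bit masks to a sliding window of words. It is only an illustration of the masking idea, not the Gil and Dias algorithm and not the interface of Text::Positional::Ngram. The names apply_mask and positional_ngrams, the "_" gap marker, and the limit on window size are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Render the words selected by `mask` in the window starting at `start`.
// Bit i of the mask selects tokens[start + i]; skipped positions between
// selected words are written as "_" (an assumed gap marker).
std::string apply_mask(const std::vector<std::string>& tokens,
                       std::size_t start, unsigned mask) {
    std::string out;
    for (std::size_t i = 0; (mask >> i) != 0; ++i) {
        if (i > 0) out += ' ';
        if ((mask >> i) & 1u) {
            out += tokens[start + i];
        } else {
            out += "_";
        }
    }
    return out;
}

// Enumerate all contiguous and non-contiguous ngrams that fit in a window
// of `window` words, anchored at each corpus position.
std::vector<std::string> positional_ngrams(const std::vector<std::string>& tokens,
                                           std::size_t window) {
    assert(window >= 1 && window <= 16);
    std::vector<std::string> out;
    for (std::size_t start = 0; start < tokens.size(); ++start) {
        // Odd masks only: bit 0 is always set, so each ngram begins at `start`
        // and no ngram is produced twice from different window positions.
        for (unsigned mask = 1; mask < (1u << window); mask += 2) {
            std::size_t span = 0;  // index of the highest set bit, plus one
            for (unsigned m = mask; m != 0; m >>= 1) ++span;
            if (start + span > tokens.size()) continue;  // runs past the corpus end
            out.push_back(apply_mask(tokens, start, mask));
        }
    }
    return out;
}
```

With the three-word corpus a b c and a window of 3, this sketch produces "a", "a b", "a _ c", "a b c", "b", "b c", and "c". The entry "a _ c" is the non-contiguous case, where the middle position is skipped.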
The views and opinions expressed in this page are strictly those of the page author.
The contents of this page have not been reviewed or approved by the University of Minnesota.