I developed two software packages which utilize a suffix array data structure to
identify ngrams in a corpus.
Suffix Array Package is a suite of programs that aids in analyzing n-grams in large text files. These programs are based on the suffix array implementation laid out by Church and Yamamoto in "Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus", Computational Linguistics, 2001.
Array::Suffix is a suite of Perl modules that retrieves variable-length ngrams from large datasets using the suffix array algorithm described by Church and Yamamoto.
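As a rough illustration of the suffix array idea (this is not code from either package, and the function names below are hypothetical), the following Python sketch sorts the suffix start positions of a tokenized text and counts n-grams by walking the sorted order. It assumes whitespace tokenization and uses a naive sort for clarity; it is not the construction used by Church and Yamamoto.

```python
def build_suffix_array(tokens):
    """Return suffix start positions sorted by the token sequence they begin.

    Naive construction for illustration only: each comparison may scan a
    whole suffix, so this does not scale to large corpora.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def ngram_counts(tokens, n):
    """Count every n-gram by scanning the suffix array in sorted order.

    Suffixes that share their first n tokens are adjacent in the array,
    so each n-gram forms one contiguous run.
    """
    counts = {}
    prev = None
    for start in build_suffix_array(tokens):
        gram = tuple(tokens[start:start + n])
        if len(gram) < n:
            continue  # suffix too short to contain an n-gram
        if gram == prev:
            counts[gram] += 1
        else:
            counts[gram] = 1
            prev = gram
    return counts


if __name__ == "__main__":
    text = "to be or not to be".split()
    for gram, freq in sorted(ngram_counts(text, 2).items()):
        print(" ".join(gram), freq)
```

Because one sorted array serves every value of n, the same structure can answer queries for ngrams of any length without rebuilding an index.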
The masks implementation is written in Perl and follows the algorithm described in
"Masks, Suffix Array-based Data Structures and Multidimensional Arrays to
Compute Positional Ngram Statistics from Corpora" by Gil and Dias. It
retrieves all contiguous and non-contiguous (positional)
ngrams in a corpus. This module is also available on CPAN.
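To make the notion of a positional ngram concrete, here is a minimal Python sketch. It is only an assumption about how a mask can be read: a tuple of 1s and 0s laid over a window of tokens, where 1 keeps the token and 0 leaves a gap. The function name and the "*" gap marker are hypothetical, and the sketch uses a plain sliding window rather than the suffix array-based structures of Gil and Dias.

```python
from collections import Counter


def positional_ngrams(tokens, mask):
    """Count the ngrams selected by `mask` at every window position.

    `mask` is a tuple of 1s and 0s as wide as the window: 1 keeps the
    token at that offset, 0 replaces it with the gap marker "*".
    """
    width = len(mask)
    counts = Counter()
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        gram = tuple(tok if keep else "*" for tok, keep in zip(window, mask))
        counts[gram] += 1
    return counts


if __name__ == "__main__":
    text = "a b c a d c".split()
    # Mask (1, 0, 1): keep the first and third token, skip the middle one.
    for gram, freq in positional_ngrams(text, (1, 0, 1)).most_common():
        print(" ".join(gram), freq)
```

A mask made only of 1s gives ordinary contiguous ngrams, so the contiguous case is a special case of the positional one.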