I developed two software packages that use a suffix array data structure to
identify n-grams in a corpus.
Suffix Array Package
is a suite of programs that aids in analyzing
n-grams in large text files. These programs are based on the
suffix array implementation laid out by Church and Yamamoto in
"Using Suffix Arrays to Compute Term Frequency and Document Frequency for All
Substrings in a Corpus," Computational Linguistics, 2001.
Array::Suffix
is a suite of Perl modules that retrieve variable-length n-grams
from large datasets using the suffix array algorithm described by
Church and Yamamoto.
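
The idea behind a suffix array can be illustrated briefly: once all suffixes of a token sequence are sorted, every occurrence of a given n-gram sits in adjacent positions, so frequencies can be read off in a single pass. The following is a minimal Python sketch of that idea only. It is not the interface of either package, the function names are hypothetical, and the naive sort is much slower than the construction described by Church and Yamamoto.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction for illustration: each comparison copies a slice.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def ngram_counts(tokens, n):
    """Count every n-gram by walking the sorted suffixes once.

    Suffixes that share their first n tokens are adjacent in the
    suffix array, so a running comparison with the previous n-gram
    is enough to group them.
    """
    counts = {}
    prev = None
    for start in build_suffix_array(tokens):
        gram = tuple(tokens[start:start + n])
        if len(gram) < n:
            continue  # this suffix is too short to hold an n-gram
        if gram == prev:
            counts[gram] += 1
        else:
            counts[gram] = 1
            prev = gram
    return counts


if __name__ == "__main__":
    corpus = "to be or not to be".split()
    bigrams = ngram_counts(corpus, 2)
    print(bigrams[("to", "be")])  # 2
```

Changing `n` reuses the same sorted order, which is what makes a suffix array convenient for variable-length n-grams.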
Masks
Text::Positional::Ngram
is a Perl implementation of the algorithm described in
"Using Masks, Suffix Array-based Data Structures and Multidimensional Arrays to
Compute Positional Ngram Statistics from Corpora" by Gil and Dias. The
masks implementation retrieves all contiguous and non-contiguous (positional)
n-grams in a corpus. This module is also available on CPAN.
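
To make "contiguous and non-contiguous" concrete, the sketch below enumerates positional n-grams by brute force with binary masks: a 1 keeps the token at that offset and a 0 leaves a gap, shown as `_`. This is an illustration in Python under assumptions of my own (masks start and end with 1, and the window size is a parameter). It does not use the suffix-array-based structures of Gil and Dias, and its function names are hypothetical, not those of the module.

```python
from collections import Counter
from itertools import product


def masks(window):
    """Yield bit masks of length 2..window that start and end with 1."""
    for length in range(2, window + 1):
        for middle in product((0, 1), repeat=length - 2):
            yield (1, *middle, 1)


def positional_ngrams(tokens, window):
    """Count the pattern produced by every mask at every corpus position."""
    counts = Counter()
    for mask in masks(window):
        for start in range(len(tokens) - len(mask) + 1):
            gram = tuple(
                tokens[start + i] if bit else "_"  # "_" marks a gap
                for i, bit in enumerate(mask)
            )
            counts[gram] += 1
    return counts


if __name__ == "__main__":
    corpus = "a b c a x c".split()
    counts = positional_ngrams(corpus, 3)
    print(counts[("a", "_", "c")])  # 2: matches "a b c" and "a x c"
```

The mask `(1, 0, 1)` is what lets `a b c` and `a x c` count as the same positional n-gram, while all-ones masks reproduce ordinary contiguous n-grams.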