ABSTRACT
Traditional association mining algorithms use a strict definition of
support that requires every item in a frequent itemset to occur in each
supporting transaction. In real-life datasets, this strict requirement
limits the recovery of frequent itemset patterns, which become
fragmented by random noise and other errors in the data. Hence, a number
of methods have been
proposed recently to discover approximate frequent itemsets in the
presence of noise. These algorithms use a relaxed definition of support
and additional parameters, such as row and column error thresholds, to
allow some degree of “error” in the discovered patterns. Although these
algorithms have been shown to find approximate frequent itemsets
successfully, a systematic and quantitative approach to evaluating the
patterns they generate has been lacking. In this paper, we
propose a comprehensive evaluation framework to compare different
approximate frequent pattern mining algorithms. The key idea is to
select the optimal parameters for each algorithm on a given dataset and
then compare the algorithms using the itemsets generated with those
optimal parameters. We also propose simple variations of some of the
existing algorithms by introducing an additional postprocessing step.
Subsequently, we applied the proposed evaluation framework to a wide
variety of synthetic datasets with varying amounts of noise, as well as
to a real dataset, to compare the existing algorithms and our proposed
variations. Source code and the datasets
used in this study are made publicly available.
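To make the relaxed notion of support concrete, the following is a minimal illustrative sketch, not taken from any specific algorithm in the paper: an itemset is accepted if each supporting transaction misses at most a row-error fraction of its items, every item is present in at least a column-error fraction of the supporting transactions, and enough transactions support it. The function name `approx_support` and the parameter names `eps_row`, `eps_col`, and `min_support` are our own illustrative choices.

```python
def approx_support(transactions, itemset, eps_row, eps_col, min_support):
    """Sketch of error-tolerant support with row/column thresholds.

    transactions: iterable of sets of items
    itemset: candidate itemset
    eps_row: max fraction of items a supporting transaction may miss
    eps_col: max fraction of supporting transactions an item may miss
    min_support: minimum number of supporting transactions
    """
    itemset = list(itemset)
    k = len(itemset)
    # Row constraint: a transaction supports the itemset if it contains
    # at least (1 - eps_row) of the k items.
    supporting = [t for t in transactions
                  if sum(i in t for i in itemset) >= (1 - eps_row) * k]
    if len(supporting) < min_support:
        return False
    # Column constraint: within the supporting transactions, each item
    # must be present in at least a (1 - eps_col) fraction of them.
    for i in itemset:
        if sum(i in t for t in supporting) < (1 - eps_col) * len(supporting):
            return False
    return True

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"d"}]
# With eps_row = 0.34, a transaction may miss one of the three items,
# so four of the five transactions support {a, b, c}.
print(approx_support(transactions, {"a", "b", "c"},
                     eps_row=0.34, eps_col=0.5, min_support=3))  # True
```

Setting `eps_row = eps_col = 0` recovers the strict (exact) definition of support, which is how the error thresholds relax the traditional requirement.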
Source code for all the algorithms implemented in this paper and all
eight synthetic datasets used in this study are available for download.
Due to space constraints, the paper reports results only on dataset 6
and dataset 8; results on the remaining datasets are likewise available
for download.