CONTENTS

This directory contains the datasets used in the following papers (among others). If you use the data, please cite these papers. Thank you.

The data is made available only for non-profit educational and private research use under the 'fair use' copyright rules. Requests for permission to use this material for any other purpose should be addressed to the respective copyright owners.

Any questions? Please contact Dan Boley.


Directories


doc-*   Document List with Labels, as well as the raw documents (zipped).
f*      F series Term Frequency Matrix and Word List.
j*      J series Term Frequency Matrix and Word List.
k*      K series Term Frequency Matrix and Word List.


Explanation


Exp     Term Frequency Matrix Dimensions         Selection
 #   F-series      J-series         K-series     Criteria
---  ---------    -----------    ------------    ----------------------------
 1   98 x 5623    185 x 10536    2340 x 21839    all words
 2   98 x 619     185 x 946      2340 x 7358     quantile filtering (top 25%)
 3   98 x 1239    185 x 1763     2340 x 8104     top 20+ words
 4   98 x 1432    185 x 2951                     top 5+ words plus
                                                 emphasized (HTML tag) words
 5   98 x 399     185 x 449      2340 x 1458     frequent item sets
 6   98 x 2641    185 x 5106                     all words with TF > 1
 7   98 x 1004    185 x 1328                     top 20+ & TF > 1
 8   98 x 827     185 x 1105                     top 15+ & TF > 1
 9   98 x 622     185 x 805                      top 10+ & TF > 1
10   98 x 332     185 x 474                      top 5+ & TF > 1

                                                 [ TF == Term Frequency ]

Format of Term Frequency Matrix

All matrix files are plain ASCII text containing integers separated by whitespace.

Files named matrix.out contain the full term frequency matrix, one line per row. The i-th line in the file contains all the word counts for the i-th document. The j-th entry on each line is the count for the j-th word.
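
As a rough illustration of the layout described above, the following Python sketch parses text in the matrix.out format into a list of rows. The function name parse_dense and the sample data are hypothetical and are not part of the dataset; the sketch assumes only what is stated above (integers separated by whitespace, one document per line).

```python
def parse_dense(text):
    """Parse whitespace-separated integers, one matrix row per line."""
    rows = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines, if any
            rows.append([int(token) for token in line.split()])
    return rows


# Made-up sample with 2 documents and 3 words (not taken from the dataset).
sample = "0 2 1\n3 0 0\n"
matrix = parse_dense(sample)

# Python lists are 0-based, so matrix[i][j] holds the count of the
# (j+1)-th word in the (i+1)-th document.
print(matrix)
```

To read an actual file, the same function could be applied to the file's contents, for example parse_dense(open("matrix.out").read()).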

Files named matrix.sparse contain the term frequency matrix in sparse format, one matrix entry per line. Each line contains three numbers:

        <DocNumber>   <WordNumber>   <Count>

All entries not listed are zero.
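
A minimal sketch of reading the sparse format in Python follows. The function name parse_sparse and the sample triples are hypothetical. The document and word numbers are kept exactly as they appear in the file, because the numbering base (0 or 1) is not stated here; any entry that is not listed is looked up as zero.

```python
def parse_sparse(text):
    """Parse '<DocNumber> <WordNumber> <Count>' lines into a dictionary."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:  # skip blank lines, if any
            continue
        doc, word, count = (int(field) for field in fields)
        counts[(doc, word)] = count
    return counts


# Made-up sample triples (not taken from the dataset).
sample = "1 2 2\n1 3 1\n2 1 3\n"
counts = parse_sparse(sample)

print(counts.get((1, 2), 0))  # an entry listed in the sample
print(counts.get((2, 2), 0))  # not listed, so treated as zero
```

A dictionary keyed by (document, word) pairs keeps memory use proportional to the number of listed entries, which matters for the larger matrices such as the 2340 x 21839 K series.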