The data is made available only for non-profit educational and private research use under the 'fair use' copyright rules. Requests for permission to use this material for any other purpose should be addressed to the respective copyright owners.
Any questions? Please contact Dan Boley.
doc-*   Document List with Labels, as well as the raw documents (zipped).
f*      F series Term Frequency Matrix and Word List.
j*      J series Term Frequency Matrix and Word List.
k*      K series Term Frequency Matrix and Word List.
Exp   Term Frequency Matrix Dimensions         Selection
 #    F-series    J-series      K-series       Criteria
---   ---------   -----------   ------------   ----------------------------
 1    98 x 5623   185 x 10536   2340 x 21839   all words
 2    98 x 619    185 x 946     2340 x 7358    quantile filtering (top 25%)
 3    98 x 1239   185 x 1763    2340 x 8104    top 20+ words
 4    98 x 1432   185 x 2951                   top 5+ words plus emphasized (HTML tag) words
 5    98 x 399    185 x 449     2340 x 1458    frequent item sets
 6    98 x 2641   185 x 5106                   all words with TF > 1
 7    98 x 1004   185 x 1328                   top 20+ & TF > 1
 8    98 x 827    185 x 1105                   top 15+ & TF > 1
 9    98 x 622    185 x 805                    top 10+ & TF > 1
10    98 x 332    185 x 474                    top 5+ & TF > 1

[ TF == Term Frequency ]
Files named matrix.out contain the full term frequency matrix, one line per row. The i-th line in the file contains all the word counts for the i-th document. The j-th entry on each line is the count for the j-th word.
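As a rough illustration of the dense layout described above, the following Python sketch reads such a file into a list of rows. The function name parse_dense_matrix is hypothetical, and the sketch assumes the counts on each line are whitespace-separated integers, which the description does not state explicitly.

```python
from typing import Iterable, List


def parse_dense_matrix(lines: Iterable[str]) -> List[List[int]]:
    """Parse a dense term frequency matrix, one document per line.

    Assumption: counts are whitespace-separated integers.
    """
    rows: List[List[int]] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rows.append([int(field) for field in fields])
    return rows


# Tiny made-up example: 2 documents, 3 words.
sample_lines = ["0 2 1", "3 0 0"]
matrix = parse_dense_matrix(sample_lines)
# matrix[i][j] is the count of word j in document i (0-based in Python).
```

For a real file, the same function could be applied to an open file handle, for example `with open("matrix.out") as fh: matrix = parse_dense_matrix(fh)`.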
Files named matrix.sparse contain the term frequency matrix in sparse format, one nonzero entry per line. Each line contains three numbers:
<DocNumber> <WordNumber> <Count>
All entries not listed are zero.
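The sparse triples can be read along the following lines. This is only a sketch: the names parse_sparse_matrix, to_dense, and index_base are hypothetical, and the default of 1-based document and word numbers is an assumption, since the description does not say where numbering starts. Check a file against the matching matrix.out before relying on it.

```python
from typing import Dict, Iterable, List, Tuple


def parse_sparse_matrix(
    lines: Iterable[str], index_base: int = 1
) -> Dict[Tuple[int, int], int]:
    """Parse '<DocNumber> <WordNumber> <Count>' lines into a dict.

    Assumption: numbering starts at index_base (1 by default); keys in
    the returned dict are shifted to 0-based (doc, word) pairs.
    """
    entries: Dict[Tuple[int, int], int] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        doc, word, count = (int(field) for field in fields)
        entries[(doc - index_base, word - index_base)] = count
    return entries


def to_dense(
    entries: Dict[Tuple[int, int], int], n_docs: int, n_words: int
) -> List[List[int]]:
    """Expand the sparse entries; every entry not listed stays zero."""
    dense = [[0] * n_words for _ in range(n_docs)]
    for (doc, word), count in entries.items():
        dense[doc][word] = count
    return dense


# Tiny made-up example: 2 documents, 3 words.
sample_lines = ["1 2 5", "2 1 3"]
entries = parse_sparse_matrix(sample_lines)
dense = to_dense(entries, n_docs=2, n_words=3)
```

The matrix dimensions are not stored in the sparse file itself in this sketch, so n_docs and n_words would be taken from the dimensions table for the experiment in question.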