TEXT CATEGORIZATION Corpora

The core of any Text Categorization (TC) experiment is its final accuracy and the possibility of comparing it against previous work. The Reuters corpus offers this possibility, as it has been widely used in TC research. Unfortunately, it is not easy to go from its downloadable format to the several versions used in the literature: Apte' split, Apte' split 90 categories, Apte' split 115 (or 135) categories, Apte' split 10 categories, Reuters-22173, and the Reuters Yang preparation (Reuters3). [Sebastiani, 2002] attempts to describe all Reuters versions, although it disagrees with [Yang, 1999] on the numbers of training and test documents in Reuters3. Another critical point is following the Apte' split preparation accurately: obtaining the exact number of documents for each category and for the final split usually takes a lot of time.

To help researchers approaching Text Categorization, we make the standard Apte' split available in an easy-to-process format. Each category is a directory, and each directory stores the files (one per document) associated with that category. Since Reuters contains unlabeled documents, we store all of them in the directory unknown. Document file names are increasing numbers (starting from 0) assigned across all categories, which enables fast document indexing. The training/test split is provided by two top-level directories (test and training).
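The directory layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name load_split is hypothetical, the latin-1 encoding is an assumption, and so is the idea that a file name appearing in more than one category directory denotes one document with several labels.

```python
from collections import defaultdict
from pathlib import Path


def load_split(split_dir):
    """Read one split directory (e.g. "training" or "test").

    Returns (texts, labels): texts maps a document file name to its
    content, labels maps the same file name to the set of category
    directories in which the file was found.
    """
    texts = {}
    labels = defaultdict(set)
    for category_dir in sorted(Path(split_dir).iterdir()):
        if not category_dir.is_dir():
            continue
        for doc_path in sorted(category_dir.iterdir()):
            if not doc_path.is_file():
                continue
            doc_id = doc_path.name
            # Assumption: the same file name in two category directories
            # is the same document carrying two labels.
            # Assumption: latin-1 is used because it decodes any byte.
            texts.setdefault(doc_id, doc_path.read_text(encoding="latin-1"))
            labels[doc_id].add(category_dir.name)
    return texts, dict(labels)
```

A call such as load_split("training") would then give the documents and their labels for the training side; unlabeled documents show up with the single label unknown.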

The same tedious corpus preparation problems also affect two other well-known corpora, Ohsumed and 20NewsGroups (see [Moschitti and Basili, 2004; Moschitti, 2003a; Moschitti, 2003b]), so we provide them in the same format as well. The corpus descriptions and download links follow:

Reuters-21578 collection, Apte' split (available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html). It includes 12,902 documents for 90 classes, with a fixed split between test and training data (3,299 vs. 9,603). This is the most widely used version, as confirmed by Table VI on page 38 of [Sebastiani, 2002]. To obtain the Reuters 10-category Apte' split from it, it is enough to select the 10 largest categories, i.e. Earn, Acquisition, Money-fx, Grain, Crude, Trade, Interest, Ship, Wheat and Corn.

Download Here

- 90 categories: according to the literature, e.g. [Joachims, 1997], these are the categories with at least one training and one test document. After the category selection, the exact number of training documents decreases to 9,598.

- 115 categories: according to the literature, e.g. [Sebastiani, 2002], these are the categories with at least one training document.
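The selection criteria for the 90-category, 115-category and 10-category versions can be sketched as follows. The function names are hypothetical, the inputs are assumed to be mappings from a document id to its set of category names, and measuring category size by the number of training documents is an assumption, since the text does not say how size is counted.

```python
from collections import Counter


def select_categories(train_labels, test_labels, require_test=True,
                      exclude=("unknown",)):
    """Categories with at least one training document and, when
    require_test is True, at least one test document as well."""
    train_cats = {c for cats in train_labels.values() for c in cats}
    train_cats -= set(exclude)
    if not require_test:
        return train_cats
    test_cats = {c for cats in test_labels.values() for c in cats}
    return train_cats & test_cats


def top_k_categories(train_labels, k=10, exclude=("unknown",)):
    """The k categories with the most training documents (assumption:
    size is measured on the training side)."""
    counts = Counter(
        c for cats in train_labels.values() for c in cats
        if c not in exclude
    )
    return [c for c, _ in counts.most_common(k)]
```

With require_test=True the first function follows the 90-category criterion, with require_test=False the 115-category one; top_k_categories corresponds to picking the 10 largest categories.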

Ohsumed collection (available at ftp://medir.ohsu.edu/pub/ohsumed): it includes medical abstracts from the MeSH categories of the year 1991. [Joachims, 1997] used the first 20,000 documents, divided into 10,000 for training and 10,000 for testing. The specific task was to categorize the 23 cardiovascular diseases categories. After selecting this category subset, the number of unique abstracts becomes 13,929 (6,286 for training and 7,643 for testing). As current computers can easily manage a larger number of documents, we make available all 34,389 cardiovascular diseases abstracts out of the 50,216 medical abstracts of the year 1991.

Download Here

- Cardiovascular diseases abstracts (in the first 20,000 abstracts of the year 1991)

- All Cardiovascular diseases abstracts (in all 50,216 abstracts of the year 1991)

- Category Description
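The preparation of the 20,000-abstract subset described above can be sketched as below. The function name and the record format are hypothetical, and taking the first half of the abstracts for training and the second half for testing is an assumption about the ordering, which the text does not spell out.

```python
def first_n_split(records, target_categories, n_first=20000):
    """records: list of (doc_id, set_of_categories) in corpus order.

    Takes the first n_first records, uses the first half for training
    and the second half for testing (assumed ordering), then keeps only
    documents labeled with at least one target category.
    """
    head = records[:n_first]
    half = len(head) // 2

    def keep(part):
        return [doc_id for doc_id, cats in part if cats & target_categories]

    return keep(head[:half]), keep(head[half:])
```

Filtering after the split is why the two sides end up with different sizes, as in the 6,286 vs. 7,643 figures reported above.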

20Newsgroups corpus (available at http://www.ai.mit.edu/people/jrennie/20Newsgroups/): it contains 19,997 articles for 20 categories taken from the Usenet newsgroups collection. We used only the subject and the body of each message. Some of the newsgroups are very closely related to each other (e.g. IBM computer system hardware / Macintosh computer system hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). This corpus differs from the previous ones because it has a larger vocabulary and its words typically have more meanings. Moreover, the writing style (e-mail dialogues) is very different from that of the other, more technical collections.
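Keeping only the subject and the body of a message can be done with Python's standard email module. This is a sketch under the assumption that the raw articles are RFC 822 style messages (headers, blank line, body); the function name subject_and_body is hypothetical and this is not necessarily how the distributed files were produced.

```python
from email import message_from_string


def subject_and_body(raw_message):
    """Return the Subject header followed by the message body,
    dropping every other header (From, Newsgroups, Path, ...)."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "").strip()
    if msg.is_multipart():
        # Join the payloads of all leaf parts of a multipart message.
        parts = [p.get_payload() for p in msg.walk() if not p.is_multipart()]
        body = "\n".join(parts)
    else:
        body = msg.get_payload()
    return subject + "\n" + body
```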

Download Here

- All documents, about 20,000 (there is no fixed split in the literature; the corpus has usually been evaluated with cross-validation)
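Since there is no fixed split, a cross-validation partition has to be generated by the experimenter. The sketch below shows a plain k-fold split; the function name, the number of folds and the random seed are all assumptions, not values taken from the cited work.

```python
import random


def k_fold_splits(doc_ids, k=10, seed=0):
    """Yield (train_ids, test_ids) pairs for k-fold cross-validation.

    Documents are shuffled once with a fixed seed so that the folds
    are reproducible; each document is tested exactly once.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test
```

Fixing the seed and reporting it alongside the results makes the partition reproducible by others.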

Additional information, as well as accuracy evaluations on the above corpora, can be found in the references below.

Corpus References

[Moschitti and Basili, 2004] Alessandro Moschitti and Roberto Basili. Complex Linguistic Features for Text Classification: a comprehensive study. In Proceedings of the 26th European Conference on Information Retrieval Research (ECIR 2004), Sunderland, U.K., 2004.

[Moschitti, 2003a] Alessandro Moschitti. Natural Language Processing and Text Categorization: a study on the reciprocal beneficial interactions. PhD thesis, University of Rome Tor Vergata, Rome, Italy, May 2003.

[Moschitti, 2003b] Alessandro Moschitti. A study on optimal parameter tuning for Rocchio text classifier. In Proceedings of the 25th European Conference on Information Retrieval Research (ECIR 2003), Pisa, Italy, April 2003.

General References

[Joachims, 1997] Thorsten Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. LS VIII-Report 23, Universität Dortmund, 1997.

[Sebastiani, 2002] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.

[Yang, 1999] Yiming Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1/2):67-88, 1999.

Maintained by Alessandro Moschitti moschitti[at]dit.unitn.it