| Name | Last modified | Size | Description |
|---|---|---|---|
| README.html | 20-Nov-2007 22:14 | 881 | |
| alt.atheism/ | 20-Nov-2007 17:53 | - | |
| comp.graphics/ | 10-Nov-2007 16:55 | - | |
| comp.os.ms-windows.misc/ | 10-Nov-2007 17:16 | - | |
| comp.sys.ibm.pc.hardware/ | 10-Nov-2007 18:00 | - | |
| comp.sys.mac.hardware/ | 10-Nov-2007 18:04 | - | |
| comp.windows.x/ | 10-Nov-2007 18:12 | - | |
| misc.forsale/ | 10-Nov-2007 18:15 | - | |
| rec.autos/ | 10-Nov-2007 18:21 | - | |
| rec.motorcycles/ | 10-Nov-2007 18:26 | - | |
| rec.sport.baseball/ | 10-Nov-2007 18:31 | - | |
| rec.sport.hockey/ | 10-Nov-2007 18:38 | - | |
| sci.crypt/ | 10-Nov-2007 18:47 | - | |
| sci.electronics/ | 10-Nov-2007 18:52 | - | |
| sci.med/ | 27-Nov-2007 09:24 | - | |
| sci.space/ | 20-Nov-2007 19:45 | - | |
| soc.religion.christian/ | 10-Nov-2007 19:19 | - | |
| talk.politics.guns/ | 10-Nov-2007 19:31 | - | |
| talk.politics.mideast/ | 10-Nov-2007 19:44 | - | |
| talk.politics.misc/ | 10-Nov-2007 19:55 | - | |
| talk.religion.misc/ | 10-Nov-2007 20:04 | - | |
This is a well-known dataset of Usenet newsgroup postings. The goal is to classify each posting automatically into its proper newsgroup. That goal is probably not very practical in itself, but the dataset is widely used as a testbed for research into document classification methods that do have practical applications.
The data consist of the postings from 20 different newsgroups, taken from the UC Irvine KDD Archive.
I have produced versions in which the headers have been removed; these files have a .nohead suffix.
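
How the .nohead files were produced is not stated here. As a rough sketch only, assuming each posting follows the usual Usenet layout (header lines, then a blank line, then the body), header removal could look like the following. The function name `strip_headers` is hypothetical.

```python
def strip_headers(posting: str) -> str:
    """Return the body of a posting: everything after the first blank line.

    Assumption: header lines come first and are separated from the body
    by one blank line, as in a typical Usenet message.
    """
    # Normalize line endings so "\r\n" postings split the same way.
    text = posting.replace("\r\n", "\n")
    head, sep, body = text.partition("\n\n")
    # With no blank line there is no header/body boundary to find,
    # so return the text unchanged.
    return body if sep else text


sample = "From: someone@example.com\nSubject: test\n\nHello world.\n"
print(strip_headers(sample))
```

Note that a file with no headers at all but a blank line in its body would lose its first paragraph under this rule, so it should only be applied to the original postings.
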
In each newsgroup directory, I've also produced a file named GrandTotal, which shows how many times each word appeared over all the files in that directory. I retained only words that appeared at least 200 times.
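
The counting described above can be sketched as follows. This is an illustration under assumptions, not the script actually used to build GrandTotal: the tokenization (lowercased runs of letters), the choice to count only the .nohead files, and the name `grand_total` are all guesses, and the real GrandTotal file format is not reproduced.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed tokenization: lowercase runs of ASCII letters.
WORD_RE = re.compile(r"[a-z]+")


def grand_total(directory, min_count=200, suffix=".nohead"):
    """Count words over all files in `directory` ending in `suffix`,
    keeping only words that occur at least `min_count` times."""
    counts = Counter()
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.endswith(suffix):
            # latin-1 maps every byte to a character, so decoding never fails.
            text = path.read_text(encoding="latin-1")
            counts.update(WORD_RE.findall(text.lower()))
    return {word: n for word, n in counts.items() if n >= min_count}
```

For example, `grand_total("sci.space")` would return a dictionary mapping each sufficiently frequent word in that directory to its total count.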