Index of /~matloff/132/Data/20_newsgroups

Name                        Last modified      Size  Description

Parent Directory                               -
README.html                 20-Nov-2007 22:14  881
alt.atheism/                20-Nov-2007 17:53  -
comp.graphics/              10-Nov-2007 16:55  -
comp.os.ms-windows.misc/    10-Nov-2007 17:16  -
comp.sys.ibm.pc.hardware/   10-Nov-2007 18:00  -
comp.sys.mac.hardware/      10-Nov-2007 18:04  -
comp.windows.x/             10-Nov-2007 18:12  -
misc.forsale/               10-Nov-2007 18:15  -
rec.autos/                  10-Nov-2007 18:21  -
rec.motorcycles/            10-Nov-2007 18:26  -
rec.sport.baseball/         10-Nov-2007 18:31  -
rec.sport.hockey/           10-Nov-2007 18:38  -
sci.crypt/                  10-Nov-2007 18:47  -
sci.electronics/            10-Nov-2007 18:52  -
sci.med/                    27-Nov-2007 09:24  -
sci.space/                  20-Nov-2007 19:45  -
soc.religion.christian/     10-Nov-2007 19:19  -
talk.politics.guns/         10-Nov-2007 19:31  -
talk.politics.mideast/      10-Nov-2007 19:44  -
talk.politics.misc/         10-Nov-2007 19:55  -
talk.religion.misc/         10-Nov-2007 20:04  -

This is a well-known dataset of Usenet newsgroup postings. The task is to classify each posting automatically into its proper newsgroup. That task is probably not very practical in itself, but the dataset is widely used as a testbed for research on document classification methods aimed at problems that are practical.

The data consist of postings from 20 different newsgroups, obtained from the UC Irvine KDD Archive.

I have produced versions in which the headers have been removed; these files have a .nohead suffix.
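For readers who want to reproduce something like the .nohead files, here is a minimal Python sketch. It is not necessarily how the files here were made. It assumes the usual Usenet message layout, in which the header block ends at the first blank line. The function names strip_headers and write_nohead and the latin-1 encoding are my own choices for illustration.

```python
from pathlib import Path


def strip_headers(text: str) -> str:
    """Return the body of a posting: everything after the first blank line.

    Assumption: headers and body are separated by one blank line, as in
    standard Usenet/e-mail messages.
    """
    # Normalize CRLF line endings so the blank-line search works either way.
    normalized = text.replace("\r\n", "\n")
    head, sep, body = normalized.partition("\n\n")
    # If there is no blank line at all, assume there are no headers.
    return body if sep else normalized


def write_nohead(path: Path) -> Path:
    """Write a header-free copy of `path` next to it, with a .nohead suffix."""
    # latin-1 is an assumption; it never fails to decode arbitrary bytes.
    text = path.read_text(encoding="latin-1")
    out = path.with_name(path.name + ".nohead")
    out.write_text(strip_headers(text), encoding="latin-1")
    return out
```

Calling write_nohead(Path("sci.med/12345")) on a hypothetical posting file would then produce sci.med/12345.nohead.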

In each newsgroup directory, I've also produced a file named GrandTotal, which shows how many times each word appears across all the files in that directory. I retained only words with a frequency of at least 200.
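The counting behind a file like GrandTotal can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used here: the tokenization rule (lowercased runs of letters and apostrophes), the choice to read only the .nohead files, the latin-1 encoding, and the names count_words and grand_total are all mine. Only the threshold of 200 comes from the description above.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed tokenization: lowercase runs of letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def count_words(texts, min_count=200):
    """Total the word frequencies over `texts`; keep words seen >= min_count times."""
    totals = Counter()
    for text in texts:
        totals.update(WORD_RE.findall(text.lower()))
    return {word: n for word, n in totals.items() if n >= min_count}


def grand_total(directory, min_count=200, suffix=".nohead"):
    """Apply count_words to every file in `directory` whose name ends in `suffix`.

    Reading only the .nohead files is an assumption; pass suffix="" to
    include every file in the directory.
    """
    files = sorted(p for p in Path(directory).iterdir()
                   if p.is_file() and p.name.endswith(suffix))
    texts = (p.read_text(encoding="latin-1") for p in files)
    return count_words(texts, min_count)
```

For example, grand_total("sci.space") would return a dictionary mapping each retained word to its total count in that newsgroup's directory.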