A brief introduction to Hadoop, especially streaming mode, is in the chapter on Cloud Computing in my open source book on parallel programming.
To set up use of Hadoop on CSIF, do the following:
    set path = ( /usr/local/hadoop-0.20.2/bin $path )
    setenv JAVA_HOME /usr/lib/jvm/jre-1.6.0-openjdk.x86_64
    setenv HADOOP_LOG_DIR ~/hadoop/logs
    setenv HADOOP_CONF_DIR ~/hadoop/conf
    alias hadoopstart "hadoop namenode -format; start-all.sh"
The first four lines are required. The fifth, the hadoopstart alias, I added as a convenience: it formats the Hadoop file system and starts the Hadoop daemons. I run it once each day I use Hadoop on CSIF. (Note: This must be redone each day because the CSIF systems reboot nightly. The daemons disappear, and even the file system itself is lost, since CSIF stores your HDFS files in /tmp.) A sample session is sketched below.
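Here is a minimal sketch of a day's session, from starting the daemons through running a streaming job. The file names mydata.txt, mapper.py, and reducer.py are placeholders for your own data and executable map/reduce scripts, and the streaming jar path shown is the usual location inside a Hadoop 0.20.2 install; adjust it if the CSIF layout differs.

    # start the daemons for the day (formats HDFS, then runs start-all.sh)
    hadoopstart

    # confirm the daemons are up; jps should list NameNode, DataNode,
    # SecondaryNameNode, JobTracker and TaskTracker
    jps

    # copy a local input file into HDFS
    hadoop fs -mkdir input
    hadoop fs -put mydata.txt input

    # run a streaming job; mapper.py and reducer.py are your own scripts
    hadoop jar /usr/local/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -input input -output output \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py

    # look at the results
    hadoop fs -cat 'output/part-*'

Quoting the output glob keeps the shell from trying to expand it locally; Hadoop expands it against HDFS instead.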