Homework I

Due Tuesday, October 17

Notes on Assignment Procedure

You submit your work as a .tar file, one per group. See our course syllabus for the requirements in terms of file name.

In any Homework assignment, include all the files for problems due on the same date in the same .tar file, NO SUBDIRECTORIES. When the TA's grading script unpacks the file (tar xf x.tar), all the files should be in the directory from which he issued the unpack command. Be sure to use exactly the same function names, file names etc. as in the specs! Don't break the TA's script! In order for the TA's work to go smoothly, follow the rules exactly; he may dock your grade if there are problems.

Here and in future assignments, you will submit your work as functions, which the TA's secret test scripts will call. Do not submit executable programs.

You will submit your work to the TA via the CSIF handin utility. Details will be posted soon. Your work will be graded interactively, as discussed at the beginning of the course and in our course syllabus . Recall that different members of a group may receive different grades.

In each of the two cases here, and in most programs assigned this quarter, Extra Credit will be awarded to the three groups with the fastest times in the TA's test. This will be used both to bump up your course grade and in job queries (employers sometimes call me, asking to suggest someone for a job), letters of recommendation etc.

Follow the assignment specs exactly. Note that statements like, "Use OpenMP" or "Use such-and-such an algorithm" in the specs mean that you MUST use them; they are not merely suggestions. Other than that, it is up to you how to code.

Except when explicitly stated in problem specs, you are not allowed to use any external libraries. You are limited to the C library for C and base R for R.

If you seek help on a bug from the TA or me via e-mail, use the Unix script command and send us the output to demonstrate the problem. Please do NOT send screenshots, as we cannot run them, right?

Make sure you use a debugging tool before turning to us. In the case of R's parallel package, this requires a special procedure, which I will post. In whatever language, follow my Fundamental Principle of Debugging:

As you step through your code with the debugging tool, repeatedly CONFIRM that what you THINK is true IS true. You think x should be 3 at a certain point of execution, so confirm it. You think that the "else" part of an if-then-else will be taken; confirm it. EVENTUALLY SOMETHING WILL FAIL TO CONFIRM, telling you at least where the bug is if not its nature.

Problem 1 (of 1)

Here we will be working with directed graphs. You will write functions, one in C/OpenMP and the other in R/"snow", to find the number of unordered vertex pairs (i,j), i not equal to j, that we will call reciprocal, meaning that links exist from i to j and vice versa.

Input to the functions is a two-column array (or similar object; see below), where a row consisting of i then j means a corresponding edge. For instance, the edge list

1 6
2 1
3 3
1 2
1 3
2 4
4 2
3 1
6 3

would return the value 3, picking up the reciprocal pairs (1,2), (1,3) and (2,4). Vertex numbers are assumed to be positive integers but will not necessarily be contiguous. It is assumed that there are no duplicate rows in the edge list.

Your code will take the form of a function recippar(), to be described in detail below. A natural question to ask is, How large a problem is your program expected to handle, in the sense of memory allocation? The answer is that it must work on at least the Twitter data described below.

This may be much harder to program efficiently in parallel, or for that matter even in serial, than it looks. I STRONGLY RECOMMEND WRITING THE SERIAL VERSION FIRST IN EACH CASE.

The TA's speed test will consist of the "twitter combined" file at the SNAP datasets site. He will decide on some CSIF machine on which to run the tests. Note that the timings are only for the times taken by the functions, NOT the time needed to read in the data from disk.

R version:

Your function recippar() will have the call form

recippar(edges)

where edges is a two-column matrix, data frame or data table. It must be stored in a file Problem1.R, which must contain only functions, class definitions (NOT required) and the like, i.e. no freestanding code. The function returns the number of reciprocal pairs. The function must use the "snow" portion of the parallel package.

You may find it helpful to use R's split() function, possibly in conjunction with names(), as.character() and as.numeric().

Note that the data.table package by the amazing Matt Dowle is much faster for file reading, merging of two data objects etc. than the corresponding operations in base R.

C version:

Your function must be declared as

int recippar(int *edges,int nrow)

where edges works in essence as a two-column array with nrows rows. Use row-major storage. Any arrays created within the function must be dynamic. Place your function in a file Problem1.c, again with no freestanding code.