Introductory C Program

The program below will serve as our introduction to C. It implements a subset of the Unix wc command, which reports the number of characters, words and lines in a text file.

footnote: Type man wc to get a fuller account of this Unix command if you are curious.

Since we have not covered the material on file manipulation in C yet, we will read from what in Unix is called the standard input. By default, this means input from the keyboard, but by using the Unix input-redirection symbol, <, we can have a file playing the role of the standard input, and thus our wc program will be able to read from real files (more on this below).

Program Listing

The source file, i.e. the C-language file (as opposed to the machine-language file which the compiler generates from it), for the program is as follows. Note that the line numbers have been added here for clarity and were not part of the source file.

footnote: The Unix cat command, with the -n option, was used to get the line numbers.

  1	
  2	/* introductory C program 
  3	
  4	   implements (a subset of) the Unix wc command  --  reports character, 
  5	   word and line counts; in this version, the "file" is read from the 
  6	   standard input, since we have not covered C file manipulation yet, 
  7	   so a real file can be read by using the Unix `<'
  8	   redirection feature */
  9	
 10	
 11	#define MaxLine 200  
 12	
 13	
 14	char Line[MaxLine];  /* one line from the file */
 15	
 16	
 17	int NChars = 0,  /* number of characters seen so far */
 18	    NWords = 0,  /* number of words seen so far */
 19	    NLines = 0,  /* number of lines seen so far */
 20	    LineLength;  /* length of the current line */ 
 21	
 22	
 23	PrintLine()  /* for debugging purposes only */
 24	
 25	{  int I;
 26	
 27	   for (I = 0; I < LineLength; I++) printf("%c",Line[I]);
 28	   printf("\n");
 29	}
 30	
 31	
 32	int WordCount()
 33	
 34	/* counts the number of words in the current line, which will be taken
 35	   to be the number of blanks in the line, plus 1 (except in the case 
 36	   in which the line is empty, i.e. consists only of the end-of-line 
 37	   character); this definition is not completely general, and will be
 38	   refined in another version of this function later on */
 39	
 40	{  int I,NBlanks = 0;  
 41	
 42	   for (I = 0; I < LineLength; I++)  
 43	   if (Line[I] == ' ') NBlanks++;
 44	
 45	   if (LineLength > 1) return NBlanks+1;
 46	   else return 0;
 47	}
 48	
 49	
 50	int ReadLine()
 51	
 52	/* reads one line of the file, returning also the number of characters
 53	   read (including the end-of-line character); that number will be 0
 54	   if the end of the file was reached */
 55	
 56	{  char C;  int I;
 57	
 58	   if (scanf("%c",&C) == -1) return 0;
 59	   Line[0] = C;
 60	   if (C == '\n') return 1; 
 61	   for (I = 1; ; I++) {
 62	      scanf("%c",&C);     
 63	      Line[I] = C;  
 64	      if (C == '\n') return I+1;
 65	   }  
 66	}
 67	
 68	
 69	UpdateCounts()
 70	
 71	{  NChars += LineLength;
 72	   NWords += WordCount();
 73	   NLines++;
 74	}
 75	 
 76	 
 77	main()  
 78	
 79	{  while (1)  {
 80	      LineLength = ReadLine();
 81	      if (LineLength == 0) break;
 82	      UpdateCounts();
 83	   }
 84	   printf("%d %d %d\n",NChars,NWords,NLines);
 85	}
 86

Compiling, Running and Testing the Program

To compile the program, whose source file I named WC.c, I gave the command

gcc -g WC.c

This produces a file a.out, which is the machine-language, executable file. The -g option tells the compiler not to discard the symbol table, i.e. the variable names; these now will be retained in the a.out file, which will make debugging much easier, as we will see later.

To execute the program, just type

a.out

(if inputting from the keyboard) or

 
a.out < filename

(if inputting from some file). In the latter case, we are tricking the program; it thinks it is reading from the keyboard, but we are arranging things so that it reads from the given file instead.

footnote: In a similar manner, we can redirect the output to a file. The program here thinks it is writing to our terminal screen, but if we give the command

a.out < filename1 > filename2

then the output will go to filename2 instead.

To test the program, I typed

 
a.out < z

taking input from the file z. The latter, which I created using the emacs editor, consisted first of a line ``ABC'', then a line ``DE F'', then an empty line, then a line ``G''. I can use cat to get a quick look at the file:

heather% cat z
ABC
DE F

G

This is a total of 12 characters, four ``words,'' and four lines.

footnote: The character count includes the end-of-line characters, one for each line. We can check this by typing

ls -l z

(type man ls for more information).

Sure enough, my test worked (after some debugging first!):

heather% a.out < z
12 4 4

Analysis of the Program

As in writing a program, one should read someone else's program in a top-down manner, by first taking a look at the program's global variables, and then reading the main program. Let's do so:

Declarations of Global Variables, Lines 14-20:

On Line 14, we see an array named Line. It is an array of characters. We will be reading one line of the input file at a time, storing the current line in this array Line. The length of that line will be stored in an integer variable LineLength (Line 20).

Since we are supposed to be counting the total numbers of characters, words and lines in the file, we have variables set up to do that (Lines 17-19).

Main Program, Lines 77-85:

Every C program is required to have a function main(), which serves as the analog of the main program in Pascal, i.e. execution starts here.

The bulk of the function main() in this example consists of a while loop, in Lines 79-83. C-language while loops are very similar to those in Pascal, but differ in some details.

First of all, in C there is no formal data type corresponding to Pascal's boolean. Instead, any nonzero value is treated as `true', while zero is considered `false'. Thus in Line 79 we have something which would correspond to ``while (true)'' in Pascal.

Second, the C analogs of Pascal's begin and end are the characters `{' and `}', i.e. left- and right-braces. So, the loop extends from Line 79 to Line 83, as indicated earlier.

Third, C has a leave-the-loop statement, break. Thus in this example, the point at which the loop is exited is Line 81.

On Line 80 we have a call to a function ReadLine(). As its name implies, it will read a line from the input file; it will also return the number of characters in that line (as stated in the comments at the definition of the function, Lines 52-54); we are recording this returned value in the variable LineLength.

Then on Line 81, we are asking whether no characters were read, which means that we have reached the end of the file, in which case we will leave the loop. Otherwise we will (on Line 82) make a call to a function UpdateCounts(), which, as its name implies, will update the numbers of characters, words and lines we have seen so far.

footnote: Note the phrasing here, ``make a call to a function.'' We did not say ``make a call to a procedure.'' In C, all subprograms--and for that matter, main() too--are called functions, whether they return values or not.

Note very, VERY carefully that in C, tests of equality are made with the symbol ==, not =. If for example we had written Line 81 as

if (LineLength = 0) break;

this would have been legal, and accepted by the compiler, but would have produced very wrong results: The value 0 would be assigned to LineLength, resulting in the whole expression being 0, thus being considered `false', and the break would never be executed; it would be an infinite loop.

So, the while loop is pretty simple: Read a line, update counts, read a line, update counts, ..., until you reach the end of the file.

On Line 84, we print out the results, by making a call to printf(), a function in the C library. It is very similar to the Pascal write() procedure, but in C there is more flexibility, and consequently more details to attend to.

For example, one must state the format which is to be used in printing out each item. Here we have specified the %d (``decimal'') format, used to print out items in integer form, in printing out all three variables. We have also asked for exactly one space between each pair of consecutive items, and for a new-line character \n to be written after the third item, forcing the cursor to go to the next line on the screen.

We will note in passing that even though printf() is a library function, i.e. a function that we didn't write ourselves, it is of course a function, and it has parameters, in this particular instance four parameters--a string (the quoted part) and three integers (NChars, NWords, NLines). This was true in Pascal too--Pascal's write() is a procedure, with parameters--but you will find that in C one needs to pay more attention to such things, so we have mentioned it here.

Now continuing in a top-down fashion, let's look at the other functions, and the other miscellaneous components of the program:

Lines 2-8: This is a comment. As you can see, comments in C are delineated by /* and */.

Line 11: C's #define is like Pascal's const, but much more powerful, as we will see later.

Line 14: This declares an array of MaxLine characters. In C, array subscripts start at 0, so this declaration here would be Line: array[0..MaxLine-1] of char in Pascal.

The type char in C is treated as a special case of the integer type, int. Thus, since integers can be compared, e.g. in

int X,Y;
...
...
if (X < Y) ...

then so can characters, e.g. in

char C;
...
...
if (C > 'g') ...

Lines 17-19: One can initialize variables in C declarations.

footnote: Concerning variables whose initial values you wish to be 0: C guarantees that global variables, such as the ones declared here, start out with the value 0 if you don't specify otherwise, and thus if you wish the initial value of, say, a global int variable to be 0, this will be done automatically. However, it is good practice to explicitly initialize such variables to 0, for two reasons. First, local variables declared inside a function get no such automatic initialization, so relying on it out of habit may produce a hard-to-find bug. Secondly, it makes your program clearer.

Lines 23-29: Here the function PrintLine() is being defined. It has no parameters, but the parentheses are required anyway.

Line 25: The integer variable I is declared here as a local variable.

Line 27: Here is a for loop. In C, for loops are like those in Pascal, but are much more powerful.

In this loop, the three fields say that: I will be initialized to 0 (note that in C the assignment operator is =, not :=); the loop will iterate until the condition I < LineLength is violated; and the variable I will be incremented by 1 (``++'') at the end of each iteration.

footnote: Instead of I++ we could use I = I + 1. This would work, but may be less efficient, depending on the machine and the compiler used, though a modern optimizing compiler will typically generate the same machine code for both forms. Constructs like this were included in the original design of the C language so as to give hints to the compiler, so that the latter could produce more efficient machine code. Of course, it also makes typing easier for the programmer.

The body of the loop here consists of a single statement, a call to the printf() function. Here we are printing to the screen the Ith character of our input line, doing the printing using the character format %c.

The second field in a for loop can include any boolean expression; e.g. it could be I < LineLength && J > 12, where && is the analog of Pascal's and operator.

Line 32: This is the start of the function WordCount(). It counts the number of words in the current line, and returns that value. The type of the returned value is integer.

footnote: As mentioned earlier, all C subprograms are called ``functions,'' whether or not they return values. If we don't state the type of the return value, it is by default of type integer. We can, however, declare the return value to be of type void, for the sake of clarity; it is a way of saying explicitly that the function is not intended to return a value.

Line 40: Local variables.

Lines 42-43: Here we are counting the number of blank characters.

Line 43: Note again that in C we use a single = for assignment statements, but a double one == for testing equality. Use of the former when the latter is needed is a common mistake for learners of C.

Note the use of the increment operator ++ again.

Lines 45-46: C's if-then-else construct takes the form

if (boolean expression) statement1
else statement2

In C, all statements must be terminated by a semicolon (except compound statements, discussed below). So, the semicolon in Line 45 is part of the ``statement1'' here, not part of the if-then-else itself.

The return construct is what is used to give the function its return value. By contrast, in Pascal, the analog of return 0 in Line 46 would be WordCount:=0.

footnote: return can also be used without a value; this results in leaving the function and going back to the point of the call, but without returning a value. This is useful in functions which are like Pascal procedures instead of Pascal functions. You may also find it useful to quit a program at some line in the middle of the source code, instead of at the last line; you can call the exit() function for this.

Line 50: Here is the ReadLine() function. Note again that it returns an integer value, so we have written ``int'' on this line, just before the function name.

Line 58: The function scanf() is like Pascal's read(). Again, it requires that formats be specified, as with printf(). We are reading a single character (note the %c format) into the variable C. But why do we need the ampersand (&) in front of the C? Here is the reason:

As mentioned earlier for printf(), the function scanf() does have parameters, in this instance two. Recall that in Pascal there are two types of parameters, pass-by-value and pass-by-reference. The latter type is denoted by the keyword var, and is used for any parameter whose value will be changed by the subprogram. The situation in C is somewhat similar, in that the two cases must be distinguished, and in the pass-by-reference case we must take special action. That special action is to write the ampersand, as we have done here for the parameter C (note that C does get changed by the function; whatever value it had before the call will now be replaced by the new character just read in).

Pass-by-reference is a bit more delicate in C than in Pascal, and thus here in this introductory C program we have avoided using parameters except where absolutely necessary, i.e. except in the calls to printf() and scanf(). We will see how to deal with parameters in a later unit.

On Line 58, we are also taking advantage of the fact that, like many of the C library functions, scanf() returns a value which serves as an error code. If the value returned is -1 (the value which the library's constant EOF has on typical systems), this means that the attempt to read failed because we reached the end of the file; we check for that here. On Line 62, though, we don't make this check, since we are in the middle of reading a line there (we are counting on there being a newline character), so we just ignore the value returned by scanf() here.

On Line 61, note that the second of the three fields in the for statement is blank. This illustrates the flexibility of C's for relative to Pascal's. On the other hand, it is somewhat dangerous in this case: if a line's length exceeds MaxLine, the length of our array Line, the loop will go on storing characters past the end of the array.

footnote: It is important to note that most C compilers will not produce code to check for ``array index out of bounds'' errors like this. The program will not be killed if this occurs, and erratic results may occur.

Lines 71-72: Here we are using the += operator. Line 71, for example, is functionally equivalent to

NChars = NChars + LineLength;

However, as with ++, the += operator is used to give a hint to the compiler; depending on the machine, the compiler may make use of the knowledge that NChars is both a source and a destination operand here, and thus produce more efficient machine code.

Remarks on Style: I have definitely used a top-down design here, keeping all functions very short, and having the functions themselves contain a number of calls to further functions, with meaningful names, so that one can glance through a function and get a quick overview of what any module does.

Similarly, I have used indenting to clarify the program. Lines 61-65 exemplify this, with the body of the for loop being indented. Also, I have followed the usual C convention of using braces { } in a ``triangular'' form, with the ``for'', the { and the } forming a triangle:

for (...)  {
   ...
   ...
}

And of course I have included lots of comments, especially to describe what roles the variables play (see the comments in Lines 17-20).

Employ these devices--top-down style, indenting and comments--in all your programs, FROM THE MOMENT YOU START WRITING THEM, NOT JUST WHEN YOU ARE DONE WRITING AND DEBUGGING! You will save yourself a lot of time, both in writing and in debugging. Use of top-down style is especially important. It will help your thinking process tremendously during the time you are writing the program. You may have been told in the past that this is to help other people, i.e. to help other people read your program; that is true, but it is also to help yourself! It will save you time!

A More Sophisticated Function WordCount()

The WC program used above as an introduction to C is not fully general. It does not cover the case in which two words are separated by two or more blanks, or the case in which a blank begins or ends a line.

Below is another version of the function WordCount() from that program. It is more general, covering these exceptional cases. The strategy is outlined in the comments: We keep looping, alternately skipping over blanks until a word is found, recording that word in our word count, and then skipping over that word.

The version of WordCount() here is important because it uses for and while loops in a more sophisticated fashion. Look at Line 17, for example:

 
   for (I = 0; I <= LLMinus2; ) {

Notice that the third field, which normally would contain something like I++, is blank; in other words, this field says, ``At the end of each loop iteration, do nothing,'' as opposed to, say, ``At the end of each loop iteration, increment I.'' Instead, the incrementing of I is done within the loop itself, at Lines 19 and 27, and I may actually be incremented several times within one iteration of the loop. So here is an example of how C's for loops are more flexible and powerful than Pascal's.

The two while loops here are like Pascal's, but pay attention to the notation in Line 27: ``!='' means ``not equal to,'' like Pascal's ``<>''. ``&&'' means ``and.''

 
     1	
     2	int WordCount()
     3	
     4	{  int I,
     5	       LLMinus2,  /* position of the character just before the
     6	                     end-of-line character */
     7	       Count;  /* number of words we have encountered so far in
     8	                  this line */
     9	
    10	   /* if the line is empty, i.e. consists only of the end-of-line
    11	      character, then there are no words in this line */
    12	   if (LineLength == 1) return 0;
    13	
    14	   Count = 0;
    15	   LLMinus2 = LineLength - 2;
    16	
    17	   for (I = 0; I <= LLMinus2; ) {
    18	      /* scan until reach nonblank */
    19	      while (Line[I] == ' ') I++;
    20	      /* if not yet at end of line, we have found a word;
    21	         otherwise leave */
    22	      if (I <= LLMinus2) Count++; else break;
    23	      /* scan through this word, until we get past it; at that
    24	         time we will either be at a blank or the end of the
    25	         line; in the latter case we will leave, but otherwise
    26	         will continue with the loop */
    27	      while (Line[I] != ' ' && I <= LLMinus2) I++;
    28	   }
    29	
    30	   return Count;
    31	}
    32



Norm Matloff
Wed Nov 8 17:29:54 PST 1995