\chapter{Compilation and Linking Process}
\label{chap:compillink}

In learning computer systems, it's vital to understand how the systems
software works.  In this chapter, we'll look at compiling and linking.

\section{GCC Operations}
\label{obj}

Say our C program consists of source files {\bf x.c} and {\bf y.c}, and
we compile as follows:

 \begin{Verbatim}[fontsize=\relsize{-2}]
gcc -g x.c y.c
\end{Verbatim}

GCC\footnote{The ``English'' name of this compiler was originally the GNU C
Compiler (today GCC stands for GNU Compiler Collection), and it is
customary to refer to it by the acronym GCC.
The actual name of the file is {\bf gcc}.} is basically a ``wrapper''
program, serving only as a manager of the compilation process; it does not
do the compilation itself.  Instead, GCC will run other programs which
actually do the work.  We'll illustrate that here, using the {\bf
x.c}/{\bf y.c} example throughout.

\subsection{The C Preprocessor}

First GCC will run the C preprocessor, {\bf cpp}, which processes your
{\bf \#define}, {\bf \#include} and other similar statements.  For
example:

\begin{Verbatim}[fontsize=\relsize{-2}]
% cat y.c
// y.c

#define ZZZZ 7

#include "w.h"

main()
{  int x,y,z;

   scanf("%d%d",&x,&y);
   x *= ZZZZ;
   y += ZZZZ;
   z = bigger(x,y);
   printf("%d\n",z);
}

% cat w.h
// w.h

#define bigger(a,b) (a > b) ? (a) : (b)

% cpp y.c
# 1 "y.c"
# 1 "<built-in>"
# 1 "<command line>"
# 1 "y.c"

# 1 "w.h" 1
# 6 "y.c" 2

main()
{ int x,y,z;

   scanf("%d%d",&x,&y);
   x *= 7;
   y += 7;
   z = (x > y) ? (x) : (y);
   printf("%d\n",z);
}
\end{Verbatim}

The preprocessor's output, seen above, now can be used in the next
stage:

\subsection{The Actual Compiler, CC1, and the Assembler, AS}

Next, GCC will start up another program, {\bf cc1}, which does the
actual code translation.  Even {\bf cc1} does not quite do a full
compile to machine code.  Instead, it compiles only to assembly language,
producing a file {\bf x.s}.  GCC will then start up the assembler, AS
(file name {\bf as}), which translates {\bf x.s} to a true machine
language (1s and 0s) file {\bf x.o}.  The latter is called an {\bf
object file}.  

Then GCC will go through the same process for {\bf y.c}, producing {\bf
y.o}.  

Finally, GCC will start up the {\bf linker} program, LD (file name {\bf
ld}), which will splice together {\bf x.o} and {\bf y.o} into an
executable file, {\bf a.out}.  GCC will also delete the intermediate
{\bf .s} and {\bf .o} files it produced along the way, as they are no
longer needed.
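
The individual stages can also be run by hand, one at a time.  The
following shell script is a minimal sketch of that, not a record of what
GCC does internally.  It assumes a system with GCC installed, works in a
scratch directory, and uses two tiny stand-in files for {\bf x.c} and
{\bf y.c} whose contents are invented here purely for illustration.
GCC's {\bf -E}, {\bf -S} and {\bf -c} options stop after preprocessing,
compiling to assembly language, and assembling, respectively.

\begin{Verbatim}[fontsize=\relsize{-2}]
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Stand-in sources (contents are hypothetical): main() in x.c calls f() in y.c.
cat > x.c << 'EOF'
int f(int u);
int main(void) { return f(3) == 4 ? 0 : 1; }
EOF
cat > y.c << 'EOF'
int f(int u) { return u + 1; }
EOF

gcc -E x.c -o x.i   # preprocess only; the output goes to x.i
gcc -S x.c          # preprocess and compile; the output is x.s
gcc -c x.s          # assemble x.s, producing x.o
gcc -c y.c          # all of the above for y.c, producing y.o
gcc x.o y.o         # link the object files, producing a.out
./a.out             # exit status is 0, since f(3) is 4
\end{Verbatim}

Since each stage is run separately here, the intermediate files stay on
disk and can be inspected.  Adding the {\bf -v} option to any of these
commands makes GCC print the commands it runs for each stage.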

\section{The Linker: What Is Linked?}
\label{linking} 

Recall our example above of a C program consisting of two source files,
{\bf x.c} and {\bf y.c}.  We noted that the compilation command,

\begin{Verbatim}[fontsize=\relsize{-2}]
gcc -g x.c y.c
\end{Verbatim}

would temporarily produce two object files, {\bf x.o} and {\bf y.o}, and
that GCC would call the linker program, LD, to splice these
two files together to form the executable file {\bf a.out}.

Exactly what does the linker do?  Well, suppose {\bf main()} in {\bf
x.c} calls a function {\bf f()} in {\bf y.c}.  When the compiler sees
the call to {\bf f()} in {\bf x.c}, it will say, ``Hmm...There is no
{\bf f()} in {\bf x.c}, so I can't really translate the call, since I
don't know the address of {\bf f()}.''  So, the compiler will make a
little note in {\bf x.o} to the linker saying, ``Dear linker:  When you
link {\bf x.o} with other {\bf .o} files, you will need to determine
where {\bf f()} is in one of those files, and then finish translating
the call to {\bf f()} in {\bf x.c}.''\footnote{Or, in our GCC command
line we may ask the linker to get other functions from libraries.
See Section \ref{libraries} below.}

Similarly, {\bf x.c} may declare some global variable, say {\bf z},
which the code in {\bf y.c} references.  The compiler will then leave a
little note to the linker in {\bf y.o}, etc.

So, the linker's job is to resolve issues like this before combining the
{\bf .o} files into one big executable file, {\bf a.out}.  
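
These ``notes'' can be inspected directly.  The script below is a sketch
under some assumptions: the contents of the two source files are made up
here to match the description above ({\bf main()} in {\bf x.c} calls
{\bf f()} in {\bf y.c}, and {\bf y.c} references a global {\bf z}
declared in {\bf x.c}), and the {\bf nm} and {\bf objdump} utilities are
assumed to be installed.  The exact output format varies from one
platform to another.

\begin{Verbatim}[fontsize=\relsize{-2}]
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Stand-in sources (contents are hypothetical).
cat > x.c << 'EOF'
int z = 5;       /* global variable, referenced in y.c */
int f(int u);    /* defined in y.c */
int main(void) { return f(1) == 6 ? 0 : 1; }
EOF
cat > y.c << 'EOF'
extern int z;    /* defined in x.c */
int f(int u) { return u + z; }
EOF

gcc -c x.c y.c   # produce x.o and y.o without linking
nm x.o           # f is marked U (undefined); main and z are defined
nm y.o           # z is marked U (undefined); f is defined
objdump -r x.o   # relocation entries: the "notes" left for the linker
gcc x.o y.o      # the linker resolves f and z
nm a.out         # f and z now both have addresses
./a.out          # exit status is 0, since f(1) is 1 + 5
\end{Verbatim}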

It doesn't matter whether our original source code was C/C++ or
assembly language.  Machine code is machine code, no matter what its
source is, so the linker won't care which original language was the
source of which {\bf .o} file.

Though GCC invokes LD, we can run it directly too, which as we have seen
is common in the case of writing assembly language code.\footnote{We can
also apply GCC to the assembly-language file.  GCC will notice that the
file name has a {\bf .s} extension, and thus will invoke the assembler,
AS.}

\section{Headers in Executable Files}

Even if our source code consists of just one file (as opposed to, for
instance, our example of {\bf x.c} and {\bf y.c} above), so that there
is nothing to link, we must still invoke the linker to produce an
executable file.  There are a couple of reasons for this.

First, in the case of C, you are linking in libraries without realizing
it.  If you run the {\bf ldd} command on an executable which was
compiled from C, you'll find that the executable uses the C library,
{\bf libc.so}.  (More on {\bf ldd} below.)  At the very least, that
library provides a place to begin execution (the label {\bf \_start},
which we found in Section \ref{analysis} to be required) and sets up your
program's stack.  That library also includes many basic functions, such
as {\bf printf()} and {\bf scanf()}.

But beyond that, executable files consist not only of the actual machine
code but also a {\bf header}, a kind of introduction, at the start of
the file.  The header will state at what memory locations the various
sections are to be loaded, how large the sections are, at what address
execution is to begin, and so on.  The linker will make these decisions,
and place them in the header. 

There are various standard formats for these headers.  In Linux, the ELF
format is used.  If you are curious, the Linux {\bf readelf} command
will tell you what is in the header of any executable file.  By running
this command with, e.g., the {\bf -s} option, you can find out the final
addresses assigned to your labels by the linker.\footnote{A more common
command for this is {\bf nm}.  It's more general, as it does not apply
only to ELF files, but it only gives symbol information.}

There are headers in object files as well; the Linux command {\bf
objdump} will display these for you.
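
As a quick illustration, here is a sketch of looking at these headers on
a Linux system.  The one-line program {\bf t.c} is a stand-in invented
for this example, and the option letters shown ({\bf -h} and {\bf -l}
for {\bf readelf}, {\bf -h} for {\bf objdump}) are just one reasonable
choice among the many that these tools offer.

\begin{Verbatim}[fontsize=\relsize{-2}]
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# A one-file stand-in program (contents are hypothetical).
cat > t.c << 'EOF'
int main(void) { return 0; }
EOF

gcc -c t.c       # produce the object file t.o
gcc -o t t.o     # link it into the executable t
readelf -h t     # ELF header, including the entry point address
readelf -l t     # program headers: where each segment is to be loaded
objdump -h t.o   # section headers of the object file
\end{Verbatim}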

\section{Libraries}
\label{libraries}

A library is a conglomeration of several object files, collected
together in one convenient place.  Libraries can be either {\bf static}
or {\bf dynamic}, as explained below.  In Unix, library names end with
{\bf .a} (static) or {\bf .so} (dynamic), possibly followed by a version
number.\footnote{You may have encountered dynamic libraries on Windows
systems, which have the suffix {\bf .dll}.} The names also usually start
with ``lib,'' so that for example {\bf libc.so.6} is version 6 of the C
library.

When you need code from a static library, the linker physically places
the code in with your machine code.  In the dynamic case, though, the
linker merely places a note in your machine code, which states which
libraries are needed; the OS will do the actual linking at run time.
Dynamic libraries save a lot of disk space and memory (at the
slight cost of a bit of a delay in loading the program into memory at
run time), since we only need a single copy of any given library.  Here
is an overview of how the libraries are created and used:

One creates a static library by applying the {\bf ar} command to the
given group of {\bf .o} files, and then possibly running {\bf ranlib}.
For example:

\begin{Verbatim}[fontsize=\relsize{-2}]
% gcc -c x.c
% gcc -c y.c
% ar rc lib8888.a x.o y.o
% ranlib lib8888.a
\end{Verbatim}

Later, if someone wants to check what's inside a static library, one
runs {\bf ar} on the {\bf .a} file, with the {\bf t} option, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
% ar t lib8888.a
\end{Verbatim}

To create a dynamic library, one needs to use the {\bf -fPIC} option
when compiling to produce the {\bf .o} files, and then one builds the
library by using GCC's {\bf -shared} option, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
% gcc -g -c -fPIC x.c
% gcc -g -c -fPIC y.c
% gcc -shared -o lib8888.so x.o y.o
\end{Verbatim}

In systems that use ELF, you can check what's in a dynamic library by
using {\bf readelf}, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
% readelf -s lib8888.so
\end{Verbatim}

When you link to a library, GCC's {\bf -l} option can be used to state
which library to link in.  For example, if you have a C program {\bf
w.c} which calls {\bf sqrt()}, which is in the math library {\bf
libm.so}, you would type

\begin{Verbatim}[fontsize=\relsize{-2}]
% gcc -g w.c -lm
\end{Verbatim}

The notation ``-lxxx'' means ``link the library named {\bf libxxx.a}
or {\bf libxxx.so}.'' 

One point to note, though, is that the library you need may not be in
the default directories {\bf /usr/lib}, {\bf /lib} and those listed in
{\bf /etc/ld.so.cache}.  If for example your library {\bf libqrs.so} is
in the directory {\bf /a/b/c}, you would type

\begin{Verbatim}[fontsize=\relsize{-2}] 
% gcc -g w.c -lqrs -L/a/b/c
\end{Verbatim}

This is fine in the static case, but remember that in the dynamic case,
the library is not actually acquired at link time.  All that is put in
our executable file is the name of the library, but NOT its location.
In other words, in the dynamic case, the {\bf -L/a/b/c} in the example
above is used by the linker to verify that the library does exist, but
this location is NOT recorded.  When the program is actually run, the OS
will still look only in the default directories.  There are various ways
to tell it to look at others.  One of the ways is to specify {\bf /a/b/c} 
within the Unix environment variable LD\_LIBRARY\_PATH, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
% setenv LD_LIBRARY_PATH /a/b/c
\end{Verbatim}

If you need to know which dynamic libraries an executable file needs,
run {\bf ldd}, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
% ldd a.out
\end{Verbatim}
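
To tie these pieces together, here is a sketch that builds the same
small library both ways and links a program against it.  Several things
here are assumptions made for illustration: the contents of {\bf f.c}
and {\bf w.c} are invented, the library is named {\bf libqrs} as in the
example above, the current directory (via {\bf -L.}) plays the role of
{\bf /a/b/c}, and the environment variable is set using Bourne-shell
syntax rather than {\bf setenv}.

\begin{Verbatim}[fontsize=\relsize{-2}]
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Stand-in sources (contents are hypothetical): w.c calls f() from the library.
cat > f.c << 'EOF'
int f(int u) { return u + 1; }
EOF
cat > w.c << 'EOF'
int f(int u);
int main(void) { return f(3) == 4 ? 0 : 1; }
EOF

# Static case: the code for f() is copied into the executable.
gcc -c f.c
ar rc libqrs.a f.o
ranlib libqrs.a
gcc -o w_static w.c -L. -lqrs
./w_static                           # runs with no further setup
rm libqrs.a f.o

# Dynamic case: only the name of the library is recorded.
gcc -c -fPIC f.c
gcc -shared -o libqrs.so f.o
gcc -o w_dynamic w.c -L. -lqrs
ldd w_dynamic                        # libqrs.so is typically "not found"
LD_LIBRARY_PATH="$PWD" ./w_dynamic   # found once its directory is named
\end{Verbatim}

The static library is removed before the dynamic one is built so that
there is no question of which of the two {\bf -lqrs} picks up.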

Clearly this is a complex topic.  If you ever need to know more than
what I have in this introduction, plug something like ``shared library
tutorial'' into a Web search engine.

\section{A Look at the Final Product}

So, let's say we take our first sample program, in Section
\ref{sample1}, and assemble it and link it, with the executable file
being {\bf tot}.  

Recall that we had {\bf .data} and {\bf .text} sections.  The linker
chooses addresses for the sections.  We can determine those, for
example, by running

\begin{Verbatim}[fontsize=\relsize{-2}]
% readelf -s tot
\end{Verbatim}

on our executable file {\bf tot}.  The relevant excerpt of the output is

\begin{Verbatim}[fontsize=\relsize{-2}]
     9: 08049094     0 NOTYPE  LOCAL  DEFAULT    2 x
    10: 080490a4     0 NOTYPE  LOCAL  DEFAULT    2 sum
    11: 08048083     0 NOTYPE  LOCAL  DEFAULT    1 top
    12: 0804808b     0 NOTYPE  LOCAL  DEFAULT    1 done
    13: 08048074     0 NOTYPE  GLOBAL DEFAULT    1 _start
\end{Verbatim}

So we see that {\bf ld} has arranged things so that, when our program is
loaded into memory at run time, the {\bf .data} section will start at
0x08049094 and the {\bf .text} section at 0x08048074.

We can use GDB to confirm that these addresses really hold at run time:

\begin{Verbatim}[fontsize=\relsize{-2}]
% gdb tot
(gdb) p/x &x
$1 = 0x8049094
(gdb) p/x &_start
$2 = 0x8048074
\end{Verbatim}

We can also confirm that the linker really did fix the temporary machine
code the assembler had produced from 

\begin{Verbatim}[fontsize=\relsize{-2}]
movl $x, %ecx
\end{Verbatim}

Recall that the assembler didn't know the address of {\bf x}, since the
location of the {\bf .data} section would not be set until later, when
the linker ran.  So, the assembler left a note in the {\bf .o} file,
asking the linker to put in the real address.  Let's check that it did:

\begin{Verbatim}[fontsize=\relsize{-2}]
(gdb) b 24
Breakpoint 1 at 0x804807e: file tot.s, line 24.
(gdb) r
Starting program: /root/50/tot
Breakpoint 1, _start () at tot.s:24
24            movl $x, %ecx  # ECX will point to the current
Current language:  auto; currently asm
(gdb) disassemble
Dump of assembler code for function _start:
0x08048074 <_start+0>:  mov    $0x4,%eax
0x08048079 <_start+5>:  mov    $0x0,%ebx
0x0804807e <_start+10>: mov    $0x8049094,%ecx
End of assembler dump.
(gdb) x/5b 0x0804807e
0x804807e <_start+10>:  0xb9    0x94    0x90    0x04    0x08
\end{Verbatim}

We see that the machine code for the instruction really does contain the
actual address of {\bf x}.  

\checkpoint


