\chapter{Subroutines on Intel CPUs}
\label{chap:sub}

\section{Overview}

Programming classes usually urge the students to use {\bf top-down} and
{\bf modular} design in their coding.  This in turn means using a lot of
calls to {\bf functions} or {\bf methods} in C-like languages, which are
called {\bf subroutines} at the machine/assembly level.\footnote{This
latter term is also used in various high-level languages, such as
FORTRAN and Perl.} 

This document serves as a brief introduction to subroutine calls in
Intel machine language.  It assumes some familiarity with Intel 32-bit
assembly language, using the AT\&T syntax with Linux.

\section{Stacks}

Most CPU types base subroutine calls on the notion of a {\bf stack}.  An
executing program will typically have an area of memory defined for use
as a {\bf stack}.  Most CPUs have a special register called the {\bf
stack pointer} (SP in general, ESP on 32-bit Intel machines), which
points to the {\bf top} of the stack.  

Note carefully that the stack is not a ``special'' area of memory, say
with a ``fence'' around it; wherever SP points, that is the stack.  And
that is true even if we aren't using a stack at all, in which case the
``stack'' is garbage.

Stacks typically grow toward memory address 0.\footnote{For this reason,
we usually draw the stack with memory address 0 at the top of the
picture, rather than the bottom.} If an item is added---{\bf
pushed}---onto the stack, the SP is decremented to point to the word
preceding the word it originally pointed to, i.e. the word with the
next-lower address than that of the word SP had pointed to beforehand.
Similarly SP is incremented if the item at the top of the stack is {\bf
popped}, i.e.  removed.  Thus the words following the top of the stack,
i.e. with numerically larger addresses, are considered the interior of
the stack.

Say for example we wish to push the value 35 onto the stack.  Intel assembly
language code to do this would be, for instance,

\begin{Verbatim}[fontsize=\relsize{-2}]
subl $4,%esp  # expand the stack by one word
movl $35,(%esp)  # put 35 in the word which is the new top-of-stack
\end{Verbatim}

However, typically a CPU will include a special instruction to do
pushes.  On Intel machines, for example, we could do the above using
just one instruction:\footnote{This would be not only easier to program,
but more importantly would also execute more quickly.}

\begin{Verbatim}[fontsize=\relsize{-2}]
push $35
\end{Verbatim}

(There is also a {\bf pushl} instruction, for consistency with
instructions like {\bf movl}, but it is just another name for {\bf
push}.  Note that this is because the items on the stack must be of word
size anyway.)  

By the way, instructions like

\begin{Verbatim}[fontsize=\relsize{-2}]
pushl w
\end{Verbatim}

where {\bf w} is a label in the {\bf .data} section, are legal, a rare
exception to Intel's theme of not allowing direct memory-to-memory
operations.

Similarly, if we wish to pop the stack and have the popped value placed
in, say, ECX, we would write

\begin{Verbatim}[fontsize=\relsize{-2}]
pop %ecx
\end{Verbatim}

Note carefully that it is the {\it stack} which is popped, not the ECX
register.

Keep in mind that a pop does not change memory.  The item is still
there; the only change is that it is not considered part of the stack
anymore.  And the stack pointer merely is a means for indicating what
\underline{is} considered part of the stack; by definition, the stack
consists of the word pointed to by ESP plus all words at higher
addresses than that.  The popped item will still be there in memory
until such time, if any, that another push is done, overwriting the
item.
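
To make the push and pop mechanics concrete, here is a minimal C sketch
of a stack that grows toward 0.  It is only a simulation under
simplifying assumptions: the array {\bf sim\_mem}, the size {\bf
SIM\_STACK\_WORDS} and all the function names are made up for
illustration, and the stack pointer is kept as a word index into the
array rather than as a byte address.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_STACK_WORDS 8   /* hypothetical size of our toy stack area */

static uint32_t sim_mem[SIM_STACK_WORDS];      /* stand-in for memory */
static uint32_t sim_sp_reg = SIM_STACK_WORDS;  /* word index of the top of
                                                  the stack; starts just
                                                  past the end (empty) */

/* push: decrement SP first, then store at the new top of the stack */
void sim_push(uint32_t val)
{
    assert(sim_sp_reg > 0);   /* real hardware makes no such check */
    sim_sp_reg--;
    sim_mem[sim_sp_reg] = val;
}

/* pop: copy out the top word, then increment SP; memory is not changed */
uint32_t sim_pop(void)
{
    assert(sim_sp_reg < SIM_STACK_WORDS);
    return sim_mem[sim_sp_reg++];
}

/* look at any word, whether or not it currently counts as "in" the stack */
uint32_t sim_peek(uint32_t index)
{
    assert(index < SIM_STACK_WORDS);
    return sim_mem[index];
}

/* current value of the simulated stack pointer */
uint32_t sim_sp(void)
{
    return sim_sp_reg;
}
```

After {\bf sim\_push(35)} followed by {\bf sim\_pop()}, the call {\bf
sim\_peek(7)} still yields 35: the popped word stays in memory until a
later push overwrites it.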

A process' amount of stack space is set when the program is loaded.  One
can run the Linux commands {\bf limit} (C shell) and {\bf ulimit} (Bash
shell) to change this value.

\section{CALL, RET Instructions}

The actual call to a subroutine is done via a CALL instruction.
Execution of this instruction consists of two actions:

\begin{itemize}

\item The current value of the PC is pushed on the stack.  

\item The address of the subroutine is placed into the PC.

\end{itemize}

We say that the first of these two actions pushes the {\bf return
address} onto the stack, meaning the place that we wish to return to
after we finish executing the subroutine.  

Note that both of the actions above, like those of any instruction, are
performed by the hardware.

For example, consider the code

\begin{Verbatim}[fontsize=\relsize{-2}]
call abc
addl %eax,%ebx
\end{Verbatim}

At first the PC is pointing to the {\bf call} instruction.  After the
instruction is fetched from memory, the CPU, as usual, increments the PC
to point to the next instruction, i.e. the {\bf addl}.  From the first
bullet item above, we see that that latter instruction's address will be
pushed onto the stack --- which is good, because we do want to save that
address, as it is the place we wish to return to when the subroutine is
done.  So, the terminology used is that the {\bf call} instruction saves
the return address on the stack.

What about the second bullet item above?  Its effect is that we do a
jump to the subroutine.  So, what {\bf call} does is save the return
address on the stack and then start execution of the subroutine.

The very last instruction which the programmer puts into the subroutine
will be {\bf ret}, a return instruction.  It pops the stack, and places
that popped value into the PC.  Since the return address had been
earlier pushed onto the stack, we now pop it back off, and the return
address will now be in the PC --- so we are executing the instruction
immediately following the {\bf call}, such as the {\bf addl} in the
example above, just as desired.\footnote{This assumes we've been careful
to first pop anything we pushed onto the stack subsequent to the call.
More on this later.}
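
The two actions of CALL, and the single action of RET, can be mimicked
by a toy model in C.  This is a sketch for illustration only: the struct
{\bf toy\_cpu}, its fields and the three functions are invented here,
and the stack pointer is kept as an array index rather than a byte
address.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_STACK_WORDS 16   /* hypothetical size of the toy stack */

/* Toy CPU state; the names are illustrative, not Intel's. */
struct toy_cpu {
    uint32_t pc;                      /* program counter (EIP on Intel) */
    uint32_t sp;                      /* stack pointer, as a word index */
    uint32_t stack[TOY_STACK_WORDS];
};

void toy_init(struct toy_cpu *c, uint32_t pc)
{
    c->pc = pc;
    c->sp = TOY_STACK_WORDS;          /* empty stack */
}

/* CALL: c->pc points to the call instruction on entry */
void toy_call(struct toy_cpu *c, uint32_t call_length, uint32_t target)
{
    c->pc += call_length;             /* fetch step: PC now points to the
                                         instruction after the call */
    assert(c->sp > 0);
    c->stack[--c->sp] = c->pc;        /* action 1: push the return address */
    c->pc = target;                   /* action 2: jump to the subroutine */
}

/* RET: pop the stack and place the popped value into the PC */
void toy_ret(struct toy_cpu *c)
{
    assert(c->sp < TOY_STACK_WORDS);
    c->pc = c->stack[c->sp++];
}
```

Starting with the PC at a 5-byte {\bf call} located at address 0x0b,
say, {\bf toy\_call()} leaves 0x10 on the top of the stack, and a later
{\bf toy\_ret()} puts that 0x10 back into the PC.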


\section{Arguments}

Functions in C code usually have arguments (another term used is {\it
parameters}), and that is also true in assembly language.  The typical
way the arguments are passed to the subroutine is again via the stack.
We simply push the arguments onto the stack before the call, and then
pop them back off after the subroutine is done.  

For instance, in the example above, suppose the subroutine {\bf abc} has two
integer arguments, and in this particular instance they will have the
values 3 and 12.  Then the code above might look like

\begin{Verbatim}[fontsize=\relsize{-2}]
push $12
push $3
call abc
pop %edx  # assumes EDX isn't saving some other data at this time
pop %edx
addl %eax,%ebx
\end{Verbatim}

The subroutine then accesses these arguments via the stack.  Say for
example {\bf abc} needs to add the two arguments and place their sum in ECX.
It could be written this way:

\begin{Verbatim}[fontsize=\relsize{-2}]
...
movl 4(%esp),%ecx
addl 8(%esp),%ecx
...
ret
\end{Verbatim}

In that {\bf movl}, the source operand is 4 bytes past where SP is
pointing to, i.e. the next-to-top element of the stack.  In the call
example above, this element will be the 3.  (The top element of the
stack is the return address, since it was the last item pushed.)  The
{\bf addl} then picks up the 12, which is 8 bytes past where SP is
pointing.

Here's what the stack looks like at the time we execute that {\bf movl}:

\begin{tabular}{ll}
~ & argument 2 \\ 
~ & argument 1 \\ 
ESP $\rightarrow$ & saved return address \\
towards 0 $\downarrow$ & ~
\end{tabular}

\section{Ensuring Correct Access to the Stack}

But wait a minute.  What if the calling program above had also been
storing something important in ECX?  We must write {\bf abc} to avoid a
clash.  So, we first save the old value of ECX---where else but the
stack?---and later restore it.  So, the beginning and end of {\bf abc}
might look like this:

\begin{Verbatim}[fontsize=\relsize{-2}]
push %ecx
...
movl 8(%esp),%ecx
addl 12(%esp),%ecx
... # ECX used here
pop %ecx
ret
\end{Verbatim}

Note that we had to replace the stack offsets 4 and 8 by 8 and 12, to
adjust for the fact that one more item will be on the stack.  No one but
the programmer can watch out for this kind of thing.  The assembler and
hardware, for example, would have no way of knowing that we are wrong if
we were to fail to make this adjustment; we would access the wrong part
of the stack, and neither the assembler nor hardware would complain.
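
The effect of the extra push on the offsets can be checked with a small
C model of the stack.  This is a sketch under simplifying assumptions:
the array {\bf arg\_stack}, the variable {\bf arg\_esp} and both
function names are invented here, and any value can stand in for the
return address.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stack of 32-bit words.  arg_esp is a byte offset into it, as on
   the real machine, so its values are multiples of 4. */
static uint32_t arg_stack[16];
static uint32_t arg_esp = sizeof arg_stack;   /* empty stack */

/* push one word: subtract 4 from ESP, then store at the new top */
void arg_push(uint32_t val)
{
    assert(arg_esp >= 4);
    arg_esp -= 4;
    arg_stack[arg_esp / 4] = val;
}

/* the word at byte offset off from ESP, i.e. the operand off(%esp) */
uint32_t arg_at(uint32_t off)
{
    assert(off % 4 == 0 && arg_esp + off < sizeof arg_stack);
    return arg_stack[(arg_esp + off) / 4];
}
```

With 12, 3 and a return address pushed, {\bf arg\_at(4)} and {\bf
arg\_at(8)} give 3 and 12.  After one more push, for the saved ECX, the
same two offsets give the return address and 3, and the arguments are
found at offsets 8 and 12 instead.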

\section{Cleaning Up the Stack}

Also, note that just before the {\bf ret}, we ``clean up'' the
stack by popping it.  This is important for several reasons:

\begin{itemize}

\item Whatever we push onto the stack, we should eventually pop, to
avoid inordinate growth of the stack.  For example, we may be using
memory near 0 for something else, so if the stack keeps growing, it will
eventually overlap that other data, overwriting it.

\item We need to restore ECX to its state prior to the call to {\bf abc}.

\item We need to ensure that the return instruction does return to the
correct place.\footnote{The reader should ponder what would happen if
the programmer here forgot to include that pop just before the return.}

\end{itemize}

Note that in our call to {\bf abc} above, we followed the call with some stack
cleanup as well:

\begin{Verbatim}[fontsize=\relsize{-2}]
pop %edx  # assumes EDX isn't saving some other data at this time
pop %edx
\end{Verbatim}

We needed to remove the two arguments which we had pushed before the
call, and these two pops do that.  

The pop operation insists that the popped value be placed somewhere, so
we use EDX in a ``garbage can'' role here, since we won't be using the
popped values.  Or, we could do it this way:

\begin{Verbatim}[fontsize=\relsize{-2}]
addl $8,%esp
\end{Verbatim}

This would be faster-executing and we wouldn't have to worry about EDX.

\section{Full Examples}
\label{full}

\subsection{First Example}

In the following modification of an example from Chapter \ref{chap:asm},
we again sum up elements of an array, but we handle initialization of
the registers by a subroutine.  We also allow starting the summation in
the middle of the array, by specifying the starting point as an argument
to the subroutine, and allow specification that only a given number of
elements be summed:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
.data
x:
      .long   1
      .long   5
      .long   2
      .long   18
      .long   8888
      .long   168
n:    .long 3
sum:
      .long 0

.text
# EAX will contain the number of loop iterations remaining
# EBX will contain the sum
# ECX will point to the current item to be summed

.globl _start
_start:
      # push the arguments before call:  place to start summing, and
      # number of items to be summed
      push $x+4  # here we specify to start summing at the 5
      push n  # here we specify how many to sum
      call init
      addl $8, %esp  # clean up the stack
top:  addl (%ecx), %ebx
      addl $4, %ecx
      decl %eax
      jnz top
done: movl %ebx, sum

init:
      movl $0, %ebx
      # pick up arguments from the stack
      mov 8(%esp),  %ecx
      mov 4(%esp),  %eax
      ret
\end{Verbatim}

Here is the output assembly listing:

\begin{Verbatim}[fontsize=\relsize{-2}]
GAS LISTING full.s 			page 1
 1              	.data
 2              	x:
 3 0000 01000000 	      .long   1
 4 0004 05000000 	      .long   5
 5 0008 02000000 	      .long   2
 6 000c 12000000 	      .long   18
 7 0010 B8220000 	      .long   8888
 8 0014 A8000000 	      .long   168
 9 0018 03000000 	n:    .long 3
10              	sum:
11 001c 00000000 	      .long 0
12              	
13              	.text
14              	# EAX will contain the number of loop iterations remaining
15              	# EBX will contain the sum
16              	# ECX will point to the current item to be summed
17              	
18              	.globl _start
19              	_start:
20              	      # push the arguments before call:  place to start summing, and
21              	      # number of items to be summed
22 0000 68040000 	      push $x+4  # here we specify to start summing at the 5
22      00
23 0005 FF351800 	      push n  # here we specify how many to sum
23      0000
24 000b E8110000 	      call init
24      00
25 0010 83C408   	      addl $8, %esp  # clean up the stack
26 0013 0319     	top:  addl (%ecx), %ebx
27 0015 83C104   	      addl $4, %ecx
28 0018 48       	      decl %eax
29 0019 75F8     	      jnz top
30 001b 891D1C00 	done: movl %ebx, sum
30      0000
31              	
32              	init:
33 0021 BB000000 	      movl $0, %ebx
33      00
34              	      # pick up arguments from the stack
35 0026 8B4C2408 	      mov 8(%esp),  %ecx
36 002a 8B442404 	      mov 4(%esp),  %eax
37 002e C3       	      ret

DEFINED SYMBOLS
              full.s:2      .data:00000000 x
              full.s:9      .data:00000018 n
              full.s:10     .data:0000001c sum
              full.s:19     .text:00000000 _start
              full.s:32     .text:00000021 init
              full.s:26     .text:00000013 top
              full.s:30     .text:0000001b done

NO UNDEFINED SYMBOLS
\end{Verbatim}

Note from line 24 of the assembly output listing that the CALL
instruction assembled to e811000000.  That breaks down to an op code of
e8 and a distance field of 11000000.  The latter (after accounting for
little-endianness) is the distance from the instruction after the CALL
to the subroutine, which from lines 25 and 33 can be seen to be
0x0021-0x0010 = 0x11.\footnote{This is what is known for Intel machines
as a {\bf near call}.  Back in the days when Intel CPUs could only
access a 64K segment of memory at a time, they needed ``near''
(within-segment) and ``far'' (inter-segment) calls, but with the flat
32-bit addressing model used today, this is outmoded, and everything is
``near.''}
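
The arithmetic in the preceding paragraph can be written out in C.  The
function below is our own illustrative helper, not part of any assembler
or library; it assumes the 5-byte form of CALL with op code e8 seen in
the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Target of a 5-byte near CALL (op code 0xe8) located at address addr.
   insn points to the 5 bytes of the instruction. */
uint32_t near_call_target(uint32_t addr, const uint8_t insn[5])
{
    assert(insn[0] == 0xe8);
    /* the distance field is stored least-significant byte first */
    uint32_t dist = (uint32_t)insn[1]
                  | (uint32_t)insn[2] << 8
                  | (uint32_t)insn[3] << 16
                  | (uint32_t)insn[4] << 24;
    uint32_t next = addr + 5;   /* address of the instruction after the CALL */
    return next + dist;         /* wraps modulo 2^32, so a subroutine at a
                                   lower address works too */
}
```

For the bytes e8 11 00 00 00 located at offset 0x0b, the function
returns 0x21, the offset of {\bf init}.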

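Note also the difference between the two push instructions.  The
first argument is an address, so we push {\bf \$x+4}, an immediate
value; the second is the contents of {\bf n}, so we push {\bf n}, with
no dollar sign.  The same distinction will arise in Section \ref{ex1}:
in the call to {\bf addone()} there, the argument is the address of
{\bf x}, not {\bf x} itself, so the push instruction generated by the
compiler pushes {\bf \$x}, not {\bf x}.  By contrast, in the translation
of the call to {\bf printf()}, {\bf x} is pushed.  Note that the format
string is an argument too.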

To gain a more concrete understanding of how the stack is used in
subroutine calls, let's use GDB to inspect what happens with the stack
when this program is run:

\begin{Verbatim}[fontsize=\relsize{-2}]
% gdb a.out
GNU gdb Red Hat Linux (5.2.1-4)
Copyright 2002 Free Software Foundation, Inc.
...
Breakpoint 1, _start () at SubArgsEx.s:24
24	      call init
Current language:  auto; currently asm
(gdb) p/x $eip
$1 = 0x804807f
(gdb) p/x $esp
$2 = 0xbfffcec8
(gdb) si
33	      movl $0, %ebx
(gdb) p/x $eip
$3 = 0x8048093
(gdb) p/x $esp
$4 = 0xbfffcec4
(gdb) x/3w $esp
0xbfffcec4:	0x08048084	0x00000003	0x080490a8
(gdb) p/x &x
$5 = 0x80490a4
\end{Verbatim}

So, the PC value changed from 0x804807f to 0x8048093, reflecting the
fact that a CALL does do a jump.

The CALL also does a push (of the return address), and sure enough, the
stack did expand by one word, as can be seen by the fact that ESP
decreased by 4, from 0xbfffcec8 to 0xbfffcec4.

The stack should now have the return address at the top, followed by the
value of {\bf n} and the address of {\bf x} plus 4.  Let's check that
return address.  The PC value just before the call had been, according
to our GDB output above, 0x804807f, and since (as discovered above from the
output of {\bf as -a}) the CALL was a 5-byte instruction, the saved
return address should be 0x804807f+5 = 0x8048084, which is indeed what
we see on the top of the stack.  

By the way, you should be able to deduce the address at which the {\bf
.text} section begins.  Do you see how?

\checkpoint

\subsection{If the PC Points to Garbage, the Machine Will Happily
``Execute'' the Garbage}

Now look at the rest of our GDB session:

\begin{Verbatim}[fontsize=\relsize{-2}]
(gdb) b done
Breakpoint 2 at 0x804808d: file SubArgsEx.s, line 30.
(gdb) c
Continuing.

Breakpoint 2, done () at SubArgsEx.s:30
30	done: movl %ebx, sum
(gdb) si
33	      movl $0, %ebx
(gdb) p sum
$6 = 25
(gdb) si
35	      mov 8(%esp),  %ecx
(gdb) q
The program is running.  Exit anyway? (y or n) y
\end{Verbatim}

After executing the instruction at {\bf done}, I checked to see if {\bf
sum} was correct, which it was.  At that point, I really should have
stopped, since the program was indeed done.  But I continued anyway,
issuing another {\bf si} command to GDB.  Did GDB refuse?  Did ``the
little man inside the computer'' refuse?  Heck, no!  Remember, it's just
a dumb machine.  The CPU has no idea that the instruction at {\bf done}
was the last instruction in the program.  So, it keeps going, executing
the next instruction in memory.  That instruction is the MOV at the
beginning of the subroutine {\bf init()}!  So, you can see that there is
nothing ``special'' about a subroutine.  There is no physical boundary
between modules of code, in this case between the calling module and
{\bf init()}.  It's all just a bunch of instructions. 

In the late 1980s, it was not common for personal computers to have
virtual memory hardware, nor an operating system (OS) to take advantage
of that hardware if it did exist.  Imagine what happened when a user
ran a recursive function and, through error, had made no provision to
terminate the recursion.  Each call would push material on the stack,
resulting in the stack pointer getting closer and closer to 0.  But when
it reached 0, it would keep going!  Remember, -4 is simply a very large
positive number if viewed as unsigned, which addresses are.  So, the
program would keep going, and the stack would keep growing.
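
The wraparound can be seen directly in C, whose unsigned arithmetic
behaves like the address arithmetic of the hardware.  A minimal sketch,
with a function name of our own choosing:

```c
#include <assert.h>
#include <stdint.h>

/* New stack pointer after pushing one 4-byte word, computed with 32-bit
   unsigned arithmetic, as addresses are. */
uint32_t esp_after_push(uint32_t esp)
{
    assert(esp % 4 == 0);   /* this sketch assumes a word-aligned ESP */
    return esp - 4u;        /* unsigned subtraction wraps modulo 2^32 */
}
```

Here {\bf esp\_after\_push(0)} yields 0xfffffffc, the 32-bit pattern
that would be $-4$ if viewed as signed.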

Eventually the stack would grow to a point at which a push operation
would write over the program itself!  This would eventually cause
garbage to be executed, which would likely result at some point in
execution of an ``instruction'' that has a nonexistent op code.  This
would crash the program.

Along the way, the stack would also overwrite the OS.  This would not
immediately cause a problem, as the OS is dormant while your program is
running.  If your program makes a system call, such as for input/output,
that by definition is a call to the OS, so trouble might arise then.
But the recursive function in our example would probably not do this.

\checkpoint

\subsection{Second Example}

The next example speaks for itself, through the comments at the top of
the file.  By the way, when you read through it, make sure you
understand what we are doing with the SHR instruction.

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
# the subroutine findfirst(v,w,b) finds the first instance of a value v in a
# block of w consecutive words of memory beginning at b, returning
# either the index of the word where v was found (0, 1, 2, ...) or -1 if
# v was not found; beginning with _start, we have a short test of the
# subroutine  

.data  # data section
x:
      .long   1
      .long   5
      .long   3
      .long   168
      .long   8888
.text  # code section
.globl _start  # required
_start:  # required to use this label unless special action taken
      # push the arguments on the stack, then make the call
      push $x+4  # start search at the 5
      push $4  # search 4 words
      push $168  # search for 168 
      call findfirst
done:
      movl %edi, %edi  # dummy instruction for breakpoint
findfirst:
      # finds first instance of a specified value in a block of words
      # EBX will contain the value to be searched for
      # ECX will contain the number of words to be searched 
      # EAX will point to the current word to search
      # return value (EAX) will be index of the word found (-1 if not found)
      # fetch the arguments from the stack
      movl 4(%esp), %ebx
      movl 8(%esp), %ecx  
      movl 12(%esp), %eax  
      movl %eax, %edx # save block start location
      # top of loop; compare the current word to the search value
top:  cmpl (%eax), %ebx
      jz found
      decl %ecx  # decrement counter of number of words left to search
      jz notthere  # if counter has reached 0, the search value isn't there
      addl $4, %eax  # otherwise, go on to the next word
      jmp top
found: 
      subl %edx, %eax  # get offset from start of block
      shrl $2, %eax  # divide by 4, to convert from byte offset to index
      ret
notthere:
      movl $-1, %eax
      ret
\end{Verbatim}
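
For comparison, here is roughly what {\bf findfirst} might look like if
written in C.  This is only our sketch of an equivalent, not compiler
output; the name {\bf findfirst\_c} is made up, and the code assumes
4-byte {\bf int}s, as on the machines discussed here.  The pointer
subtraction and shift are done at the byte level on purpose, to mirror
the SUB and SHR instructions.

```c
#include <assert.h>
#include <stddef.h>

static_assert(sizeof(int) == 4, "sketch assumes 4-byte words");

/* C counterpart of findfirst(v,w,b): index of the first word equal to v
   among the w words starting at b, or -1 if v is not there. */
int findfirst_c(int v, int w, const int *b)
{
    assert(w > 0);   /* the assembly version also assumes at least one
                        word, since it compares before testing the count */
    const int *p = b;
    for (int left = w; left > 0; left--, p++) {
        if (*p == v) {
            /* byte offset from the start of the block (the subl) */
            ptrdiff_t bytes = (const char *)p - (const char *)b;
            /* divide by 4 to convert to a word index (the shrl) */
            return (int)(bytes >> 2);
        }
    }
    return -1;       /* not found */
}
```

With the array {\bf x} above, the call {\bf findfirst\_c(168,4,x+1)}
returns 2, matching what the assembly version leaves in EAX.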

\section{Interfacing C/C++ to Assembly Language}

Programming in assembly language is very slow, tedious and unclear, so
we try to avoid it, sticking to high-level languages such as C.  But in
some cases we need to write {\it part} of a program in assembly
language, either because there is something which is highly
machine-dependent,\footnote{Note that to a large extent we can deal with
machine-dependent aspects even from C.  We can deal with the fact that
different machines have different word sizes, for example, by using C's
{\bf sizeof()} construct.  However, this is not the case for some other
situations.} or because we need extra program speed and this part of the
program is the main time consumer.

A good example is Linux.  Most of it is written in C, for convenience
and portability across machines, but small portions are written in
assembly language, to get access to certain specific features of the
given hardware.  When Linux is ported to a new type of hardware, these
portions must be rewritten, but fortunately they are small.

At first it might seem ``unnatural'' to combine C and assembly language.
But remember, both {\bf .c} and {\bf .s} files are translated to machine
language, so we are just combining machine language with machine
language, not unnatural at all.

\subsection{Example}
\label{ex1}

So, here we will see how we can interface C code to an assembly language
subroutine.  Here is our C code:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
// TryAddOne.c, example of interfacing C to assembly language
// paired with AddOne.s, which contains the function addone()
// compile by assembling AddOne.s first, and then typing
//
//    gcc -g -o tryaddone TryAddOne.c AddOne.o
//
// to link the two .o files into an executable file tryaddone
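// (recall that gcc invokes ld)

#include <stdio.h>   // for printf()
#include <stdlib.h>  // for exit()

void addone(int *p);  // written in assembly language, in AddOne.s

int x;

int main()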

{  x = 7;
   addone(&x);
   printf("%d\n",x);  // should print out 8
   exit(1);
}

\end{Verbatim}

I wrote the function {\bf addone()} in assembly language.  In order to do so,
I needed to know how the C compiler was going to translate the call to
{\bf addone()} to machine/assembly language.  In order to determine this, I
compiled the C module with the -S option, which produces an assembly
language version of the compiled code:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
% gcc -S TryAddOne.c
% more TryAddOne.s
        .file   "TryAddOne.c"
        .version        "01.01"
gcc2_compiled.:
                .section        .rodata
.LC0:
        .string "%d\n"
.text
        .align 4
.globl main
        .type    main,@function
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    $7, x
        subl    $12, %esp
        pushl   $x
        call    addone
        addl    $16, %esp
        subl    $8, %esp
        pushl   x
        pushl   $.LC0
        call    printf
        addl    $16, %esp
        subl    $12, %esp
        pushl   $1
        call    exit
.Lfe1:
        .size    main,.Lfe1-main
        .comm   x,4,4
        .ident  "GCC: (GNU) 2.96 20000731 (Red Hat Linux 7.1 2.96-85)"
%
\end{Verbatim}

There is quite a lot in that {\bf .s} output file, but the part of interest to
us here, for the purpose of writing {\bf addone()}, is this:

\begin{Verbatim}[fontsize=\relsize{-2}]
        pushl   $x
        call    addone
\end{Verbatim}

This confirms what we expected, that the compiler would produce code
that transmits the argument to {\bf addone()} by pushing it onto the
stack before the call.\footnote{Note, though, that the compiler also
first expanded the stack by 12 bytes, for no apparent reason.  More on
this below.}  So, we write {\bf addone()} so as to get the argument from
the stack:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
# AddOne.s, example of interfacing C to assembly language
#
# paired with TryAddOne.c, which calls addone
#
# assemble by typing
#
#    as --gstabs -o AddOne.o AddOne.s
#

# note that we do not have a .data section here, and normally would not;
# but we could do so, and if we wanted that .data section to be visible
# from the calling C program, we would have .globl lines for whatever
# labels we have in that section

.text

# need .globl to make addone visible to ld
.globl addone

addone:

   # will use EBX for temporary storage below, and since the calling 
   # module might have a value there, better save the latter on the stack
   # and restore it when we leave
   push %ebx

   # at this point the old EBX is on the top of the stack, then the
   # return address, then the argument, so the latter is at ESP+8

   movl 8(%esp), %ebx  

   incl (%ebx)  # need the (), since the argument was an address

   # restore value of EBX in the calling module
   pop %ebx

   ret
\end{Verbatim}

The comments here say we should save, and later restore, the EBX
register contents as they were prior to the call.  Actually, our {\bf
main()} here does not use EBX in this case (see the assembly language
version above), but since we might in the future want to call {\bf
addone()} from another program, it's best practice to protect EBX as we
have done here.

So, by the time we reach the {\bf movl} instruction, the stack looks
like this:

\begin{tabular}{ll}
~ & argument \\ 
~ & saved return address \\
ESP $\rightarrow$ & saved EBX value \\
towards 0 $\downarrow$ & ~
\end{tabular}


The {\bf movl} instruction thus needs to go 8 bytes deep into the stack
to pick up the argument, which was the address of the word to be
incremented.  Then the {\bf incl} instruction does the incrementing via
indirect addressing, again since the argument is an address.

The reader is urged to compile/assemble/link as indicated in the
comments in the {\bf .c} and {\bf .s} files above, and then execute the
program, to see that it does indeed work.  Seeing it actually happen
will increase your understanding!  

Note carefully that {\bf TryAddOne.c} and {\bf AddOne.s} are not
``programs'' by themselves.  They are each the source code for a
different portion of the same program, the executable file {\bf
tryaddone}.  This is no different from the situation in which we have C
source code in two {\bf .c} files, compile them into two {\bf .o} files,
and then link to make a single executable.

\subsection{Cleaning Up the Stack?}

Let's check whether the compiler fulfilled its ``civic
responsibility'' by cleaning up the stack after the call.  In the {\bf .s} file produced by
{\bf gcc -S} above, we saw this code:

\begin{Verbatim}[fontsize=\relsize{-2}]
        subl    $12, %esp
        pushl   $x
        call    addone
        addl    $16, %esp
        subl    $8, %esp
\end{Verbatim}

At first glance this is a bit odd.  Since we had only pushed one
argument, i.e. one word, the natural cleanup would have been to then pop
one word, as follows:

\begin{Verbatim}[fontsize=\relsize{-2}]
        call    addone
        addl    $4, %esp
\end{Verbatim}

(or use a {\bf pop} instruction instead of the {\bf addl}).   In other
words, we would expect the stack to shrink by 4 bytes after the call.
Yet the {\bf addl} generated by the compiler shrinks it by 16 bytes, not
4.

On the other hand, recall that with the instruction

\begin{Verbatim}[fontsize=\relsize{-2}]
        subl    $12, %esp
\end{Verbatim}

the compiler expanded the stack by 12 bytes {\it more than needed}
before the call.  Together with the 4 bytes of the pushed argument, that
comes to 12 + 4 = 16 bytes, exactly what the {\bf addl} removes.  So the
cleanup is correct after all: ESP is back where it was before the call
sequence began.  The {\bf subl \$8} that follows is not part of the
cleanup.  It is the corresponding padding for the next call, to {\bf
printf()}, whose two pushed arguments again bring the total to 16 bytes.

This padding is how GCC keeps the stack pointer at a multiple of 16
bytes at each call, which is its default stack alignment on Intel
machines.  The padding words themselves are never used.

\subsection{More Sections}

By the way, the compiler has used the {\bf .comm} directive to set up
storage for our global variable {\bf x}.  It asks for 4 bytes of space,
aligned on an address which is a multiple of 4.  (See Section
\ref{basics}.) The linker will later set things up so that the variable
{\bf x} will be stored in the {\bf .bss} section, which is like the {\bf
.data} section except that the data is uninitialized.  Recall that in
our program above, the declaration of {\bf x}, a global variable, was
simply 

\begin{Verbatim}[fontsize=\relsize{-2}]
int x;
\end{Verbatim}

If instead the declaration had been

\begin{Verbatim}[fontsize=\relsize{-2}]
int x = 28;
\end{Verbatim}

then {\bf x} would be in the {\bf .data} section.  

Also, the label {\bf .LC0} is in yet another kind of section, {\bf
.rodata} (``read-only data'').

The reader is strongly encouraged to run the Unix {\bf nm} command
on any executable file, say the compiled version of the C program here.
In the output of that command, symbols (names of functions and global
variables) are marked as T, D, B, R and so on, indicating that the
item is in the {\bf .text} section, {\bf .data} section, {\bf .bss}
section, {\bf .rodata} section, etc., and the addresses assigned to them
by the linker are shown.

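In classic 32-bit Linux, the stack section begins at address 0xbfffffff,
and as mentioned, grows toward 0.  The {\bf heap}, from which space is
allocated dynamically when your program calls {\bf malloc()} or invokes
the C++ operator {\bf new}, starts just past the end of the {\bf .bss}
section and grows away from 0, i.e. toward the stack.  (Exact addresses
vary on systems that randomize the address space.)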

\subsection{Multiple Arguments}

What if {\bf addone()} were to have two arguments?\footnote{Note that
when the C compiler compiles {\bf TryAddOne.c} above, it does not know
how many arguments {\bf addone()} really has, since {\bf addone()} is in
a separate source file.  (The compiler won't even know whether that
source file is C or assembly code.)  It simply knows how many arguments
we have used in this call, which need not be the same.}  A look at the
output of the compilation of the call to {\bf printf()} above shows that
arguments are pushed in reverse order, in this case the second before
the first.  So, if {\bf addone()} were to have two arguments, we would
have to write the code for the function accordingly, making use of the
fact that the first argument will be closer to the top of the stack than
the second.

\subsection{Nonvoid Return Values}
\label{nonvoid}

The above discussion presumes that the return value for the assembly
language function is {\bf void}.  If this is not the case, then the
assembly language function must place the return value in the EAX
register.  This is because the C compiler will place code after the call
to the function which picks up the return value from EAX.

That assumes that the return value fits in EAX, which is the case for
integer, character or pointer values.  It is not the case for the type
{\bf long long} in GCC, implemented on 32-bit Intel machines in 8 bytes.
If a function has type {\bf long long}, GCC will return the value in the
EDX:EAX pair, similar to the IMUL case.
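
The division of a 64-bit value between the two registers can be
mimicked in C.  In this sketch the struct {\bf edx\_eax} and the two
function names are hypothetical stand-ins; real code never needs them,
since the compiler handles the register pair itself.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(long long) == 8, "sketch assumes an 8-byte long long");

/* The two 32-bit halves of a 64-bit return value; the struct is just a
   stand-in for the register pair. */
struct edx_eax {
    uint32_t edx;   /* high-order 32 bits */
    uint32_t eax;   /* low-order 32 bits */
};

/* split a long long the way it would be spread over EDX:EAX */
struct edx_eax split_long_long(long long v)
{
    uint64_t u = (uint64_t)v;   /* work on the bit pattern */
    struct edx_eax r = { (uint32_t)(u >> 32), (uint32_t)u };
    return r;
}

/* reassemble the value, as the caller's code would after the call */
long long join_long_long(struct edx_eax r)
{
    return (long long)(((uint64_t)r.edx << 32) | r.eax);
}
```

For the value 0x123456789abcdef0, the high half 0x12345678 would go to
EDX and the low half 0x9abcdef0 to EAX.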

The case of {\bf float} return values is more complicated.  The Intel
chip has separate registers and stack for floating-point operations.
See Chapter \ref{chap:arithlog}.

\subsection{Calling C and the C Library from Assembly Language}

The same principles apply if one has a C function which one wishes to
call from assembly language.  We merely have to take into account the
order of parameters, etc., verifying by running {\bf gcc -S} on the C
function.

A bit more care must be taken if we wish to call a C library function,
e.g. {\bf printf()}, from assembly language.  Your call is the same, of
course, but the question is how to do the linking.  The easiest way to
do this is to actually use GCC, because it will automatically
handle linking in the C library, etc.

Here is an example, again from our first assembly language example of
summing up four array elements:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
.data

x:
      .long   1
      .long   5
      .long   2
      .long   18

sum:
      .long 0

fmt:  .string "%d\n"

.text

.globl main
main:
      movl $4, %eax  # EAX will serve as a counter for
                     # the number of words left to be summed
      movl $0, %ebx  # EBX will store the sum
      movl $x, %ecx  # ECX will point to the current
                     # element to be summed
top:  addl (%ecx), %ebx
      addl $4, %ecx  # move pointer to next element
      decl %eax  # decrement counter
      jnz top  # if counter not 0, then loop again
printsum:
      pushl %ebx
      pushl $fmt
      call printf
done: movl %eax, %eax
\end{Verbatim}

We wanted to print out the sum which was in EBX.  We know that {\bf
printf()} has as its first parameter the output format, which
is a character string, with the other parameters being the items to
print.  We only have one such item here, EBX.  So we push it, then
push the address of the format string, then call.

We then run it through GCC.

\begin{Verbatim}[fontsize=\relsize{-2}]
gcc -g -o sum sum.s
\end{Verbatim}

Given that we have chosen to use GCC (which, recall, we did in order to
avoid having to learn where the C library is, etc.), this forced us to
choose {\bf main} as the label for the entry point in our program,
instead of {\bf \_start}.  This is because GCC will link in the C
startup library.  The label {\bf \_start} is actually in that library,
and when our program here, or any other C program, begins execution, it
actually will be at that label.  The code at that label will do some
initialization and then will call {\bf main}.  So, we had better have a
label {\bf main} in our code!

Note that the C library function that you call may use EAX, ECX and EDX!
For example, most C library functions have return values, typically
success codes, and of course they are returned in EAX.  Recall from
Section \ref{calling} that if you write a C function, you don't have to
worry about the calling module having ``live'' values in EAX, ECX and
EDX---that is, if the calling module is in C.  Here it isn't.
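
For comparison, here is a rough C rendering of what the assembly program
above computes and prints.  It is only a sketch, not the output of any
compiler.  The names {\bf sum\_words()} and {\bf print\_sum()} are
hypothetical, and the loop matches the assembly version only when the
count is at least 1, since the assembly loop always executes its body
once.

\begin{Verbatim}[fontsize=\relsize{-2}]
#include <stdio.h>

/* Same loop as the assembly: count down while advancing a pointer. */
int sum_words(const int *p, int n)
{  int total = 0;      /* plays the role of EBX */
   while (n > 0) {     /* n plays the role of EAX */
      total += *p;     /* addl (%ecx), %ebx */
      p++;             /* addl $4, %ecx */
      n--;             /* decl %eax */
   }
   return total;
}

/* The call made at printsum: format string first, then the value. */
void print_sum(int total)
{  printf("%d\n", total);  }
\end{Verbatim}

Calling {\bf print\_sum(sum\_words(x,4))} on the four-word array {\bf x}
would play the role of the code from {\bf main} through the call to {\bf
printf()}.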

% \subsubsection{The Example Gives Us More Insight into C}
% 
% I once had a student who asked me if the following code would allocate
% space for two arrays:
% 
% \begin{Verbatim}[fontsize=\relsize{-2}]
% char *x,*y;
% ...
% u = "abcdef";
% \end{Verbatim}
% 
% Most of you will immediately say that space is allocated for the first
% array, but not the second, and you are correct.  But why?
% 
% When the compiler sees the string {\tt "abcdef"} in the source code,
% it will place an item in the {\bf .data} section, e.g.
% 
% \begin{Verbatim}[fontsize=\relsize{-2}]
% .data
% ...
% .LC0:
%         .string "abcdef"
% \end{Verbatim}
% 
% Then for the assignment of {\bf u}, the compiler will produce code like
% 
% \begin{Verbatim}[fontsize=\relsize{-2}]
% movl    $.LC0, u
% \end{Verbatim}
% 
% in the {\bf .text} section.
% 
% The key point is that the compiler did set up space when it saw the
% string.  
% 
\subsection{Local Variables}
\label{locals}

Not only does the compiler use the stack for storage of arguments, it
also stores local variables there.  Consider for example

\begin{Verbatim}[fontsize=\relsize{-2}]
int sum(int *x, int n)
{  int i=0,s=0;

   for (i = 0; i < n; i++)
      s += x[i];
   return s;
}
\end{Verbatim}

As mentioned in Chapter \ref{chap:datarep}, the locals will be stored in
``reverse'' order:  The two variables will be in adjacent words of
memory, but the first one, i.e. the one at the lower address, will be
{\bf s}, not {\bf i}.  As you will see below, this will be done by in
essence pushing {\bf i} and then pushing {\bf s} onto the stack.

Let's explore this by viewing the compiled code (in assembly language
form) for this by running {\bf gcc -S}.  Here is an excerpt from the
output:\footnote{Different versions of GCC may produce somewhat
different code.}

\begin{Verbatim}[fontsize=\relsize{-2}]
...
sum:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    $0, -8(%ebp)
        movl    $0, -4(%ebp)
\end{Verbatim}

You can see that the compiler first put in code to save EBP and copy ESP
to EBP, a standard operation which is explained below in Section
\ref{ebpuse}, and then put in code to expand the stack by 8 bytes for the
two local variables.  It then set both locals to 0.\footnote{Different
versions of {\bf gcc} may not produce the same code.  Always run {\bf
gcc -S} before doing any interfacing of C to assembly language.}  Note
that the local
variables are not really ``pushed'' onto the stack; there is simply room
made for them on the stack.  Note also that later in the compiled code
for {\bf sum()} (not shown here), the compiler needed to insert code to
pop off those local variables from the stack before the {\bf ret}
instruction; without this, the {\bf ret} would try to jump to the place
pointed to by the first local variable---total nonsense.

\checkpoint

\subsection{Use of EBP} 
\label{ebpuse}

\subsubsection{GCC Calling Convention}
\label{calling}

Recall that in Chapter \ref{chap:asm}, we mentioned
that you should not use ESP for general storage if you are using
subroutines.  It should now be clear why we noted that restriction.
Now, here is a new restriction:  If you are interfacing assembly
language to C/C++, you should avoid using the EBP register for general
storage.  In this section, we'll see why.

Note the first three instructions in the implementation of {\bf sum()}
above:

\begin{Verbatim}[fontsize=\relsize{-2}]
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
\end{Verbatim}

This {\bf prologue} is standard for C compilers on Intel machines.  The
old value of EBP is pushed on the stack; the current value of ESP is
saved in EBP; and the stack is expanded (by decreasing ESP) to allocate
space for the locals.   

If you use GCC, then GCC will make sure that the calling module will not
have any ``live'' values in EAX, ECX or EDX.  So, the called module need
not save the values in any of these registers.  But if the called module
uses ESI, EDI or EBX, the called module must save these in the prologue
too.

These are referred to in the computer world as {\bf calling conventions},
a set of guidelines to follow in writing subroutines for a given
machine under a given compiler.

\subsubsection{The Stack Frame for a Given Call}

Consider the following scenario: {\bf f()} calls {\bf g()}, which in
turn calls {\bf h()}.  At any given time, the stack can be viewed as
being broken down in a portion for each current function call.  Each
function in that call chain will be accessing a certain section of the
stack, called the {\bf stack frame} for that function.  So in the above
scenario, {\bf f()} would have a stack frame, below which (i.e. toward
memory address 0) would be {\bf g()}'s stack frame, below which in turn
would be that for {\bf h()}.

Consider {\bf g()} above.  The stack frame for {\bf g()} looks like this
(in order of decreasing addresses):

\begin{itemize}

\item [(a)] a word containing a pointer to {\bf f()}'s stack frame 

\item [(b)] {\bf g()}'s local variables

\item [(c)] extra space used by {\bf g()} for scratch space via
pushes, if any

\item [(d)] arguments for the call to {\bf h()}

\item [(e)] return address to get back to {\bf g()} from {\bf h()}

\end{itemize}

Portions (a) and (b) exist immediately after the prologue of {\bf g()} 
is executed.  Portion (c) may or may not be added later,
depending on whether the author of this function decides to do some
pushes.  Portions (d) and (e) will come when {\bf h()} is called.

At any given time during the execution of {\bf g()} (after the
prologue), EBP points to the beginning of {\bf g()}'s stack frame, and
ESP points to the end of that frame.  As you can see, the end of the
frame can move during this process.

Here is what {\bf g()}'s stack frame (and a bit of {\bf f()}'s and {\bf
h()}'s) will consist of, just after the CALL is executed but before {\bf
h()}'s prologue begins executing: 

\begin{tabular}{ll}
~ & address in {\bf f()} to return to when {\bf g()} is done (end of
{\bf f()}'s frame) \\
EBP $\rightarrow$ & address of {\bf f()}'s stack frame (start of {\bf
g()}'s frame) \\
~ & {\bf g()}'s first declared local variable \\
~ & {\bf g()}'s second declared local variable \\
~ & ... \\
~ & {\bf g()}'s last declared local variable \\
~ & any stack space {\bf g()} is currently using as ``scratch area'' \\
~ & last argument in {\bf g()}'s call to {\bf h()} \\
~ & ... \\
~ & second argument in {\bf g()}'s call to {\bf h()} \\
~ & first argument in {\bf g()}'s call to {\bf h()} \\
ESP $\rightarrow$ & address in {\bf g()} to return to when {\bf h()} is
done (end of {\bf g()}'s frame) \\
~ & address of {\bf g()}'s stack frame (start of {\bf h()}'s frame; not
yet pushed at this point) \\
towards 0 $\downarrow$ & ~
\end{tabular}

Of course, after we execute {\bf h()}'s prologue, ESP will change again,
and will demarcate the end of {\bf h()}'s stack frame (and EBP will
demarcate the beginning of that frame).  Once {\bf h()} finishes
execution, the stack will shrink back again, and ESP will increase and
demarcate the end of {\bf g()}'s frame again.  
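
The layout above can be summarized by a little arithmetic.  The sketch
below assumes that every argument and local variable occupies one 4-byte
word and that the compiler uses the unoptimized layout shown in this
chapter; other compilers or optimization levels may differ.  The helper
names {\bf arg\_offset()} and {\bf local\_offset()} are hypothetical.

\begin{Verbatim}[fontsize=\relsize{-2}]
/* Offset from EBP of argument k (k = 0 for the first argument).
   0(%ebp) holds the saved EBP and 4(%ebp) the return address,
   so the arguments start at 8(%ebp). */
int arg_offset(int k)
{  return 8 + 4 * k;  }

/* Offset from EBP of local variable k (k = 0 for the first declared),
   assuming the locals sit immediately below the saved EBP. */
int local_offset(int k)
{  return -4 * (k + 1);  }
\end{Verbatim}

For {\bf sum()} in Section \ref{locals}, this gives 8 and 12 for {\bf x}
and {\bf n}, and $-4$ and $-8$ for {\bf i} and {\bf s}.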

\subsubsection{The Stack Frames Are Chained}

As noted above, in any function's stack frame, the first (i.e.
highest-address) element will contain a pointer to the beginning of the
caller's stack frame.  But the beginning of the caller's frame will in
turn point to the beginning of {\it its} caller's frame, and so on.  In
that sense, the beginning elements of the various stack frames form a
linked list, enabling one to trace through the stack information in a
chain of nested calls.  And since we know that EBP points to the current
frame, we can use that as our starting point in traversing this chain.

The authors of GDB made use of this fact when they implemented GDB's
{\bf bt} (``backtrace'') command.  Let's review that command.  Consider
the following example:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
void h(int *w)
{  int z;
   *w = 13 * *w;
}

int g(int u)
{  int v;
   h(&u);
   v = u + 12;
   return v;
}

main()
{  int x,y;
   x = 5;
   y = g(x);
}
\end{Verbatim}

Let's execute it in GDB:

\begin{Verbatim}[fontsize=\relsize{-2}]
(gdb) b h
Breakpoint 1 at 0x804833a: file bt.c, line 3.
(gdb) r
Starting program: /root/50/a.out

Breakpoint 1, h (w=0xfef5d2ac) at bt.c:3
3          *w = 13 * *w;
(gdb) bt
#0  h (w=0xfef5d2ac) at bt.c:3
#1  0x08048360 in g (u=5) at bt.c:8
#2  0x0804839c in main () at bt.c:16
\end{Verbatim}

We are now in {\bf h()} (shown as frame 0 in the {\bf bt} output),
having called it from location 0x08048360 in {\bf g()} (frame 1), which
in turn had been called from location 0x0804839c in {\bf main()} (frame
2).  

And GDB allows us to temporarily change our context to another frame,
say frame 1, and take a look around:

\begin{Verbatim}[fontsize=\relsize{-2}]
(gdb) f 1
#1  0x08048360 in g (u=5) at bt.c:8
8          h(&u);
(gdb) p u
$1 = 5
(gdb) p v
$2 = 0
\end{Verbatim}

Make sure you understand how GDB---which, remember, is itself a
program---is able to do this.  It does it by inspecting the stack frames
of the various functions, and the way it gets from one frame to another
within the stack is by the fact that the first element in a function's
stack frame is a pointer to the first element of the caller's stack
frame.

By using the structure shown above, the compiler is ensuring that we
will always be able to get to return addresses, arguments and so on of
``ancestral'' calls.
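
The traversal can be modeled in ordinary C.  The sketch below does not
read the real stack; it assumes a hypothetical {\bf struct frame} holding
the two words found at the start of each frame (the saved EBP, then the
return address) and simply follows the links, much as a backtrace would.
The function name {\bf walk\_frames()} is likewise hypothetical.

\begin{Verbatim}[fontsize=\relsize{-2}]
#include <stddef.h>

/* Model of the two words at the start of a frame in the 32-bit layout:
   saved EBP at 0(%ebp), return address at 4(%ebp). */
struct frame {
   struct frame *caller;   /* saved EBP: start of the caller's frame */
   unsigned ret;           /* return address into the caller */
};

/* Follow the chain from cur up to (not including) base, storing the
   return addresses in out[]; returns the number stored (at most max). */
int walk_frames(const struct frame *cur, const struct frame *base,
                unsigned *out, int max)
{  int n = 0;
   while (cur != NULL && cur != base && n < max) {
      out[n++] = cur->ret;
      cur = cur->caller;
   }
   return n;
}
\end{Verbatim}

Stopping at a known base frame, rather than walking until the links run
out, is a design choice made here to keep the sketch simple.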

%  The C library includes functions that enable the C/C++ programmer to
%  access, for instance, {\bf v()}'s arguments from within {\bf w()}.  The
%  need for this arises when a function (``{\bf v()}'') has a variable
%  number of arguments, which is common in C/C++.  For example, the C
%  library function {\bf printf()} is of this nature.  In
%  
%  \begin{Verbatim}[fontsize=\relsize{-2}]
%  printf("%d\n",x);
%  printf("%d %c\n",y,w);
%  \end{Verbatim}
%  
%  the first call has two arguments while the second call has three.
%  In order to handle this, there is a set of functions (actually macros)
%  in the C library, which you access via
%  
%  \begin{Verbatim}[fontsize=\relsize{-2}]
%  #include <stdarg.h>
%  \end{Verbatim}
%  
%  One such function (actually a macro) is {\bf va\_arg()}.  It plays a
%  role like {\bf w()} above, while here {\bf printf()} is like {\bf v()}
%  above.
%  
%  These functions will rely upon the fact that no matter how the compiler
%  has stored the local variables, we know how to get to the arguments,
%  which functions like {\bf va\_arg()} need to do.  We simply go to
%  c(EBP)+8, etc.

\subsubsection{ENTER and LEAVE Instructions}

We've noted before that after return from a function call, the caller
should clean up the stack, i.e. remove the arguments it had pushed onto
the stack before the call.  Similarly, within the function itself there
should be a cleanup, to remove the locals and make sure that EBP is
adjusted properly.  That means:  restoring ESP and EBP to the values
they had before the prologue: 

\begin{Verbatim}[fontsize=\relsize{-2}]
movl %ebp, %esp
popl %ebp
\end{Verbatim}

Let's call this the ``epilogue.''  The prologue and epilogue code
sequences are so common that Intel included ENTER and LEAVE instructions
in the chip to do them.  The above prologue, a three-instruction sequence
which set aside 8 bytes of space for the locals, would be done, for
instance, by the single instruction

\begin{Verbatim}[fontsize=\relsize{-2}]
enter $8, $0
\end{Verbatim}

(The second operand is the lexical nesting level, which matters only for
languages with nested procedures; C code uses 0.)

The two-instruction epilogue above can be done with the single
instruction

\begin{Verbatim}[fontsize=\relsize{-2}]
leave
\end{Verbatim}

\subsubsection{Application of Stack Frame Structure}

The following code implements a ``backtrace'' like GDB's.  A user
program calls {\bf getchain()}, which will return a list of the return
addresses of all the current stack frames.  It is required that the user
program first call the function {\bf initbase()}, which determines the
address of {\bf main()}'s stack frame and stores it in {\bf base}.  It
is assumed that {\bf getchain()} will not be called from {\bf main()}
itself.

The C prototype is 

\begin{Verbatim}[fontsize=\relsize{-2}]
void getchain(int *chain, int *nchain)
\end{Verbatim}

with the space for the array {\bf chain} and its length {\bf nchain}
provided by the caller.

In reading the code, keep in mind this picture of the stack after
the three pushes are done: 

\begin{tabular}{ll}
~ & address of {\bf nchain} \\ 
~ & address of {\bf chain} \\ 
~ & saved return address from {\bf getchain()} back to the caller \\
~ & saved EAX \\
~ & saved EBX \\
ESP $\rightarrow$ & saved ECX \\
towards 0 $\downarrow$ & ~
\end{tabular}

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
.data
base:  # address of main()'s stack frame 
   .long 0

.text
.globl initbase, getchain
initbase:  # initializes base
   movl %ebp, base
   ret
getchain:
   # register usage:
   #    EAX will point to the current element of chain to be filled
   #    EBX will point to the frame currently being examined
   push %eax
   push %ebx
   push %ecx
   # get address of chain
   movl 16(%esp), %eax  
   # get return address from this subroutine, getchain() 
   movl 12(%esp), %ecx
   # write it to the first element of the chain
   movl %ecx, (%eax)
   addl $4, %eax
   # EBP still points to the caller's frame, perfect since we don't want
   # to include the frame for this subroutine
   movl %ebp, %ebx
   # loop through all the frames
top:
   # if this is main()'s frame, then done so leave 
   cmpl %ebx, base
   jz done
   # get return address for this frame
   movl 4(%ebx), %ecx
   # write it to chain
   movl %ecx, (%eax)
   addl $4, %eax
   # go to next frame
   movl (%ebx), %ebx
   jmp top
done:
   # find nchain, by subtracting start of chain from EAX and adjusting
   subl 16(%esp), %eax
   shrl $2, %eax
   movl 20(%esp), %ebx
   movl %eax, (%ebx)
   pop %ecx
   pop %ebx
   pop %eax
   ret
\end{Verbatim}

Example of calls:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
int chain[5],nchain;

void h(int *w)
{  int z;
   int chain[10],nchain;
   z = *w +28;
   getchain(chain,&nchain);
   *w = 13 * z;
}

int g(int u)
{  int v;
   h(&u);
   v = u + 12;
   getchain(chain,&nchain); 
   return v;
}

main()
{  int x,y;
   initbase();
   x = 5;
   y = g(x);
}
\end{Verbatim}

\checkpoint

\subsection{The LEA Instruction Family}

The name of the LEA instruction family stands for Load Effective
Address.  Its action is to compute the memory address of the first
operand, and store that address in the second.  

For example, the instruction

\begin{Verbatim}[fontsize=\relsize{-2}]
leal    -4(%ebp), %eax
\end{Verbatim}

computes -4 + c(EBP) and places the result in EAX.  If we had a single
local variable {\bf z} within a C function, the above instruction would
be computing {\bf \&z} and placing it in EAX.  The compiler often uses
this technique.
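
To make the arithmetic concrete, here is a C sketch of what two common
forms of LEA compute.  No memory is read in either case; only an address
is calculated.  The function names are hypothetical, and a 32-bit
unsigned type is assumed to stand in for a register.  The second form
models the scaled addressing mode, as in {\tt leal 0(,\%eax,4), \%edx},
which compilers often use to turn an array index into a byte offset.

\begin{Verbatim}[fontsize=\relsize{-2}]
#include <stdint.h>

/* What "leal disp(%base), %dst" computes: base + disp. */
uint32_t lea_disp(uint32_t base, int32_t disp)
{  return base + (uint32_t) disp;  }

/* What "leal 0(,%index,4), %dst" computes: index * 4. */
uint32_t lea_scaled(uint32_t index)
{  return index * 4u;  }
\end{Verbatim}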

\subsection{The Function main() IS a Function, So It Too Has a Stack
Frame}
\label{mainisafunction}

It's important to keep in mind that {\bf main()} is a function too, so
it has information stored on the stack.  Recall that a typical
declaration of {\bf main()} is

\begin{Verbatim}[fontsize=\relsize{-2}]
int main(int argc, char **argv)
\end{Verbatim}

This clearly shows that {\bf main()} is a function.  Note that we've
given {\bf main()} an {\bf int} return value, typically used as a
success code, again illustrating the fact that {\bf main()} is a
function.

Let's take a closer look.  When you compile your program, the compiler
(actually linker) puts in some C library code which is used for startup.
Just like your assembly language programs, there is a label there, {\bf
\_start}, at which execution begins.  The code there will prepare the
stack, including the {\bf argc} and {\bf argv} arguments, and will then
call {\bf main()}.  

Note by the way that we are not required to give these formal parameters
the names {\bf argc} and {\bf argv}.  It is merely customary.  As a
veteran C programmer, you know that you can name the formal parameters
whatever you want.  When any subroutine is called, the caller, in this
case the C library startup code, neither knows nor cares what the names
of the formal parameters are in the module in which the subroutine is
defined.  

Accordingly, upon entry to {\bf main()}, the stack will consist of the
return address, then {\bf argc}, then {\bf argv}, and space will be made
on the stack for any local variables {\bf main()} may have.  To
illustrate that, consider the code

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
main(int argc, char **argv)
{  int i;
   printf("%d  %s\n",argc,argv[1]);
}
\end{Verbatim}

After applying {\bf gcc -S} to this, we get

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
	.file	"argv.c"
	.section	.rodata
.LC0:
	.string	"%d  %s\n"
	.text
	.align 2
.globl main
	.type	main,@function
main:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	andl	$-16, %esp
	movl	$0, %eax
	subl	%eax, %esp
	subl	$4, %esp
	movl	12(%ebp), %eax
	addl	$4, %eax
	pushl	(%eax)
	pushl	8(%ebp)
	pushl	$.LC0
	call	printf
	addl	$16, %esp
	leave
	ret
.Lfe1:
	.size	main,.Lfe1-main
	.ident	"GCC: (GNU) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)"
\end{Verbatim}

We see that {\bf argc} and {\bf argv} are 8 and 12 bytes from EBP,
respectively.  This makes sense, since from Section \ref{ebpuse} we
know that once the first two instructions of {\bf main()}'s prologue have
executed, the stack, listed from higher addresses down to the place
pointed to by EBP, will look like this:

\begin{Verbatim}[fontsize=\relsize{-2}]
argv
argc
return address
saved EBP
\end{Verbatim}

At the end of {\bf main()}, the GCC compiler will put the return value
in EAX, and will produce a LEAVE instruction.

Let's look at this from one more angle, by running this program via GDB:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
% gcc -S argv.c
% as --gstabs -o argv.o argv.s
% gcc -g argv.o
% gdb -q a.out
(gdb) b 10
Breakpoint 1 at 0x8048328: file argv.s, line 10.
(gdb) r abc def
Starting program:
/fandrhome/matloff/public_html/matloff/public_html/50/PLN/a.out abc def

Breakpoint 1, main () at argv.s:10
10              pushl   %ebp
Current language:  auto; currently asm
(gdb) x/3x $esp
0xbfffe33c:     0x42015967      0x00000003      0xbfffe384
(gdb) x/3x 0xbfffe384
0xbfffe384:     0xbfffe95c      0xbfffe99c      0xbfffe9a0
(gdb) x/s 0xbfffe95c
0xbfffe95c:
"/fandrhome/matloff/public_html/matloff/public_html/50/PLN/a.out"
(gdb) x/s 0xbfffe99c
0xbfffe99c:      "abc"
(gdb) x/s 0xbfffe9a0
0xbfffe9a0:      "def"
\end{Verbatim}

So we see that the return address to the C library is 0x42015967, {\bf
argc} is 3 and {\bf argv} is 0xbfffe384.  The latter should be a pointer
to an array of three strings, which is confirmed in the subsequent GDB
commands.

By the way, the instruction

\begin{Verbatim}[fontsize=\relsize{-2}]
andl	$-16, %esp
\end{Verbatim}

forces the top of the stack to be located at an address which is a
multiple of 16.  (It zeros out the last 4 bits.)  The hardware structure
makes memory accesses at such addresses most efficient.
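
The same masking can be written in C.  In this sketch the function name
{\bf align\_down\_16()} is hypothetical, and a 32-bit unsigned type is
assumed to stand in for ESP.

\begin{Verbatim}[fontsize=\relsize{-2}]
#include <stdint.h>

/* Clear the low 4 bits of addr, rounding it down to a multiple of 16.
   As a 32-bit pattern, -16 is 0xFFFFFFF0. */
uint32_t align_down_16(uint32_t addr)
{  return addr & (uint32_t) -16;  }
\end{Verbatim}

Rounding down rather than up is what we want here, since the stack grows
toward 0 and the extra bytes simply become unused space in the frame.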

\checkpoint

By the way, the C library code provides a third argument to {\bf main()}
as well, which is a pointer to the environment variables (current
working directory, executable search path, username, etc.).  To receive
it, the declaration would be

\begin{Verbatim}[fontsize=\relsize{-2}]
int main(int argc, char **argv, char **envp)
\end{Verbatim}
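
Like {\bf argv}, the environment array ends with a null pointer, so it
can be scanned without knowing its length in advance.  Here is a minimal
sketch, with the hypothetical helper name {\bf count\_strings()}:

\begin{Verbatim}[fontsize=\relsize{-2}]
#include <stddef.h>

/* Counts the strings in a NULL-terminated array such as argv or envp. */
int count_strings(char **v)
{  int n = 0;
   while (v[n] != NULL)
      n++;
   return n;
}
\end{Verbatim}

Inside the three-argument {\bf main()} above, {\bf count\_strings(envp)}
would give the number of environment variables, and {\bf
count\_strings(argv)} would equal {\bf argc}.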

\subsection{Once Again, There Are No Types at the Hardware Level!}

You have been schooled---properly so---in your beginning programming
classes about the importance of {\bf scope}, meaning which variables are
accessible or inaccessible from which parts of a program.  But if a
variable is inaccessible from a certain point in a program, that is
merely the compiler doing its gatekeeper work for that language.  It is
NOT a restriction by the hardware.  There is no such thing as scope at
the hardware level.  ANY instruction in a program---remember, all C/C++
and assembly code gets translated to machine instructions---can access
ANY data item ANYWHERE in the program.

In C++, you were probably taught a slogan of the form 

\begin{quote}

``A {\bf private} member of a class cannot be
accessed from anywhere outside the class.''  

\end{quote}

But that is not true.  The correct form of the statement should be

\begin{quote}

``The compiler {\it will refuse to compile} any C++ code you write which
attempts to access {\it by name} a {\bf private} member of a class from
anywhere outside the class.''  

\end{quote}

Again, the gatekeeper here is the compiler, not the hardware.
The compiler's actions are desirable, because the notion of scope helps
us to organize our data, but it has no physical meaning.  Consider this
example:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
#include <iostream>
using namespace std;

class c {
   private:
      int x;
   public:
      c();  // constructor
      // printx() prints out x
      void printx() { cout << x << endl; }
};

c::c()
{  x = 5;  }

int main(int argc, char *argv[])
{  c ci;
   ci.printx();  // prints 5
   // now point p to ci, thus to the first word at ci, i.e. x
   int *p = (int *) &ci;
   *p = 29;  // change x to 29
   ci.printx();  // prints 29
}
\end{Verbatim}

The point is that the member variable {\bf x} in the class {\bf c},
though {\bf private}, is nevertheless in memory, and we can get to any
memory location by setting a pointer variable to that location.  That
is what we have done in this example.  We've managed to change the value
of {\bf x} in an instance of the class through code which is outside the
class.

The point also applies to local variables.  We can actually access a
local variable in one function from code in another function.  The
following example demonstrates that (make sure to review Section
\ref{ebpuse} before reading the example):

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
void g()
{  int i,*p;
   p = &i;
   p = p + 1;  // p now points to main()'s EBP
   p = (int *) *p;  // p now equals main()'s EBP value
   p = p - 1;  // p now points to main()'s x
   *p = 29;  // changes x to 29
}

main()
{  int x = 5;  // x is local to main()
   g();
   printf("%d\n",x);  // prints out 29
}
\end{Verbatim}

\checkpoint

\subsection{What About C++?}

If the calling program is C++ instead of C, you must inform the compiler
that the assembly language routine is ``C style,'' by inserting 

\begin{Verbatim}[fontsize=\relsize{-2}]
extern "C" void addone(int *);
\end{Verbatim}

(for our earlier example) in the C++ source file.  You need to do this,
because your assembly language routine will be in the C style, i.e.
utilize C conventions such as that concerning EBP above.

In the case of {\bf instance} (i.e. non-{\bf static}) functions, note
that the {\bf this} pointer is essentially an argument too.  The
convention is that it is pushed last, i.e. it is treated as the first
argument.

\subsection{Putting It All Together}

To illustrate a number of the concepts we've covered here, let's look at
the full assembly code produced from the function {\bf sum()} in Section
\ref{locals}.  Here is the original C and then the compiled code:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
int sum(int *x, int n)
{  int i=0,s=0;

   for (i = 0; i < n; i++)
      s += x[i];
   return s;
}
\end{Verbatim}

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
        .file   "sum.c"
        .text
        .align 2
.globl sum
        .type   sum,@function
sum:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    $0, -4(%ebp)
        movl    $0, -8(%ebp)
        movl    $0, -4(%ebp)
.L2:
        movl    -4(%ebp), %eax
        cmpl    12(%ebp), %eax
        jl      .L5
        jmp     .L3
.L5:
        movl    -4(%ebp), %eax
        leal    0(,%eax,4), %edx
        movl    8(%ebp), %eax
        movl    (%eax,%edx), %edx
        leal    -8(%ebp), %eax
        addl    %edx, (%eax)
        leal    -4(%ebp), %eax
        incl    (%eax)
        jmp     .L2
.L3:
        movl    -8(%ebp), %eax
        leave
        ret
.Lfe1:
        .size   sum,.Lfe1-sum
        .ident  "GCC: (GNU) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)"
\end{Verbatim}

We see that the compiler has first placed the standard prologue at the
beginning of the code, allowing for two local variables:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
sum:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
\end{Verbatim}  

It then initializes both of those locals to 0.

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
        movl    $0, -4(%ebp)
        movl    $0, -8(%ebp)
\end{Verbatim}

Our code sets {\bf i} to 0 twice, and so does the compiled code (the
full listing repeats the store to {\tt -4(\%ebp)} just before {\bf
.L2}), since we didn't ask the compiler to optimize.

Our {\bf for} loop compares {\bf i} to {\bf n}, which is done
here:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
       movl    -4(%ebp), %eax
       cmpl    12(%ebp), %eax
       jl      .L5
       jmp     .L3
\end{Verbatim}

The code must add {\bf x[i]} to the sum:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
.L5:
        movl    -4(%ebp), %eax
        leal    0(,%eax,4), %edx
        movl    8(%ebp), %eax
        movl    (%eax,%edx), %edx
        leal    -8(%ebp), %eax
        addl    %edx, (%eax)
\end{Verbatim}

Note the use of the LEA instruction and advanced addressing modes.

We need to increment {\bf i} and go to the top of the loop:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
        leal    -4(%ebp), %eax
        incl    (%eax)
        jmp     .L2
\end{Verbatim}

When the function is finished, it needs to put the sum in EAX, as
described in Section \ref{nonvoid}, clean up the stack and return:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
.L3:
        movl    -8(%ebp), %eax
        leave
        ret
\end{Verbatim}

Note the use of the LEAVE instruction.

Now, what if we declare that local variable {\bf s} as {\bf static}?

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
int sum(int *x, int n)
{  int i=0; static int s=0;

   for (i = 0; i < n; i++)
      s += x[i];
   return s;
}
\end{Verbatim}

Recall that this means that the variable will retain its value from one
call to the next.\footnote{Our code initializes {\bf s} to 0, but this
is done only once, as you can see from the assembly language here.} That
means that the compiler can't use the stack for storage of this variable,
as it would likely get overwritten by later subroutine calls.  So,
it must be stored in a {\bf .data} section:

\checkpoint

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
	.file	"Sum.c"
	.data
	.align 4
	.type	s.0,@object
	.size	s.0,4
s.0:
	.long	0
	.text
	.align 2
.globl sum
	.type	sum,@function
sum:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$4, %esp
	movl	$0, -4(%ebp)
	movl	$0, -4(%ebp)
.L2:
	movl	-4(%ebp), %eax
	cmpl	12(%ebp), %eax
	jl	.L5
	jmp	.L3
.L5:
	movl	-4(%ebp), %eax
	leal	0(,%eax,4), %edx
	movl	8(%ebp), %eax
	movl	(%eax,%edx), %eax
	addl	%eax, s.0
	leal	-4(%ebp), %eax
	incl	(%eax)
	jmp	.L2
.L3:
	movl	s.0, %eax
	leave
	ret
.Lfe1:
	.size	sum,.Lfe1-sum
	.ident	"GCC: (GNU) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)"
\end{Verbatim}

You can tell from the code

\begin{Verbatim}[fontsize=\relsize{-2}]
	movl	s.0, %eax
	leave
	ret
\end{Verbatim}

that our C variable {\bf s} is being stored at a label {\bf s.0} in
the {\bf .data} section.
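
The effect of {\bf static} can also be observed directly in C, without
reading any assembly.  In this sketch (the function name {\bf
accumulate()} is hypothetical), the total survives from one call to the
next because it does not live in the function's stack frame.

\begin{Verbatim}[fontsize=\relsize{-2}]
/* Adds x to a running total that persists across calls. */
int accumulate(int x)
{  static int total = 0;   /* initialized once, not on each call */
   total += x;
   return total;
}
\end{Verbatim}

Calling {\bf accumulate(5)} and then {\bf accumulate(7)} returns 5 and
then 12; with an ordinary local in place of the {\bf static} one, the
second call would return 7.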

\section{Subroutine Calls/Returns Are ``Expensive''}

On most machines, subroutine calls and returns add overhead.  Not only
does it mean that extra instructions must be executed, but also stack
access, being memory access, is slow.  So, it makes sense to ask how
CPUs could be designed to speed this up.

One approach would be to make a special cache for the stack.  Note that
since the stack is in memory, some of it may be in an ordinary cache,
but by devoting a special cache to the stack, we would decrease the
cache miss rate for the stack.  

But a more aggressive approach would be to design the CPU so that the
top few stack elements are in the CPU itself.  This is done in Sun
Microsystems' SPARC chip, and was done back in the 60s and 70s on
Burroughs mainframes.  This would be better than a special cache,
since the latter, with its complex circuitry for miss processing and so
on, would have more delay.

On the other hand, space on a chip is precious.  Such usage of the space
may not be as good as some other usage.

\section{Debugging Assembly Language Subroutines}

\subsection{Focus on the Stack}

At first, debugging assembly language subroutines would seem to be
similar to debugging assembly language in general, and for that matter,
debugging any kind of code.  However, what makes the assembly-language
subroutine case special is the use of the stack.  Many bugs involve
errors in accessing the stack, such as failure to restore a register
value which had been saved on the stack.

Recall that the essence of good debugging is \underline{confirmation}.
One uses the debugger, say {\bf ddd}, to step through one's code line by
line, and in each line \underline{confirms} that our stored data
(variables in C/C++, register and memory values in assembly language)
have the values we expect them to.  In stepping through our code, we
also \underline{confirm} that we execute the lines we expect to be
executed, such as a conditional branch being taken when we expect it to
be taken.  Eventually this process will lead us to a line in which we
find that our confirmation fails.  This then will pinpoint where our bug
is, and we can start to analyze what is wrong with that line.

For assembly language subroutines, our confirmation process consists of
\underline{confirming} that the contents of the stack are exactly as we
expect them to be.  Eventually we will find a line of code at which that
confirmation fails, and we then will have pinpointed the location of our
bug, and we can start to analyze what is wrong with that line.

To do that, you have to know how to inspect the stack from within your
debugger, e.g. {\bf ddd}.  The key point is that the stack is part of
memory, and the debugger allows you to inspect memory.  So, first find
out from the debugger the current value of ESP, and then inspect a few
words of memory starting at wherever ESP points to, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
x/4w $esp
\end{Verbatim}

to inspect the first four words of the stack in GDB.\footnote{You can
now see an advantage of designing the hardware so that stacks grow
toward 0.  The top of the stack is then at the lowest address, so a
memory dump starting at ESP lists the stack elements in order, top
element first, making it easier to inspect the stack.}

Note that if during a debugging session you run your program several
times without exiting GDB,\footnote{And this is the proper way to do it.
You SHOULD stay in GDB from one run to the next, so as to retain your
breakpoints etc.} the stack need not start at the same address on each
run, for example if the program's command-line arguments or environment
change.  Thus any stack addresses which you jot down for later use may
become incorrect.

\subsection{A Special Consideration When Interfacing C/C++ with Assembly
Language}

When you are interfacing C/C++ to assembly language and while debugging
you reach a function call in your C/C++ code, it is often useful to view
the assembly language, especially the addresses of the instructions.
You may find GDB's {\bf disassemble} command useful in this regard.  It
reverse-translates the machine code for the program into assembly
language (regardless of whether the program source was originally
assembly language or C/C++), and lists the assembly language
instructions and their addresses.  Note that the latter will be
addresses with respect to the linker's assignments of the {\bf .text}
section location, not offsets as in the output of {\bf as -a}.  Also
note that {\bf .data} section symbols won't appear; the actual addresses
will show up.

The {\bf disassemble} command has several formats.  If you run it
without arguments, it will disassemble the code near your current
location.  You can also run it on any label in the {\bf .text} section,
e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
(gdb) disas _start
\end{Verbatim}

and

\begin{Verbatim}[fontsize=\relsize{-2}]
(gdb) disas main
\end{Verbatim}

In the latter case, where we have a function name, the entire function
will be disassembled.

You can also specify a range of memory addresses, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
(gdb) disas 0x08048083,0x08048099
\end{Verbatim}

\section{Macros}

Subroutines---functions, at the C/C++ level---are really good for
writing one's code in a modular manner, which is a highly laudable goal.
However, they are costly in terms of performance.

A subroutine/function call involves one or more pushes onto the stack,
and it involves a jump.  The pushes take time, as does the jump.  This
causes a slowdown.  And since pushes and instruction fetches involve the
memory, the call increases the number of memory accesses, and thus
increases the chance of a cache miss, the latter being a big blow to
performance.

A {\bf macro} is like a subroutine in the sense that it makes one's
programming more modular, but without such overhead.  Instead, it is
merely yet another ``abbreviation'' for the clerk, albeit one that makes
our code modular.

To see what this means, consider this example from Chapter \ref{chap:asm}:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
.data  
x:
      .long   1
      .long   5
      .long   2
      .long   18
sum:
      .long 0
.text  
.globl _start
_start:
      movl $4, %eax  
      movl $0, %ebx  
      movl $x, %ecx 
top:  addl (%ecx), %ebx
      addl $4, %ecx  
      decl %eax  
      jnz top  
done: movl %ebx, sum  
\end{Verbatim}

Those first three lines of executable code,

\begin{Verbatim}[fontsize=\relsize{-2}]
      movl $4, %eax  
      movl $0, %ebx  
      movl $x, %ecx 
\end{Verbatim}

perform various initializations.  To make our program more modular, we
could make a subroutine out of them, say as {\bf init()} in Section 
\ref{full} (either with or without the argument).  This would be fine,
but it would slow down the execution of the program a bit, as noted
above.

An alternative would be to convert those three instructions to a macro:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
.macro init
      movl $4, %eax  
      movl $0, %ebx  
      movl $x, %ecx 
.endm

.data  
x:
      .long   1
      .long   5
      .long   2
      .long   18
sum:
      .long 0

.text  
.globl _start
_start:
      init
top:  addl (%ecx), %ebx
      addl $4, %ecx  
      decl %eax  
      jnz top  
done: movl %ebx, sum  
\end{Verbatim}

Though the ``init'' at {\bf \_start} has the look of a subroutine call,
it really isn't.  Instead, when the assembler sees this line at {\bf
\_start}, it replaces that line with the three lines of code defined for
``init'' at the top of the file.  In our definition of this macro {\bf
init}, starting with {\bf .macro} and ending with {\bf .endm}, we are
saying to the clerk, ``Clerk, if you ever see a line in my assembly file
here that consists of {\bf init}, treat that as if I had typed the
following code at that point...''

In other words, the macro version of the program will produce exactly
the same machine code as did the original nonmodular version.  We can
see this by running both files through {\bf as -a}; here are excerpts
from the outputs of this command from the original and macro versions of
the program:

\begin{Verbatim}[fontsize=\relsize{-2}]
...
  23                    _start:
  24 0000 B8040000            movl $4, %eax  
  24      00
  25 0005 BB000000            movl $0, %ebx  
  25      00
  26 000a B9000000            movl $x, %ecx 
  26      00
  27 000f 0319          top:  addl (%ecx), %ebx
...
\end{Verbatim}

\begin{Verbatim}[fontsize=\relsize{-2}]
...
  20                    _start:
  21 0000 B8040000            init
  21      00BB0000 
  21      0000B900 
  21      000000
  22 000f 0319          top:  addl (%ecx), %ebx
...
\end{Verbatim}

So, both versions of the program would produce the same {\bf .o} file.  By
contrast, the subroutine version produces different code:

\begin{Verbatim}[fontsize=\relsize{-2}]
...
  23                    _start:
  24 0000 E80E0000            call init
  24      00
  25 0005 0319          top:  addl (%ecx), %ebx
...
  31                    init:
  32 0013 B8040000            movl $4, %eax  
  32      00
  33 0018 BB000000            movl $0, %ebx 
  33      00
  34 001d B9000000            movl $x, %ecx
  34      00
  35 0022 C3                  ret
\end{Verbatim}

Note the machine code for the {\bf call} instruction and the {\bf ret}.
So, by using a macro we get the efficiency of the nonmodular version
while still getting a modular view for the programmer.\footnote{Though
this modular view will not appear in a debugging tool, a drawback.}

Macros allow arguments too.  Within the macro itself, the macro's
arguments are referred to using backslashes.

For example, the macro {\bf yyy} has two arguments, {\bf u} and {\bf v}:

\begin{Verbatim}[fontsize=\relsize{-2}]
.macro yyy u,v
   movl \u,%eax
   addl \v,%eax
.endm
\end{Verbatim}

So the call

\begin{Verbatim}[fontsize=\relsize{-2}]
yyy $5,$12
\end{Verbatim}

will produce the same code as if we had written

\begin{Verbatim}[fontsize=\relsize{-2}]
movl $5, %eax
addl $12, %eax
\end{Verbatim}

Similarly, the call

\begin{Verbatim}[fontsize=\relsize{-2}]
yyy %ebx,y
\end{Verbatim}

will produce the same code as if we had written

\begin{Verbatim}[fontsize=\relsize{-2}]
movl %ebx, %eax
addl y, %eax
\end{Verbatim}

Note that here we could not even use a subroutine instead of the macro,
as the subroutine could not do ``substitution'' for us.

Since macros give us the modularizing benefit of subroutines without
their performance-inhibiting overhead, why not just always use macros?
Here are some reasons:

\begin{itemize}

\item Macros make the program's executable file larger.  If we call the
same macro at many different points in a program, we get many copies of
the code.

\item Debugging tools usually do not handle macros well.

\item Macros cannot handle recursion.\footnote{Pointed out by Mikel
McDaniel. Of course, one can do recursion ``by hand,'' by managing a
stack with one's own code, but the point is that subroutines do this
automatically.}

\end{itemize}
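
The recursion point can be seen at the C level too, where the
preprocessor's {\bf \#define} plays the role of {\bf .macro}.  The
following is only a sketch; the names {\bf fact()} and {\bf FACT} are
made up for illustration.  The function can call itself because each
call gets its own return address and argument on the stack, whereas a
macro is expanded just once, before the program ever runs.

\begin{Verbatim}[fontsize=\relsize{-2}]
#include <stdio.h>

/* recursive function: each call places a new return address and
   argument on the stack, so the calls unwind properly */
unsigned long fact(unsigned n)
{  return (n <= 1) ? 1UL : n * fact(n - 1); }

/* a "recursive" macro does not work; the preprocessor will not
   re-expand FACT inside its own body, so a use of the following
   would fail to build:
   #define FACT(n) ((n) <= 1 ? 1 : (n) * FACT((n) - 1))
*/

int main(void)
{  printf("%lu\n", fact(5));  /* prints 120 */
   return 0;
}
\end{Verbatim}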

You can write macros in C/C++ too, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
#include <stdio.h>
#define PRINT2INTS(i1,i2) (printf("%d %d\n",i1,i2))

int main(void)
{  PRINT2INTS(5,12);  /* prints 5 12 */
   return 0;
}
\end{Verbatim}
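
As with the {\bf yyy} macro above, a C macro's arguments are substituted
as text, which a function cannot do.  Here is a small sketch of the
difference; the names {\bf ADDTO} and {\bf addto\_func()} are
hypothetical, chosen just for this illustration.  The macro updates the
caller's variable, since the variable's name is pasted into the expanded
code, while the function works only on a copy of the value.

\begin{Verbatim}[fontsize=\relsize{-2}]
#include <stdio.h>

/* macro: acc is substituted as text, so the caller's variable changes */
#define ADDTO(acc,v) ((acc) += (v))

/* function: acc is a copy of the caller's value */
void addto_func(int acc, int v)
{  acc += v; }

int main(void)
{  int s = 5, t = 5;
   ADDTO(s,12);       /* expands to ((s) += (12)) */
   addto_func(t,12);  /* changes only the copy */
   printf("%d %d\n",s,t);  /* prints 17 5 */
   return 0;
}
\end{Verbatim}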

\section{Inline Assembly Code for C++}

The C++ language includes an {\bf asm} construct, which allows you to
embed assembly language source code right there in the midst of your C++
code.\footnote{This is, as far as I know, available in most C++
compilers.  Both GCC and Microsoft's C++ compiler allow it.}

This feature is useful, for instance, to get access to some of the fast
features of the hardware.  For example, say you wanted to make use of
Intel's fast MOVS string copy instruction.  You could write an assembly
language subroutine using MOVS and then link it to your C++ program, but
that would add the overhead of subroutine call/return.  Instead, you
could write the MOVS code there in your C++ source file.

Here's a very short, overly simple
example:

\begin{Verbatim}[fontsize=\relsize{-2}]
// file a.c

int x;

main()

{  scanf("%d",&x);
   __asm__("pushl x");
}
\end{Verbatim}

After doing

\begin{Verbatim}[fontsize=\relsize{-2}]
gcc -S a.c
\end{Verbatim}

the file {\bf a.s} will be

\begin{Verbatim}[fontsize=\relsize{-2}]
        .file   "a.c"
        .section        .rodata
.LC0:
        .string "%d"
        .text
.globl main
        .type   main, @function
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        subl    $20, %esp
        movl    $x, 4(%esp)
        movl    $.LC0, (%esp)
        call    scanf
#APP
        pushl x
#NO_APP
        addl    $20, %esp
        popl    %ecx

\end{Verbatim}

Our assembly language is bracketed by APP and NO\_APP, and sure enough,
it is

\begin{Verbatim}[fontsize=\relsize{-2}]
pushl x
\end{Verbatim}

For an introduction to how to use this feature, see the tutorials on the
Web; just plug ``inline assembly tutorial'' into Google.  For instance,
there is one at \url{http://www.opennet.ru/base/dev/gccasm.txt.html}.
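
To give the flavor of the MOVS idea mentioned earlier, here is a rough
sketch of a byte copy done with inline assembly.  It relies on GCC's
``extended'' {\bf asm} syntax, with operand constraints that tell the
compiler which registers to load, a feature not covered in this chapter.
The function name {\bf copy\_bytes()} is made up for this example, and
the code assumes GCC (or a compatible compiler) on an Intel machine; on
other machines it simply falls back to {\bf memcpy()}.

\begin{Verbatim}[fontsize=\relsize{-2}]
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* copy n bytes from src to dst; copy_bytes is a made-up name */
static void copy_bytes(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
   /* REP MOVSB copies (E)CX bytes from the address in (E)SI to the
      address in (E)DI; the constraints "D", "S" and "c" place our C
      variables in those registers, and "+" marks them as modified */
   __asm__ __volatile__("cld\n\trep movsb"
                        : "+D"(dst), "+S"(src), "+c"(n)
                        :
                        : "memory", "cc");
#else
   memcpy(dst, src, n);  /* fallback on non-Intel machines */
#endif
}

int main(void)
{  char from[] = "hello", to[sizeof from];
   copy_bytes(to, from, sizeof from);  /* includes the trailing NUL */
   printf("%s\n", to);  /* prints hello */
   return 0;
}
\end{Verbatim}

No {\bf call} or {\bf ret} is executed for the MOVS code itself here;
the instructions are placed directly in the compiled body of {\bf
copy\_bytes()}.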


