\chapter{More on Intel Arithmetic and Logic Operations}
\label{chap:arithlog}

Note:  We will assume our machine has a 32-bit word size throughout
this chapter.

\section{Instructions for Multiplication and Division}

\subsection{Multiplication}

\subsubsection{The IMUL Instruction}

One of the Intel instructions for multiplication is {\bf imull} (``IMUL
long,'' i.e. for 32-bit operations).  One form of IMUL is

\begin{Verbatim}[fontsize=\relsize{-2}]
imull src
\end{Verbatim}

where {\bf src} is a 32-bit multiplier specified in register mode.  The
multiplicand is implicitly EAX,\footnote{So, there is only one operand
in this form of the instruction, as opposed to the two in the previous
form.} and the product is stored in what is described in Intel
literature as EDX:EAX.  To see what this means, note that when we
multiply two 32-bit quantities, we potentially get a 64-bit product.
So, the Intel hardware places the upper 32 bits in EDX and the lower 32
bits in EAX.

\subsubsection{Issues of Sign}

Note that even if the product fits into 32 bits, thus into EAX, we still
need to pay attention to EDX.  Suppose for instance that the product
happens to be -4.  In 32-bit form, that is 

\begin{Verbatim}[fontsize=\relsize{-2}]
11111111111111111111111111111100 
\end{Verbatim}

while in 64-bit form it is

\begin{Verbatim}[fontsize=\relsize{-2}]
1111111111111111111111111111111111111111111111111111111111111100 
\end{Verbatim}

In other words, the negative status of -4 is indicated by a 1 in the
first of those 64 bits.  The 64-bit number

\begin{Verbatim}[fontsize=\relsize{-2}]
0000000000000000000000000000000011111111111111111111111111111100 
\end{Verbatim}

is positive.  In fact, it has the value $2^{32}-4$, which you can
see because we wind backwards four times from

\begin{Verbatim}[fontsize=\relsize{-2}]
0000000000000000000000000000000100000000000000000000000000000000.
\end{Verbatim}

\checkpoint

\subsection{Division} 

\subsubsection{The IDIV Instruction}

For IDIV, one form is

\begin{Verbatim}[fontsize=\relsize{-2}]
idivl src
\end{Verbatim}

We place the dividend in EDX:EAX and the divisor in {\bf src}, again
with the latter being specified in register addressing mode.  The
quotient then comes out in EAX, and the remainder in EDX.  If the
quotient is too big for 32-bit storage, then an {\bf exception}, i.e. an
execution error will occur.

\subsubsection{Issues of Sign}

As noted for multiplication above, you must be careful with 64-bit
quantities.  Even if your dividend fits into 32 bits, it will be treated
as a 64-bit number, with the high bits being in EDX---whether you put
them there or not.  

So, if your dividend is nonnegative, make sure to put 0 in EDX;
otherwise put -1 there.

\checkpoint

A compact and fast way to do this is to use the CWD instruction.

\subsection{Example}

Here is an example of IMUL/IDIV:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
.text
.globl _start
_start:
      movl $2, %ecx
      movl $-1, %eax
      imull %ecx
      idivl %ecx
done: movl %ebx, %ebx
\end{Verbatim}

Let's use GDB to explore the behavior of these instructions:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
gdb) r
Starting program: /www/matloff/public_html/50/PLN/a.out

Breakpoint 1, _start () at g.s:4
4             imull %ecx
Current language:  auto; currently asm
(gdb) p/x $eax
$1 = 0xffffffff
(gdb) p/x $edx
$2 = 0x0
(gdb) si
8             idivl %ecx
(gdb) p/x $eax
$3 = 0xfffffffe
(gdb) p/x $edx
$4 = 0xffffffff
(gdb) si
9       done: movl %ebx, %ebx
(gdb) p/x $eax
$5 = 0xffffffff
(gdb) p/x $edx
$6 = 0x0
\end{Verbatim}

We started with the 32-bit version of -1 in EAX, and 0 in EDX.
Multiplying by 2 gives us -2, but IMUL puts its product in 64-bit form
in the register pair EDX:EAX.  GDB verified this for us.  Then we
divided by 2, getting -1, but since IDIV puts its result in 32-bit form
in the single register EAX, EDX becomes 0.

\checkpoint

Or, if we want to view the numbers as unsigned---remember, the nice
thing about 2s complement is that we have our choice of
interpretation---then our multiplying by 2 changed $2^{32}-1$ (a number
fitting in 32 bits) to $2^{33}-2$ (a 64-bit number).

\section{More on Carry and Overflow, and More Jump Instructions}

In Chapter \ref{chap:asm}, we saw how to use instructions like JS, JNZ, etc.
to do ``if-then-else'' tests.  This works for simple cases with small or
moderate-sized numbers.  Unfortunately, it is more complicated than
this, though, due to problems with the difference between signed and
unsigned numbers.  To see why, suppose for convenience that we are
working with 1-byte quantities (other than convenience, there is nothing
special about using only one byte here), and consider what happens with 

\begin{Verbatim}[fontsize=\relsize{-2}]
cmpb %al, %bl
\end{Verbatim}

when c(AL) = 0x50 and c(BL) = 0xfb.  Viewed as signed numbers, c(BL) is
smaller, since $80 > -5$, but viewed as unsigned numbers, c(AL) is
smaller, since $80 < 251$.  

Recall that the hardware doesn't know whether we are considering our
numbers to be signed or unsigned.  So, for example, there is nothing in
the hardware for the CMP instruction to distinguish between the signed
and unsigned cases, nor is there such a capability in the ADD and SUB
instructions.\footnote{The MIPS chip, by contrast, does have, for
instance, both ADD and ADDU instructions for addition.  ADD causes an
{\bf exception}, i.e. execution error, upon overflow, while ADDU does
nothing to report the problem.  C/C++ compilers will use ADDU.}
However, the hardware does store relevant information in the Carry and
Overflow flags, and this will allow us as programmers to handle the
signed and unsigned cases differently on our own, as we will see below.  

Before continuing, let's review at what the hardware does when adding
two numbers.  It will in any addition simply add the two numbers bit by
bit sequentially from right to left, with carries, in grade-school
fashion.  The hardware is basically treating the numbers as unsigned,
but remember, that's OK even if we are considering the numbers to be
signed, because the nature of 2s complement storage makies it all
work out.

When performing an ADD instruction, the hardware will set or clear the
Carry flag (CF), according to whether a carry does occur out of the most
significant bit.  As noted in Chapter \ref{chap:asm}, we can
programmatically check this by using the JC or JNC instructions.
Similarly, the hardware will set or clear the Overflow flag (OF), under
certain conditions which we will discuss shortly; our program can check
this flag via the JO and JNO instructions.  

The term {\bf overflow}---as opposed to the Overflow flag---informally
means that the result of an operation is too large to fit into the
number of bits allocated to it.  The definition of ``too large'' in turn
depends on whether we are treating the numbers as signed or unsigned.  

Again taking the 8-bit case as our example, suppose c(AL) = 0x50 and
c(BL) = 0x50, i.e. both registers contain decimal 80.  Their sum, 160,
does fit into 8 bits if we are considering our numbers to be unsigned,
but if we are thinking of them as signed, then ``160'' is really -96.
So, in this example an overflow will not occur if the numbers are
considered unsigned, but will occur if they are considered signed.

Clearly, if we are considering our numbers to be unsigned, overflow is
indicated by the Carry Flag.  But what about the signed case?  The Carry
Flag will not tell us (at least not directly) whether overflow has
occurred.  In other words, we'll need another flag to check overflow in
the signed case, and that's the purpose of the Overflow Flag.

So, how did the Intel engineers design the circuitry which manages the
Overflow Flag?  Well, let's first note that there won't be an overflow
problem if we are adding a nonnegative number to a negative one.  The
nonnegative number will be in the range 0 to +127, while the negative
one will be between -128 and -1.  Then the sum must necessarily be
between -128 and +127, thus no overflow.

But a problem can occur if we are adding two positive numbers.  They are
basically 7 bits each (not counting the sign bit), so overflow will
occur if they have an 8-bit sum.  In such a case, there will be a carry
out of the 7th bit, thus placing a 1 in the 8th bit---thus producing a
negative number if we are considering everything signed.  If you step
through this reasoning in the case of adding two negative numbers,
you'll see the problem arises if their sum is positive.  In other words:

\begin{quote}
In signed addition, overflow occurs if and only if the sum of two
positive numbers is negative or the sum of two negative numbers is
positive.
\end{quote}

Knowing this, the developers of the Intel hardware designed the
circuitry so that the Overflow flag would be set if it sees that there
is a sign change of the nature described above, and clears the flag
otherwise.

Note once again that the hardware doesn't know whether we are
considering our numbers to be signed or unsigned.  The developers of the
hardware assumed that most people's typical usage would be signed, and
thus they designed the Overflow flag accordingly.  If we really are
treating our numbers as unsigned, then we would have our program check
the Overflow flag.  But if we are treating them as unsigned, we would
check the Carry flag.

\checkpoint

What about subtraction?  Recall that it is implemented via addition,
i.e. x-y is effected by adding the 2s complement representation of -y to
x, so we are really back to the addition case.  But if we are using
subtraction to set up a conditional jump---typically by using a CMP
instruction---we need to worry about the sign flag too, as follows.

\begin{itemize}

\item If the Overflow flag is 0, we have nothing to worry about, and can
simply look at the Sign flag to determine whether the subtraction
resulted in a negative number.

\item If the Overflow is 1, things are a little less simple.  Recall that
when overflow occurs in an addition (which, recall, is performed for the
subtraction process), the sign of the sum is opposite to that of the two
addends.  In other words, an overflow addition changes signs.  So, if
the subtraction had resulted in a negative difference, which ordinarily
would set the Sign flag, the overflow will make the Sign flag 0 instead.
Conversely, if the difference were not negative, the overflow will make
the Sign flag 1, falsely making the difference look negative.

\end{itemize}

From these considerations, we see that:

\begin{quote}
In a subtraction of signed numbers x-y, y will be the larger of the two
if and only if the OF $\neq$ SF.
\end{quote}

For this reason, the developers of the Intel chip including the JL
(``jump if less than'') which jumps if OF $\neq$ SF.  This is then
the safe way to do less-than comparisons on signed numbers; JS is not
safe unless you know your numbers are small enough that overflow won't
occur.  There are also others.  Here is an incomplete list (here ``dst''
and ``src'' refer to the destination and source done, for example, in an
immediately-previous CMP):

\begin{tabular}{|l|l|l|l|}
\hline
mnemonic & action & flags checked \\ \hline 
\hline
JL & jump if dst $<$ src & SF $\neq$ OF \\ \hline 
JLE & jump if dst $\leq$ src & SF = 1 or SF $\neq$ OF \\ \hline 
JG & jump if dst $>$ src & ZF = 0 and SF = OF \\ \hline 
JGE & jump if dst $\geq$ src & SF = OF \\ \hline 
\end{tabular} 

There are corresponding instructions for the unsigned case:  JB (``jump
if below''), JBE (``jump if below or equal''), JA (``jump if above''),
and JAE (``jump if above or equal''),

For {\bf imull}, if the product cannot fit into 32 bits and the high
bits go into EDX, both CF and OF are set.  (SF and ZF are undefined.)
With {\bf idivl}, all of these flages are undefined.

\section{Logical Instructions}

This section will illustrate one of the most important tools available
to the assembly language programmer---data operations at the bit level.
Keep these in mind in all your assembly language programming.

In Chapter \ref{chap:asm}, we saw how to access 16-bit and 8-bit
operands.  But what can we do in smaller cases, for example 1-bit
operands?  The answer is that we cannot do this in a manner directly
analogous to the 16-bit and 8-bit cases.  The reason for this is that
although individual bytes have addresses, individual bits do not.  Thus
for example there is no 1-bit MOV instruction comparable to {\bf movl}
and {\bf movb}.

However, we can access individual bits, or groups of bits, in an
indirect manner by using the instructions introduced in this section.

A logical AND operation does the following:  At each bit position in the
two operands, the result is 1 if both operand bits are 1, and 0
otherwise.

For example,

\begin{tabular}{rr}
~ & 1011 \\ 
AND & 1101 \\ 
\hline 
~ & 1001 \\ 
\end{tabular}

In the leftmost bit position of the two operands here, we see two 1s, so
there is a 1 in the result.  In the second-to-the left position, there is
a 0 in the first operand and 1 in the second, so there is a 0 in the
result, etc.

If we AND together x and y, placing the result back in y, we zero out
every bit of y except those corresponding to 1s in x; those latter bits
of y are left intact.

Suppose for instance that we wish to put a 0 in bit position 1 (second
from the right) in EAX, but keep all the other bits in EAX unchanged.
We would use the instruction 

\begin{Verbatim}[fontsize=\relsize{-2}]
andl $0xfffffffd, %eax
\end{Verbatim}

because 0xfffffffd has a 0 at bit position 2 and all 1s elsewhere.

Often an AND operation is used as a {\bf mask}, which is a mechanism by
which we can test for certain bits to have certain values.  For example,
suppose we wish to jump to a location {\bf w} if bit positions 15 and 4
have the values 1 and 0, respectively, in ECX.  We could use the
following code to do this:

\begin{Verbatim}[fontsize=\relsize{-2}]
andl $0x00008010, %ecx
cmpl $0x00008000, %ecx
jz w
\end{Verbatim}

The point is that after the {\bf andl} is executed, c(ECX) will equal
0x00008000 if and only if the original value in ECX had 1 and 0 in bit
positions 15 and 4.

The \& operator in C (the binary operator, not the ``address of''
operator) performs an AND operation, and in fact the compiler will
probably translate a statement containing \& into and {\bf andl}
instruction.  

In an OR operation, if at least one of the two corresponding bits in the
two operands is a 1, then the bit in the result is a 1, while that bit is
a 0 otherwise.

The {\bf orl} (``OR long'') instruction performs an OR operation on the
source and destination, placing the result back in the destination.
Again, it is important to note that that means that for every bit in the
source operand which is a 0, the corresponding bit in the destination
remains unchanged, while every 1 in the source will cause the
corresponding bit in the destination to become 1.  In other words, {\bf
orl} is used to place 1s in desired spots within the destination while
keeping the other destination bits unchanged.

So for example to set bit 2 to 1 in EAX, we would use the instruction

\begin{Verbatim}[fontsize=\relsize{-2}]
orl $0x00000004, %eax
\end{Verbatim}

The $|$ operator in C performs a logical-or operation, and the compiler
will usually translate it to {\bf orl} or other instruction from the OR
family.

Logical operations are quite prominent in the programming of device
drivers.  For example, the built-in speaker on a PC has a 1-byte {\bf
port} at address 0x61.  The least-significant 2 bits must be set to 11
for the speaker to be on, 00 for off; the other bits contain other
important information and must not be changed.  I/O ports, as we will
see later, are accessed on Intel machines via Intel's IN and OUT
instructions.\footnote{Accessing them requires status as privileged
user.}  The following code turns the PC speaker off and on:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
# to turn speaker on:
# copy speaker port to AL register
inb $0x61,%al  
# place 11 in last 2 bits of AL
orb $3,%al  
# copy back to speaker port
outb %al,$0x61

# to turn speaker off:
# copy speaker port to AL register
inb $0x61,%al
# place 0s in last 2 bits of AL
andb $0xfc,%al  
# copy back to speaker port
outb %al,$0x61
\end{Verbatim}

\checkpoint

The NOT instruction (e.g. {\bf notl}) changes all 1 bits to 0s and all 0
bits to 1s.

Each bit position in the result of the XOR (``exclusive or'')
instruction is equal to 1 if and only if exactly one of the operands has
a 1 at the same bit position.\footnote{You can also think of it as
adding two bits and taking the sum modulo 2.}  It is a classic way of
zeroing-out a register using a very short instruction.  For example,

\begin{Verbatim}[fontsize=\relsize{-2}]
xorl %ecx, %ecx
\end{Verbatim}

will make all bits of ECX equal to 0.

There are also {\bf shift} operations, which move all bits in the
operand to the left or right.  The number of bit positions to shift
can be specified either in an immediate constant, or in the CL
register.\footnote{Recall that this is the lower 8 bits of ECX.}

For example,

\begin{Verbatim}[fontsize=\relsize{-2}]
shll $4, %eax
\end{Verbatim}

would move all bits in EAX leftward by 4 positions.  Note that those which
move past the left end are lost, and the bit positions vacated near the
right end become 0s.  So, for instance, if c(EAX) had been 0xd5fffff3,
after the shift it would be 0x5fffff30.  

The {\bf shrl} instruction does the same thing, but rightward.

Note that if we are considering the contents of the operand to be an
unsigned integer, a left shift by 4 bits is equivalent to multiplication
by $2^4 = 16$, and a right shift by 4 bits is equivalent to division by
16.  

If on the other hand we are considering the operand to be a signed
integer, we would use {\bf sall} and {\bf sarl}, since these preserve
signs.  It is not an issue for multiplication (except for overflow), 
but it makes a difference for division.

This difference is summarized in the comments in the following code:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
.text  
.globl _start
_start:
   movl $0x80000000,%eax
   movl $0x80000000,%ebx
   shrl $1,%eax
   sarl $1,%ebx
\end{Verbatim}

The point is that though EAX starts out as a negative number, due to
having a 1 in its leftmost bit, shifting EAX one bit rightward will
leave a 0 in the leftmost bit, thus rendering EAX positive.  In other
words, dividing this negative number by 2 (which is what a right-shift
of one bit does), which should produce another negative number, instead
produces a positive number.  (Of course, if we are interpreting the
contents of EAX to be unsigned, there is no problem.)

On the other hand, the SAR family retains the original sign.  After the
right-shift is performed, leaving the leftmost bit a 0, SAR replaces
that bit with a 1 if the original bit there had been a 1. 

Let's execute this code under GDB:

\begin{Verbatim}[fontsize=\relsize{-2}]
(gdb) b 9
Breakpoint 1 at 0x804805e: file y.s, line 9.
(gdb) r
Starting program:
/fandrhome/matloff/public_html/matloff/public_html/50/PLN/y

Breakpoint 1, _start () at y.s:9
9          shrl $1,%eax
Current language:  auto; currently asm
(gdb) i r
eax            0x80000000       -2147483648
ebx            0x80000000       -2147483648
...
(gdb) n
_start () at y.s:10
10         sarl $1,%ebx
(gdb) n
done () at y.s:11
11      done: movl %ecx,%ecx
(gdb) i r
eax            0x40000000       1073741824
ebx            0xc0000000       -1073741824
...
\end{Verbatim}

Note both the hex and decimal contents.

In C/C++, left and right shift operations are performed via $<<$ and
$>>$.  In the latter case, the compiler will translate to {\bf shrl}
if the operand is {\bf unsigned}, and {\bf sall} in the signed case. 

\section{Floating-Point}

Modern Intel CPUs have a built-in Floating-Point unit, which is hardware
to perform floating-point add, subtract and so on.  Without this
hardware, the operations must be performed in software.  For example,
consider the code

\begin{Verbatim}[fontsize=\relsize{-2}]
float u,v,w;
...
w = u + v;
\end{Verbatim}

With an Intel CPU, the compiler would implement the addition by
generating an FADD (``floating-point add'') instruction.  If there were
no such instruction, the compiler would have to generate a rather length
amount of code in which one would first decompose {\bf u} and {\bf v}
into their mantissa and exponent portions, would then change one of the
addends in order to match exponent values, would then add the mantissas,
etc.  The resulting code would run much more slowly would a single FADD
instruction.

Intel CPUs have special floating-point registers, each 80 bits wide,
thus giving more accuracy than the 32-bit or even 64-bi size.  In ATT\&T
syntax, the registers are named {\bf \%st(0)} (also called simply {\bf
\%st}), {\bf \%st(1)}, {\bf \%st(2)} and so on.  The ``st'' here stands
for ``stack,'' as the registers are indeed arranged as a {\bf stack}.

We will learn about stacks in detail in Chapter \ref{chap:sub}, but
the main point is that a stack is a ``last-in, first-out'' data
structure, meaning that items are removed from the stack in reverse
order from which they are inserted.  For instance, say the stack is
initially empty, and then we {\bf push}, i.e. insert, the numbers
1.8, 302.5 and -29.5222 onto the stack.  We say that -29.5222 is now at
the {\bf top} of the stack, followed by 302.5 and then 1.8.  If we now
{\bf pop} the stack, that means removing the top element, -29.5222, so
that the new top is 302.5, with 1.8 next.  If we pop again, the stack
consists only of 1.8.  If we now push 168.8888, then that latter value
is on top, with 1.8 next.

Here is some sample code:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
.data  
x:     .float 1.625
y:     .float 0.0
point25:
       .float 0.25
.text  
.globl _start
_start:
   flds x  # push x onto f.p. stack; x now on top
   flds point25  # push 0.25; 0.25 now on top, followed by x
   faddp  # add the top 2 elts. of the stack, replace 2nd by sum,
          # then pop once; x+0.25 now on top
   fstps y  # pop the stack and store the popped value in y
\end{Verbatim}

Here we see a new assembler directive, {\bf .float}.  This tells the
assembler to arrange things so that when this program is run, the
initial value at {\bf x} in memory will be the (32-bit) representation
of the number 1.625.  From Chapter \ref{chap:datarep}, we know that
this is the bit pattern 0x3fd00000.  In fact, we would get the same
result from

\begin{Verbatim}[fontsize=\relsize{-2}]
x:     .long 0x3fd00000
\end{Verbatim}

and would even save the assembler some work in the process, since it
would be spared the effort of converting 1.625.\footnote{That effort
would be the same as what {\bf scanf()} would go through with {\bf \%f}
format:  It would count the digits after the decimal point, finding the
number to be 3.  It would then convert the characters `1', `6' etc. to 1, 6
and so on, and thus form the numbers 1 and 625.  Finally it would do
something like {\tt a = 1 + 625.0/1000.0}.}

There are two {\bf flds} instructions.  They are from the FLD
(``floating-point load'') Intel family; the `s' here means ``stack,''
just as `l' means ``long'' in {\bf addl} from the ADD family.

Read the comments in the code before continuing.

OK, let's use GDB to learn more about this code:

\begin{Verbatim}[fontsize=\relsize{-2}]
(gdb) l
1       .data
2       x:     .float 1.625
3       y:     .float 0.0
4       point25:
5              .float 0.25
6       .text
7       .globl _start
8       _start:
9          flds x  # push x onto f.p. stack
10         flds point25  # push 0.25
(gdb)
11         faddp
12         fstps y
(gdb) b 12
Breakpoint 1 at 0x8048082: file fp.s, line 12.
(gdb) r
Starting program:
/fandrhome/matloff/public_html/matloff/public_html/50/PLN/fp

Breakpoint 1, _start () at fp.s:12
12         fstps y
Current language:  auto; currently asm
(gdb) nexti
0x08048088 in ?? ()
(gdb) p/f y
$1 = 1.875
(gdb) p/x y
$2 = 0x3ff00000
\end{Verbatim}

Nothing new here, except that we've used GDB's {\bf p} (``print'')
command with {\bf f} format.  Obviously, the latter is similar to {\bf
\%f} for {\bf printf()}.  We see that {\bf y} comes out to 1.875, as
expected.  We also print out the value of {\bf y} as a bit string,
which turns out to be 0x3ff00000.  You should check that that is indeed
the IEEE format representation of 1.875.

\checkpoint

Of course, there are also the obvious instruction families such as FSUB
for subtraction, FMUL for multiplication and so on.  But there are also
instruction families that implement transcendental function operations,
such as FSIN for sine, FSQRT for square root, etc.  Again, by
implementing these operations in hardware, we can get large speedups in
C/C++ programs that make heavy use of {\bf float} variables, such as
programs that do graphics, games, scientific computation and so on.

The Intel hardware also gives the programmer the ability to have fine
control over issues such as rounding, via special instructions that
control {\bf floating-point status registers}.


