\chapter{Major Components of Computer {}``Engines{}''}

\section{Introduction}

Recall from Chapter 0 that we discussed the metaphor of the {}``engine{}''
of a computer. This engine has two main components:

\begin{itemize}

\item the hardware, including the central processing unit (CPU) which
executes the computer's machine language, and

\item the low-level software, consisting of various services that the
operating system makes available to programs. 

\end{itemize}

This chapter will present a broad overview of both components of this engine.
The details will then unfold in the succeeding chapters.

One of the major goals of this chapter, and a frequent theme in the
following chapters, is to develop an understanding of the functions of
these two components.  In particular, questions which will be addressed
both in this chapter and in the chapters which follow concern the
various functions of computer systems:

\begin{itemize}

\item What functions are typically implemented in hardware?

\item What functions are typically implemented in software?

\item For which functions are both hardware and software implementations
common? Why is hardware implementation generally faster, but software
implementation more flexible? 

\end{itemize}

Related to these questions are the following points, concerning the
\textbf{portability} of a program, i.e. the ability to move a program
developed under one computing environment to another. What dependencies,
if any, does the program have to that original environment? Potential
dependencies are:

\begin{itemize}

\item Hardware dependencies:

A program written in machine language for an older PC's Intel 80386
CPU\footnote{Or clones of Intel chips, such as those by AMD.  Throughout
this section, our term {\it Intel} will include the clones.} certainly
won't run on Motorola's PowerPC CPU.\footnote{This is what the Apple
Macintosh used until 2005.} But that same program \textit{will} run on
PC using the Intel Pentium, since the Intel family is designed to be
\textbf{upward compatible}, meaning that programs written for earlier
members of the Intel CPU family are guaranteed to run on later members
of the family (though not vice versa).

\item Operating system (OS) dependencies:

A program written to work on a PC under the Windows OS will probably not
run on the same machine under the Linux OS (and vice versa), since the
program will probably call OS functions. 

\end{itemize}

And finally, we will discuss one of the most important questions of all: What
aspects of the hardware and software components of the engine determine the
overall speed at which the system will run a given program?


\section{Major Hardware Components of the Engine}


\subsection{System Components}

A block diagram of the basic setup of a typical computer system appears
in Figure \ref{system}.   

\begin{figure}[tb]
\centerline{
\includegraphics[width=3.6in]{Images/Fig21.jpg}
}
\caption{system structure}
\label{system}
\end{figure}

The major components are as follows:

\textbf{CPU}

As mentioned earlier, this is the \textbf{central processing unit},
often called simply the \textbf{processor}, where the actual execution of a
program takes place. (Since only machine language programs can execute on a
computer, the word \textit{program} will usually mean a machine language program.
Recall that we might write such a program directly, or it might be produced
indirectly, as the result of compiling a source program written in a high-level
language (HLL) such as C.)

\textbf{Memory}

A program's data and machine instructions are stored here during the
time the program is executing. Memory consists of cells called
\textbf{words}, each of which is identifiable by its \textbf{address}. 

If the CPU fetches the contents of some word of memory, we say that the
CPU \textbf{reads} that word. On the other hand, if the CPU stores a
value into some word of memory, we say that it \textbf{writes} to that
word. Reading is analogous to watching a video cassette tape, while
writing is analogous to recording onto the tape.

Ordinary memory is called \textbf{RAM}, for Random Access Memory, a term
which means that the access time is the same for each word.\footnote{
In the market for personal computers, somehow the word {}``memory{}''
has evolved to mean disk space, quite different from the usage here. In
that realm, what we refer to here as {}``memory{}'' is called
{}``RAM.{}'' The terms used in this book are the original ones, and are
standard in the computer industry, for instance among programmers and
computer hardware designers. However, the reader should keep in mind
that the terms are used differently in some other contexts.  } There is
also \textbf{ROM} (Read-Only Memory), which as its name implies, can be
read but not written. ROM is used for programs which need to be stored
permanently in main memory, staying there even after the power is turned
off. For example, an autofocus camera typically has a computer in it,
which runs only one program, a program to control the operation of the
camera. Think of how inconvenient---to say the least---it would be if
this program had to be loaded from a disk drive everytime you took a
picture! It is much better to keep the program in ROM.\footnote{Do not
confuse ROM with CD-ROMs, in spite of the similar names.  A ROM is a
series of words with addresses within the computer's memory space, just
like RAM, whereas a CD-ROM is an input/output device, like a keyboard or
disk.}

\textbf{I/O Devices}

A typical computer system will have several \textbf{input/output}
devices, possibly even hundreds of them (Figure \ref{system} shows two
of them). Typical examples are keyboards/monitor screens, floppy and
fixed disks, CD-ROMs, modems, printers, mice and so on.

Specialized applications may have their own special I/O devices. For example,
consider a vending machine, say for tickets for a regional railway system such
as the San Franciso Bay Area's BART, which is capable of accepting dollar bills.
The machine is likely to be controlled by a small computer. One of its input
devices might be an optical sensor which senses the presence of a bill, and
collects data which will be used to analyze whether the bill is genuine. One
of the system's output devices will control a motor which is used to pull in
the bill; a similar device will control a motor to dispense the railway ticket.
Yet another output device will be a screen to give messages to the person buying
the ticket, such as {}``please deposit 25 cents more.{}''

The common feature of all of these examples is that they serve as interfaces
between the computer and the {}``outside world.{}'' Note that in all cases,
they are communicating with a \textit{program} which is running on the computer.
Just as you have in the past written programs which input from a keyboard and
output to a monitor screen, programs also need to be written in specialized
applications to do input/output from special I/O devices, such as the railway
ticket machine application above. For example, the optical sensor would collect
data about the bill, which would be input by the program. The program would
then analyze this data to verify that the bill is genuine.

Note carefully the difference between memory and disk.\footnote{Often
confused by the fact that salespeople at computer stores erroneously
call disk ``memory.''}  Memory is purely electronic, while disk is
electromechanical, the latter referring to the fact that a disk is a
rotating platter.  Both memory and disk store bits, but because of the
mechanical nature of disk, it is very much slower than memory---memory
access is measured in nanoseconds (billionths of seconds) while disk
access is measured in milliseconds (thousandths of seconds).  The good
thing about disk is that it is nonvolatile, so we can store files
permanently.\footnote{We couldn't do that with ROM, since we want to be
able to modify the files.}  Also, disk storage is a lot cheaper, per bit
stored, than memory.

\textbf{System Bus}

A \textbf{bus} is a set of parallel wires (usually referred to as
{}``lines{}''), used as communication between components. Our
\textbf{system bus} plays this role in Figure \ref{system}--- he CPU
communicates with memory and I/O devices via the bus. It is also
possible for I/O devices to communicate directly with memory, an action
which is called \textbf{direct memory access} (DMA), and again this is
done through the bus.\footnote{This is the basic view, but technically
it applies more to older or smaller computers.  In a PC, for instance,
there will be extra chips which serve as ``agents'' for the CPU's
requests to memory.  For a description of how some common chips like
this work, and their implications for software execution speed, see
Chapter 5 of {\it Code Optimization:  Effective Memory Usage}, by Kris
Kaspersky, A-LIST, 2003.}

The bus is broken down into three sub-buses:

\begin{itemize}
\item \textbf{Data Bus:}


As its name implies, this is used for sending data. When the CPU reads a memory
word, the memory sends the contents of that word along the data bus to the CPU;
when the CPU writes a value to a memory word, the value flows along the data
bus in the opposite direction.

Since the word is the basic unit of memory, a data bus usually has as many lines
as there are bits in a memory word. For instance, a machine with 32-bit word
size would have a data bus consisting of 32 lines.

\item \textbf{Address Bus:}


When the CPU wants to read or write a certain word of memory, it needs to have
some mechanism with which to tell memory \textit{which} word it wants to read
or write. This is the role of the address bus. For example, if the CPU wants
to read Word 504 of memory, it will put the value 504 on the address bus, along
which it will flow to the memory, thus informing memory that Word 504 is the
word the CPU wants.

The address bus usually has the same number of lines as there are bits in the
computer's addresses.

\item \textbf{Control Bus:}


How will the memory know whether the CPU wants to read or write? This is one
of the functions of the control bus. For example, the control bus in typical
PCs includes lines named MEMR and MEMW, for {}``memory read{}'' and {}``memory
write.{}'' If the CPU wants to read memory, it will \textbf{assert} the MEMR
line, by putting a low voltage on it, while for a write, it will assert MEMW.
Again, this signal will be noticed by the memory, since it too is connected
to the control bus, and so it can act accordingly. 

\end{itemize} As an example, consider a machine with both address and
word size equal to 32 bits. Let us denote the 32 lines in the address
bus as \( A_{31} \) through \( A_{0} \), corresponding to Bits 31
through 0 of the 32-bit address, and denote the 32 lines in the data bus
by \( D_{31} \) through \( D_{0} \), corresponding to Bits 31 through 0
of the word being accessed. Suppose the CPU executes an instruction to
fetch the contents of Word 0x000d0126 of memory. This will involve the
CPU putting the value 0x000d0126 onto the address bus. Remember, this is
hex notation, which is just a shorthand abbreviation for the actual
value,

\begin{verbatim}

000000000000000011010000000100100110
\end{verbatim}

So, the CPU will put 0s on lines \( A_{31} \) through \( A_{20} \), a 1
on Line \( A_{19} \), a 1 on Line \( A_{18} \), a 0 on Line \( A_{17}
\), and so on. At the same time, it will assert the MEMR line in the
control bus. The memory, which is attached to these bus lines, will
sense these values, and {}``understand{}'' that we wish to read Word
0x000d0126. Thus the memory will send back the contents of that word on
the data bus. If for instance c(0x000d0126) = 0003, then the memory will
put 0s on Lines \( D_{31} \) through \( D_{2} \), and 1s on Lines \(
D_{1} \) and \( D_{0} \), all of which will be sensed by the CPU.

Some computers have several buses, thus enabling more than one bus transaction
at a time, improving performance.


\subsection{CPU Components}

\subsubsection{Intel/Generic Components}

We will now look in more detail at the components in a CPU. Figure
\ref{internal} shows the components that make up a typical CPU. Included
are an \textbf{arithmetic and logic unit} (ALU), and various
\textbf{registers}.

\begin{figure}[tb]
\centerline{
\includegraphics[width=4.2in]{Images/Fig22.jpg}
}
\caption{CPU internals}
\label{internal}
\end{figure}

The ALU, as its name implies, does arithmetic operations, such as addition,
subtraction, multiplication and division, and also several \textbf{logical}
operations. The latter category of operations are similar to the \verb-&&-,
\verb-||- and ! operators in the C language) used in logical expressions such
as

\begin{verbatim}

if (a < b && c == 3) x = y;
\end{verbatim}

The ALU does not \textit{store} anything. Values are input to the ALU, and results
are then output from it, but it does not store anything, in contrast to memory
words, which do store values. An analogy might be made to telephone equipment.
A telephone inputs sound, in the form of mechanical vibrations in the
air, and converts the sounds to electrical pulses to be sent to the
listener's phone, but it does not store these sounds. A telephone
tape-recording answering machine, on the other hand, does store the
sounds which are input to it. 

Registers are storage cells similar in function to memory words. The
number of bits in a register is typically the same as that for memory
words. We will even use the same c( ) notation for the contents of a
register as we have used for the contents of a memory word. For example,
c(PC) will denote the contents of the register PC described below, just
as, for instance, c(0x22c4) means the contents of memory word 0x22c4.
Keep in mind, though, that registers are not in memory; they are inside
the CPU. Here are some details concerning the registers shown in Figure
\ref{internal}. 

\begin{itemize}

\item \textbf{PC:} This is the \textbf{program counter}. Recall that a
program's machine instructions must be stored in memory while the
program is executing. The PC contains the address of the currently
executing instruction.  The term {\it PC} is a generic term; on Intel,
the register is called EIP (Extended Instruction Pointer).\footnote{The
registers discussed here and below are for 32-bit machines.  The 64-bit
versions use R instead of E, e.g. RAX instead of EAX.}

\item \textbf{ESP:} The \textbf{stack pointer} contains the address of
the {}``top{}'' of a certain memory region which is called the
\textbf{stack}. A stack is a type of data structure which the machine
uses to keep track of function calls and other information, as we will
see in Chapter 5. 

\item {\bf EAX, EBX, etc.:}  These are {\bf data registers}, used for
temporary storage of data.

\item \textbf{EFLAGS:} The \textbf{processor status register} (PS), called
EFLAGS on Intel machines, contains miscellaneous pieces of information,
including the \textbf{condition codes}. The latter are indicators of
information such as whether the most recent computation produced a
negative, positive or zero result.  Note that there are wires leading
out of the ALU to the EFLAGS (shown as just one line in Figure
\ref{internal}). These lines keep the condition codes up to date. Each
time the ALU is used, the condition codes are immediately updated
according to the results of the ALU operation.

Generally the PS will contain other information in addition to condition
codes.  For example, it was mentioned in MIPS and PowerPC processors
give the operating system a choice as to whether the machine will run in
big-endian or little-endian mode. A bit in the PS will record which mode
is used.

\item \textbf{MAR:} The \textbf{memory address register} is used as the
CPU's connection to the address bus. For example, if the currently
executing instruction needs to read Word 0x0054 from memory, the CPU
will put 0x0054 into MAR, from which it will flow onto the address bus.
\item \textbf{MDR:} The \textbf{memory data register} is used as the
CPU's connection to the data bus. For example, if the currently
executing instruction needs to read Word 0x0054 from memory, the memory
will put c(0x0054) onto the data bus, from which it will flow into the
MDR in the CPU. On the other hand, if we are writing to Word 0x0054, say
writing the value 0x0019, the CPU will put 0x0019 in the MDR, from which
it will flow out onto the data bus and then to memory.  At the same
time, we will put 0x0054 into the MAR, so that the memory will know to
which word the 0x0019 is to be written.  

\item \textbf{IR:} This is the \textbf{instruction register}. When the
CPU is ready to start execution of a new instruction, it fetches the
instruction from memory.  The instruction is returned along the data
bus, and thus is deposited in the MDR. The CPU needs to use the MDR for
further accesses to memory, so it enables this by copying the fetched
instruction into the IR, so that the original copy in the MDR may be
overwritten.  

\end{itemize} 

A note on the sizes of the various registers: The PC, ESP, and MAR all
contain addresses, and thus typically have sizes equal to the address
size of the machine.  Similarly, the MDR typically has sizes equal to
the word size of the machine. The PS stores miscellaneous information,
and thus its size has no particular relation to the machine's address or
word size. The IR must be large enough to store the longest possible
instruction for that machine.

A CPU also has internal buses, similar in function to the system bus,
which serve as pathways with which transfers of data from one register
to another can be made. Figure \ref{internal} shows a CPU having only
one such bus, but some CPUs have two or more. Internal buses are beyond
the scope of this book, and thus any reference to a {}``bus{}'' from
this point onward will mean the system bus. 

The reader should pay particular attention to the MAR and MDR. They will
be referred to at a number of points in the following chapters, both in
text and in the exercises---not because they are so vital in their own
right, but rather because they serve as excellent vehicles for
clarifying various concepts that we will cover in this book. In
particular, phrasing some discussions in terms of the MAR and MDR will
clarify the fact that some CPU instructions access memory while others
do not.

Again, the CPU structure shown above should only be considered {}``typical,{}''
and there are many variations. RISC CPUs, not surprisingly, tend to be somewhat
simpler than the above model, though still similar.


\subsubsection{History of Intel CPU Structure}

The earliest widely-used Intel processor chip was the 8080.  Its word
size was 8 bits, and it included registers named A, B, C and D (and a
couple of others).  Address size was 16 bits.

The next series of Intel chips, the 8086/8088,\footnote{The first IBM PC
used the 8088.} and then the 80286, featured 16-bit words and 20-bit
addresses.  The A, B, C and D registers were accordingly extended to
16-bit size and renamed AX, BX, CX and DX (`X' stood for ``extended'').
Other miscellaneous registers were added.  The lower byte of AX was
called AL, the higher byte AH, and similarly for BL, BH, etc.

Beginning with the 80386 and extending to the Pentium series, both word
and address size were 32 bits.  The registers were again extended in
size, to 32 bits, and renamed EAX, EBX and so on (`E' for ``extended'').
The 64-bit generation is now commonplace.

The pre-32-bit Intel CPUs, starting with 8086/8088, replaced the
\textit{single} register PC with a \textit{pair} of registers, CS (for
\textit{code segment}) and IP (for \textit{instruction pointer}). A
rough description is that the CS register pointed to the \textbf{code
segment}, which is the place in memory where the program's instructions
start, and the IP register then specified the \textit{distance} in bytes
from that starting point to the current instruction. Thus by combining
the information given in c(CS) and c(IP), we obtained the absolute
address of the current instruction.

This is still true today when an Intel CPU runs in 16-bit mode, in
which case it generates 20-bit addresses. The CS register is only 16
bits, but it represents a 20-bit address whose least significant four
bits are implicitly 0s. (This implies that code segments are allowed to
begin only at addresses which are multiples of 16.) The CPU generates
the address of the current instruction by concatenating c(CS) with four
0 bits\footnote{This is one 0 hex digit. Note too that this
concatenation is equivalent to multiplying by 16.  } and then adding the
result to c(IP).

Suppose for example the current instruction is located at 0x21082, and
the code segment begins at 0x21040. Then c(CS) and c(IP) will be 0x2104
and 0x0042, respectively, and when the instruction is to be executed,
its address will be generated by combining these two values as shown
above.

The situation is similar for stacks and data. For example, instead of
having a single SP register as in our model of a typical CPU above, the
earlier Intel CPUs (and current CPUs when they are running in 16-bit
mode) use a pair of registers, SS and SP. SS specifies the start of the
\textbf{stack segment}, and SP contains the distance from there to the
current top-of-stack.  For data, the DS register points to the start of
the \textbf{data segment}, and a 16-bit value contained in the
instruction specifies the distance from there to the desired data item.

Since IP, SP and the data-item distance specified within an instruction
are all 16-bit quantities, it follows that the code, stack and data
segments are limited to \( 2^{16}=65,536 \) bytes in size. This can make
things quite inconvenient for the programmer. If, for instance, we have
an array of length, say, 100,000 bytes, we could not fit the array into
one data segment. We would need two such segments, and the programmer
would have to include in the program lines of code which change the
value of c(DS) whenever it needs to access a part of the array in the
other data segment.

These problems are avoided by the newer operating systems which run on
Intel machines today, such as Windows and Linux, since they run in
32-bit mode.  Addresses are also of size 32 bits in that mode,
and IP, SP and data-item distance are 32 bits as well. Thus a code
segment, for instance, can fill all of memory, and segment switching as
illustrated above is unnecessary.

This continues to be the case for 64-bit machines, e.g. with 64-bit
addresses.

\subsection{The CPU Fetch/Execute Cycle}

After its power is turned on, the typical CPU pictured in Figure
\ref{internal} will enter its \textbf{fetch/execute cycle}, repeatedly
cycling through these three steps, i.e. Step A, Step B, Step C, Step A,
Step B, and so on:

\begin{itemize}

\item \textbf{Step A:} The CPU will perform a memory read operation, to
fetch the current instruction. This involves copying the contents of the
PC to the MAR, asserting a memory-read line in the control bus, and then
waiting for the memory to send back the requested instruction along the
data bus to the MDR. While waiting, the CPU will update the PC, to point
to the following instruction in memory, in preparation for Step A in the
next cycle.

\item \textbf{Step\ B:} The CPU will copy the contents of MDR to IR, so
as to free the MDR for further memory accesses. The CPU will inspect the
instruction in the IR, and \textbf{decode} it, i.e. decide what kind of
operation this instruction performs---addition, subtraction, jump, and
so on.

\item \textbf{Step C:} Here the actual execution of the operation
specified by the instruction will be carried out. If any of the operands
required by the instruction are in memory, they will be fetched at this
time, by putting their addresses into MAR and waiting for them to arrive
at MDR. Also, if this instruction stores its result back to memory, this
will be accomplished by putting the result in MDR, and putting into MAR
the memory address of the word we wish to write to. 

\end{itemize}

After Step C is done, the next cycle is started, i.e. another Step A,
then another Step B, and so on. Again, keep in mind that the CPU will
continually cycle through these three steps as long as the power remains
on. Note that all of the actions in these steps are functions of the
\textit{hardware}; the circuitry is designed to take these actions,
while the programmer merely takes advantage of that circuitry, by
choosing the proper machine instructions for his/her program.

A more detailed description of CPU operations would involve specifying
its \textbf{microsteps}.  These are beyond the scope of this book, and
the three-step cycle described above will give sufficient detail for our
purposes, though we \textit{will} use them later to illustrate the term
\textbf{clock speed}.


% \subsubsection{Example}
% 
% The Mac-1 CPU is intended to be simple but representative of CISC machines.
% It is a mythical machine designed by Andrew Tanenbaum, a well-known computer
% science professor and author in the Netherlands. If you are interested, you
% can learn more about it in the writeup of a Mac-1 simulator written by Professor
% Rosalee Nerheim-Wolfe at DePaul University in Chicago; see \url{http://heather.cs.ucdavis.edu/~matloff/depaulmicmac.html}
% 
% The Mac-1 has only three programmer-visible registers,\footnote{
% In our model above, MAR, MDR and IR are not considered to be visible to the
% programmer, because machine instructions cannot explicitly access them. Mac-1
% has such registers too.
% } a PC and SP, and an \textbf{accumulator} register, AC. This is a rather small
% number of registers, even for a CISC machine, so SP and AC tend to play multiple
% roles. SP works not only like a stack pointer but also as a kind of an index
% register. Like most accumlators, AC works as a data register, but it too plays
% other roles, as we will see. (We will go into more detail in the upcoming chapters.)
% Mac-1 has 16-bit words and 12-bit (i.e. three hex digits) addresses. Mac-1 has
% no instructions capable of addressing individual bytes, so only words have addresses;
% the addresses of contiguous words therefore differ by 1.
% 
% Let us illustrate the fetch/execute cycle with an example of Mac-1's \textbf{addd}
% ({}``add-direct{}'') instruction. Applied to a memory location x, \textbf{addd}
% adds c(x) to the current value in AC, and stores the sum back in AC, i.e. \textbf{addd}'s
% action is
% 
% \begin{verbatim}
% 
% c(AC)  <--  c(AC) + c(x)
% \end{verbatim}
% 
% The coding for \textbf{addd} consists of the bit string 0010, i.e. 0x2, concatenated
% with the 12-bit coding for x. If \textbf{addd} is to access memory location
% 0x1c4, for instance, then 0x21c4, i.e.
% 
% \begin{verbatim}
% 
% 0010000111000100
% \end{verbatim}
% 
% Again, a program's instructions are stored in main memory during the program's
% execution. Say this instruction 0x21c4 is stored at memory location 0x803, and
% suppose that currently AC and memory location 0x1c4 contain 0x0002 and 0x0003,
% respectively. Then the sequence of actions which will take place during the
% execution of the instruction is as follows: 
% 
% \textbf{Step A:} To fetch the instruction, the CPU will copy c(PC) to the MAR,
% so c(MAR) becomes 0x803. The CPU also increments the PC by 1, to 0x804, so that
% when the next Step A comes around, the next instruction after the one at 0x803
% will be fetched. The CPU asserts a read line in the control bus, and the memory
% responds by sending c(0x803) along the data bus, from which it flows into the
% MDR. The value in MDR now will be 0x21c4, the fetched instruction. The CPU will
% now look as in Figure 2.3:
% 
% \vspace{0.3cm}
% {\par\centering \includegraphics{Fig2.3.eps} \par}
% \vspace{0.3cm}
% 
% (The figure merely reports {}``old value{}'' for c(SP) and c(IR), meaning
% that those contents have not changed from their old values, values which are
% irrelevant to the discussion.)
% 
% \textbf{Step B:} The CPU copies c(MDR) to the IR, and starts the decoding. For
% the latter, the CPU notices that the first four bits of the instruction are
% 0010, the \textbf{op code} for \textbf{addd}. The CPU will now look as in Figure
% 2.4:
% 
% \vspace{0.3cm}
% {\par\centering \includegraphics{Fig2.4.eps} \par}
% \vspace{0.3cm}
% 
% \textbf{Step C:} The CPU looks at the lower 12 bits of the instruction, and
% finds them to have the value 0x1c4. Since the action of \textbf{addd} is to
% add the contents of main memory at that location to AC, the CPU must fetch those
% contents. Thus the CPU copies those 12 bits to the MAR, and asserts the read
% line in the control bus. The memory responds by sending c(0x124) down the data
% bus; c(MDR) will now be 3, i.e. 0000000000000011. The CPU will now input c(MDR)
% and c(AC) into the ALU, and have the ALU do an addition. The sum, 5, will now
% be routed back to the AC, and Step C and the instruction are finished. The CPU
% will now look as in Figure 2.5: 
% 
% \vspace{0.3cm}
% {\par\centering \includegraphics{Fig2.5.eps} \par}
% \vspace{0.3cm}
% 
% There now will be another Step A, starting the execution of the next instruction,
% and so on.
% 

\section{Software Components of the Computer {}``Engine{}''}

There are many aspects of a computer system which people who are at the
learning stage typically take for granted as being controlled by
hardware, but which are actually controlled by software. An example of
this is the backspace action when you type the backspace key on the
keyboard. You are accustomed to seeing the last character you typed now
disapppear from the screen, and the cursor moving one position to the
left. You might have had the impression that this is an inherent
property of the keyboard and the screen, i.e. that their circuitry was
designed to do this. However, for most computer systems today this is
not the case. The bare hardware will not take any special action when
you hit the backspace key. Instead, the special action is taken by
whichever \textbf{operating system} (OS) is being used on the computer.

The OS is software---a \textit{program}, which a person or group of
people wrote to provide various services to user programs. One of those
services is to monitor keystrokes for the backspace key, and to take
special actions (move the cursor leftward one position, and put a blank
where it used to be) when encountering that key. When you write a
program, say in C, you do not have to do this monitoring yourself, which
is a tremendous convenience. Imagine what a nuisance it would be if you
were forced to handle the backspace key yourself: You would have to
include some statements in each program you write to check for the
backspace key, and to update the screen if this character is
encountered. The OS relieves you of this burden.

But if you want this ``burden''---say you are writing a game or a text
editor and need to have characters read as they are typed, instead of
waiting for the user to hit the Enter key---then the OS also will
turn off the backspace, echo etc. if you request it.  In UNIX, this is
done via the ioctl() system call.

This backspace-processing is an example of one of the many services that an
OS provides. Another example is maintenance of a file system. Again the theme
is convenience. When you create a file, you do not have to burden yourself with
knowing the physical location of your file on the disk. You merely give the
file a name. The OS finds unused space on the disk to store your file, and enters
the name and physical location in a table that the OS maintains. Subsequently,
you may access the file merely by specifying the name, and the OS service will
translate that into the physical location and access the file on your behalf.
In fact, a typical OS will offer a large variety of services for accessing files.

So, a user program will make use of many OS services, usually by calling
them as functions. For example, consider the C-language function
\textbf{scanf()}.  Though of course you did not write this function
yourself, someone did, and and in doing so that person (or group of
people) relied heavily on calls to an OS subroutine, \textbf{read()}. In
terms of our {}``look under the hood{}'' theme, we might phrase this by
saying that a look under the hood of the C \textbf{scanf()} source code
would reveal system calls to the OS function \textbf{read()}. For this
reason, the OS is often referred to as {}``low-level{}'' software. Also,
this reliance of user programs on OS services shows why the OS is
included in our {}``computer engine{}'' metaphor---the OS is indeed one
of the sources of {}``power{}'' for user programs, just as the hardware
is the other source of power. 

To underscore that the OS services do form a vital part of the
computer's {}``engine,{}'' consider the following example. Suppose we
have a machine-language program---which we either wrote ourselves or
produced by compiling from C---for a DEC computer with a MIPS CPU.
Could that program be run without modification on a Silicon Graphics
machine, which also uses the MIPS chip? The answer is no. Even though
both machines do run a Unix OS, there are many different {}``flavors{}''
of Unix. The DEC version of Unix, called Ultrix, differs somewhat from
the SGI version, called IRIX. The program in question would probably
include a number of calls to OS services---recall from above that even
reads from the keyboard and writes to the screen are implemented as OS
services---and those services would be different under the two OSs.

Thus even though individual instructions of the program written for the DEC
would make sense on the SGI machine, since both machines would use the same
type of CPU, some of those instructions would be devoted to OS calls, which
would differ.

Since an OS consists of a program, written to provide a group of services, it
follows that several different OSs---i.e. several different programs which offer
different groups of services---could be run on the same hardware. For instance,
this is the case for PCs. The most widely used OS for these CPUs is Microsoft
Windows, but there are also several versions of Unix for PCs, notably the free,
public-domain Linux and the commercial SCO.


\section{Speed of a Computer {}``Engine{}''}

What factors determine the speed capability of a computer engine? This is an
extremely complex question which is still a subject of hot debate in both academia
and the computer industry. However, a number of factors are clear, and will
be introduced here. The presentation below will just consist of overviews, and
the interested reader should pursue further details in books on computer architecture
and design. However, it is important that the reader get some understanding
of the main issues now; at the very least enough to be able to understand newspaper
PC ads! The discussion below is aimed at that goal.


\subsection{CPU Architecture}

Different CPU types have different instruction and register sets. The Intel
family, for example, has a very nice set of character-string manipulation instructions,
so it does such operations especially quickly. This is counterbalanced somewhat
by the fact that CPUs in this family have fewer registers than do CPUs in some
other families, such as those in most of the newer machines.\footnote{
This does not include the newer Intel chips, such as the Pentium, because these
chips had to be designed for compatibility with the older ones.
} Since registers serve as local---and thus fast---memory, the more of them we
have, the better, and thus the Intel family might be at a disadvantage in this
respect.


\subsection{Parallel Operations}

One way to get more computing done per unit time is do several things at one
time, i.e. in parallel. Most modern machines, even inexpensive ones such as
personal computers, include some forms of parallelism.

For example, most CPUs perform \textbf{instruction prefetch}: During
execution of the current instruction, the CPU will attempt to fetch one
or more of the instructions which follow it sequentially---i.e. at
increasing addresses---in memory.\footnote{ The prefetch may insted be
from the \textbf{cache}, which we will discuss later.  } For example,
consider an Intel chip running in 16-bit mode. In this mode the chip has
20-bit addresses. Consider a two-byte instruction at location 0x21082.
During the time we are executing that instruction the CPU might attempt
to start fetching the next instruction, at 0x21084. The success or
failure of this attempt will depend on the duration of the instruction
at 0x21082. If this attempt is successful, then Step A can be skipped in
the next cycle, i.e. the instruction at 0x21084 can be executed earlier.

Of course, the last statement holds only if we actually do end up executing
the instruction at 0x21084. If the instruction at 0x21082 turns out to be a
jump instruction, which moves us to some other place in memory, we will not
execute the instruction at 0x21084 at all. In this case, the prefetching of
this instruction during the execution of the one at 0x21082 would turn out to
be wasted. Modern CPUs have elaborate jump-prediction mechanisms to try to deal
with this problem.

Intel CPUs will often fetch \textit{several} downstream instructions, while
for RISC CPUs we might be able to fetch only one. The reason for this is that
RISC instructions, being so simple, have such short duration that there just
is not enough time to fetch more than one instruction.

Instruction prefetching is a special case of a more general form of parallelism,
called \textbf{pipelining}. This concept treats the actions of an instruction
as being like those of an assembly line in a factory. Consider an automobile
factory, for instance. Its operation is highly parallel, with the construction
of many cars being done simultaneously. At any given time, one car, at any early
stage in the assembly line, might be having its engine installed, while another
car, at a later stage in the line, is having its transmission installed. Pipelined
CPUs (which includes virtually every modern CPU) break instruction execution
down into several stages, and have several instructions executing at the same
time, in different stages. In the simple case of instruction prefetch, there
would be only two stages, one for the fetch (Step A) and the other for execution
(Steps B and C).

Most recently-designed CPUs are \textbf{superscalar}, meaning that a CPU
will have several ALUs. This is another way in which we can get more
than one instruction executing at a time.

Yet another way to do this is to have multiple CPUs! In
\textbf{multiprocessor} systems with n CPUs, we can execute n
instructions at once, instead of one, thus potentially improving system
speed by a factor of n. Typical speedup factors in real applications are
usually much less than n, due to such overhead as the need for the
several CPUs to coordinate actions with each other, and not get in each
other's way as they access memory. 

The classical way of constructing multiprocessor systems was to connect
several CPUs to the system bus.  However, as of 2006, it is common for
even many low-end CPU chips to be {\bf dual core}, meaning that the chip
actually contains {\it two} CPUs.  This has brought multiprocessor
computing into the home.

It used to be very expensive to own a multiprocessor machine.  The CPUs
would all be connected to the bus, which increases manufacturing costs,
etc.  But by having more than once CPU on a chip, the expense is small.

Note that chip manufacturers have found that it makes better economic
sense now for them to use chip space for extra processors, rather than
continuing to increase processor speed.  It also becomes quite difficult
to keep increasing processor speed.  Even now, chips are often designed
to be working on two different clock cycles at once---one part of the
chip is working on the next cycle while the other part still hasn't
received the clock signal for the current cycle.  Obviously this
requires extremely careful engineering.

\subsection{Clock Rate}

Recall that each machine instruction is implemented as a series of
\textbf{microsteps}.  Each microstep has the same duration, namely one
\textbf{clock cycle}, which is typically set as the time needed to
transfer a value from one CPU register to another. The CPU is paced by a
\textbf{CPU clock}, a crystal oscillator; each pulse of the clock
triggers one microstep.

In 2002, clock rates of over 1 \textbf{gigahertz}, i.e. 1 billion cycles
per second, became common in PCs. This is quite a contrast to the 4.77
megahertz clock speed of the first IBM PC, introduced in 1981.

Each instruction takes a certain number of clock cycles. For example, an addition
instruction with both operands in CPU registers might take 2 clock cycles on
a given CISC CPU, while a multiply instruction with these operands takes 21
clock cyles on the same CPU. If one of the operands is in memory, the time needed
will increase beyond these values.

RISC machines, due to the fact that they perform only simple operations (and
usually involving only registers), tend to have instructions which operate in
a single clock cylce.

The time to execute a given instruction will be highly affected by clock
rate, even among CPUs of the same type. For example, as mentioned above,
a register-to-register addition instruction might typically take 2
microsteps on a CISC CPU. Suppose the clock rate is 1 gigahertz, so that
a clock cycle takes \( 10^{-9} \) seconds, i.e. 1
\textbf{nanosecond}.\footnote{A nanosecond is a billionth of a second.
} Then the instruction would take 2 nanoseconds to execute.

Within a CPU family, say the Intel family, the later members of the
family typically run a given program much faster than the earlier
members, for two reasons related to clock cycles:

\begin{itemize}

\item Due to advances in fabrication of electronic circuitry, later
members of a CPU family tend to have much faster clock rates than do the
earlier ones.

\item Due to more clever algorithms, pipelining, and so on, the later
members of the family often can accomplish the same operation in fewer
microsteps than can the earlier ones. 

\end{itemize}

\subsection{Memory Caches}

\subsubsection{Need for Caching}

In recent years CPU speeds have been increasing at very impressive
rates, but memory access speeds have not kept pace with these increases.
Remember, memory is {\it outside} the CPU, not in it.  This causes slow
access, for various reasons.

\begin{itemize}

\item The physical distance from the CPU is on the order of inches,
whereas it is typically less than an inch within the CPU.  Even though
signals travel at roughly 2/3 the speed of light, there are so many
signals sent per second that even this distance is a factor causing
slowdown.

\item An electrical signal propagates more slowly when it leaves a CPU chip
and goes onto the bus toward memory.  This is due to the bus wires being
far thicker than the very fine wires within the CPU.

\item Control buses typically  must include \textbf{handshaking} lines,
in which the various components attached to the bus assert to coordinate
their actions.  One component says to the other, ``Are you ready,'' and
that component cannot proceed until it gets a response from the other.
This causes delay.

\end{itemize}

\subsubsection{Basic Idea of a Cache}

To solve these problems, a storage area, either within the CPU or near
it or both, is designed to act as a \textbf{memory cache}. The cache
functions as a temporary storage area for a \textit{copy} of some small
subset of memory, which can be accessed locally, thus avoiding a long
trip to memory for the desired byte or word.\footnote{Modern computers
typically have two caches, an {\bf L1} cache inside the CPU, and an
{\bf L2} cache just outside it.  When the CPU wants to access a byte or
word, it looks in L1 first; if that fails, it looks in L2; and if that
fails, it goes to memory.  Here, though, we will assume only an L1
cache.}

Say for example the program currently running on the CPU had in its
source file the line

\begin{verbatim}
char x;
\end{verbatim}

and suppose that {\bf x} is stored in byte 1205 of memory, i.e.  {\bf
\&x} is 1205.  Consider what occurs in each instruction in the program
that reads {\bf x}.  Without a cache, the CPU would put 1205 on the
address bus, and assert the MEMR line in the control bus.  The memory
would see these things on the bus, and copy whatever is in location 1205
to the data bus, from which the CPU would receive it.

If we did have a cache, the CPU would look first to see if the cache
contains a copy of location 1205.  If so, the CPU does not have to access
memory, a huge time savings.  This is called a cache {\bf hit}.  If the
CPU does not find that the cache contains a copy of location 1205---a
cache {\bf miss}---then the CPU must go to memory in the normal manner.

\subsubsection{Blocks and Lines}

Memory is partitioned into \textbf{blocks}, of fixed size.  For
concreteness, we will assume here that the block size is 512 bytes.
Then Bytes 00000-000511 form Block 0, Bytes 00512-001023 form Block 1,
and so on.  A memory byte's block number is therefore its address
divided by the block size.  Equivalently, since $log_2 512 = 9$, if our
address size is 32 bits, then the location's block number is given in
the most-significant 23 bits of the address, i.e. all the bits of the
address except for the lower 9.  The lower 9 bits then give location of
the byte within the block.\footnote{\label{class} Suppose we have 10  
classrooms, each with 10 kids, sitting in 10 numbered seats, according
to the kids' ID numbers.  Kids 0-9 are in classroom 0, 10-19 are in
classroom 1, and so on.  So for instance kid number 28 will be in
classroom 2, sitting in seat 8 of that room.  Note how the 2 comes from
the upper 1 digit of the ID number, and the 8 comes from the lower 1
digit---just like the block number of a byte is given by the upper 23
bits in the address, while the byte number within block is given by the
lower 9 bits.} Note that the partitioning is just conceptual; there is
no ``fence'' or any other physical boundary separating one block from
another.

This organization by blocks comes into play in the following manner.
Consider our example above in which the CPU wanted to read location
1205.  The block number is $\lfloor 1205/512 \rfloor = 2$.  The CPU
would not really search the cache for a copy of byte 1205 or even word
1205.  Instead, the CPU would search for a copy of the entire block
containing location 1205, i.e. a copy of block 2.

The cache is partitioned into a fixed number of slots, called
\textbf{lines}. Each slot has room for a copy of one block of memory,
plus an extra word in which additional information is stored.  This
information includes a {\bf tag}, which states which block has a copy in
memory right now.  This is vital, of course, as otherwise when the CPU
looked in the cache, it wouldn't know whether this line contains a copy
of the desired block, as opposed to a copy of some other block.

At any given time, the cache contains copies of some blocks from memory,
with different blocks being involved at different times.  For example,
suppose again the CPU needs to read location 1205.  It then searches the
cache for a copy of block 2.  In each line in which it searches, the CPU
would check the tag for that line.  If the tag says ``This line contains
a copy of block 2,'' then the CPU would see that this is a hit.  The CPU
would then get byte 1205 from this line as follows.  As we saw above,
byte 1205 is in block $\lfloor 1205/512 \rfloor = 2$, and using the same
reasoning,\footnote{Recall note \ref{class}.  Make SURE you understand
this point.} {\it within} that block, this byte is byte number 1205 mod
512 = 181.  So, the CPU would simply look at the $181^{st}$ byte in this
line.\footnote{This means the block begins with its ``$0^{th}$'' byte.} 

If on the other hand, the CPU finds that block 2 does not have a copy in
the cache, the CPU will read the {\it entire} block 2 from memory, and
put it into some cache line.  It may have to remove a copy of some other
block in doing so; the term used for this is that that latter block copy
is {\bf evicted}.

The total number of lines varies with the computer; the number could be
as large as the thousands, or on the other extreme less than ten.  The
larger the cache, the higher the hit rate.  On the other hand, bigger
caches are slower, take up too much space on a CPU chip in the case of
on-chip caches, and are more expensive in the case of off-chip caches.

\subsubsection{Direct-Mapped Policy}

One question is which lines the CPU should search in checking whether
the desired block has a copy in the cache.  In a {\bf fully-associative}
cache, the CPU must inspect all lines.  This gives maximum flexibility,
but if it is large then it is expensive to implement and the search
action is slower, due to the overhead of complex circuitry.

The other extreme is called {\bf direct-mapped}.\footnote{A compromise
scheme, called {\bf set-associative}, is very common.}  Here the CPU
needs to look in only {\it one} cache line.  By design, the desired
block copy will either be in that particular line or not in any line at
all.  The number of the particular line to be checked is the block
number mod the number of lines.  Let us suppose for illustration that
there are 32 lines in the cache.  In our example above, byte 1205 was in
block 2, and since 2 mod 32 = 2, then to check whether block 2 has a
copy in the cache, the CPU needs to look only at line 2.  If the tag in
line 2 says ``This is a copy of block 2,'' then the CPU sees we have a
hit.  If not, the CPU knows it is a miss.  The basic design of this
direct-mapped cache is that a copy of block 2 will either be in the
cache at line 2 or else the block will not have a copy in the cache at
all.

Now suppose after successfully reading byte 1205, at some time later in
the execution of this program we need to read byte 17589.  This is block
$\lfloor 17589/512 \rfloor = 34$.  Since 34 mod 32 = 2, the CPU will
again look at line 2 in the cache.  But the tag there will say that this
line currently contains a copy of block 2.  Thus there will be a miss.
The CPU will bring in a copy of block 34, which will replace the copy of
block 2, and the tag will be updated accordingly.

\subsubsection{What About Writes?}
\label{write}

One issue to be resolved is what to do when we have a cache hit which is
a write.  The issue is that now the copy of the block in the cache will
be different from the ``real'' block in memory.  Eventually the two must
be made consistent, but when?  There are two main types of policies:

\begin{itemize}

\item Under a {\bf write-through} policy, any write to the cache would
also be accompanied by an immediate corresponding write to memory.

\item Under a {\bf write-back} policy, we do not make the memory block
consistent with its cache copy until the latter is evicted.  At that
time, the entire block is written to memory.

\end{itemize}

Which policy is better will depend on the program being run.  Suppose
for example we only write to a cache line once before it is evicted.
Then write-back would be rather disastrous, since the entire block would
be written to memory even though the inconsistency consists of only
one byte or word.  On the other hand, in code such as

\begin{verbatim}
for (i = 0; i < 10000; i++) 
   sum += x[i];
\end{verbatim}

if {\bf sum} is the only part of its cache line being used, we only care
about the final value of {\bf sum}, so updating it to memory 10,000
times under a write-through policy would be highly inefficient.

\subsubsection{Programmability}
\label{able}

As noted in Section \ref{write}, the behavior of a particular write
policy depends on which program is being run.  For this reason, many
CPUs, e.g. the Pentium, give the programmer the ability to have some
blocks subject to a write-through policy and others subject to
write-back.

Similarly, the programmer may declare certain blocks not cacheable at
all.  The programmer may do this, for example, if the programmer can see
that this block would have very poor cache behavior.  Another reason for
designating a block as noncacheable is if the block is also being
accessed by a DMA device.

\subsubsection{Details on the Tag and Misc. Line Information}

Consider line 2 in the cache.  We will search that line if the number of
the block we want is equal to 2, mod 32.  That means that the tag, which
states the number of the block copy stored in this line, must have 00010
as its last 5 bits.  In view of that, it would be wasteful to store
those bits.  So, we design the tag to consist only of the remainder of
the block number, i.e. the upper 18 bits of the block number.

Let's look at our example 17589 above again.  Recall that this is in
block 34 = 00000000000000000100010.  Since the last 5 bits are 00010,
the CPU will examine the tag in cache line 2.  The CPU will then see
whether the tag is equal to 000000000000000001, the upper 18 bits of 34.
If so, it is a hit.

At the beginning, a line might not contain any valid data at all.  Thus
we need another bit in the line, called the Valid Bit, which tells us
whether there is valid data in that line (1 means valid).

If a write-back policy is used, we need to know whether a given cache
line has been written to.  If it has, then the line must be copied back
to memory when it is evicted, but if not, we would want to skip that
step.  For this purpose, in addition to the tag, the line maintains a
Dirty Bit, ``dirty'' meaning ``has been written to.''

Since we saved some bits in the storage of the tag, we use that space
for the Valid and Dirty Bits. 

% \subsubsection{Transparency to the Programmer}
% 
% It is important to note that data registers and a memory cache, though
% both having the goal of avoiding memory access, are very different in
% their {}``visibility{}'' to the programmer. The programmer directly
% accesses the registers, by including instructions in his/her program to
% access them.  A cache, on the other hand, is \textbf{transparent} to the
% programmer, meaning that the CPU does all the management of the cache,
% i.e. checking for hit/miss, doing the block replacement upon finding a
% miss, etc.  As far as the programmer is concerned, it \textit{appears}
% as if all memory accesses really are to main memory, rather than some of
% them being intercepted by the cache.\footnote{The exception, of course,
% is that the programmer can get actively involved, as seen in Section
% \ref{able}.  But this is rarely used, and would usually not be available to
% ordinary programmers anyway, since these instructions require a privileged
% user, i.e. the OS.} 
% 
% However, even though the actions of the cache are transparent to the
% programmer, it is important that the programmer recognize the existence
% of a cache in designing his/her code. For example, consider the
% following two equivalent pieces of C code, which both calculate the sum
% of all elements in a two-dimensional array declared to have M rows and N
% columns of integers:
% 
% \begin{verbatim}
% 
% Version I:
% 
% Sum = 0;
% for (I = 0; I < M; I++)
%    for (J = 0; J < N; J++)
%       Sum += X[I][J];
% 
% 
% Version II: 
%  
% Sum = 0;
% for (J = 0; J < N; J++)
%    for (I = 0; I < M; I++)
%       Sum += X[I][J];
% \end{verbatim}
% 
% Both versions are {}``correct,{}'' i.e. will give the same answer.
% However, they access memory in completely different patterns, that is,
% in different orders.  
% 
% Suppose, for simplicity: The machine has 16-bit words; block size is 64
% bytes; the cache is associative and has 2 cache lines; M = 4 and N = 64;
% the compiler has placed the array X to begin at location 0x0c80, i.e.
% 3200 base-10, right at the beginning of Block 50,\footnote{The reader
% should ponder how the analysis below would change if the starting
% address of X were, say, 3204.} since 3200/64 = 50; the array X is
% global and the compiler stores it in nonreverse order, i.e.
% X{[}0{]}{[}0{]} is at location 0x0c80, X{[}0{]}{[}1{]} is at 0x0c82, and
% so on.  Suppose also that {\bf Sum} is stored in a register (via C's
% {\bf register} construct) and that there is a separate cache for
% instructions.
% 
% Assume that the cache is initially empty. Then the reader should verify that
% Version I of the code will result in eight cache misses, but Version II will
% generate a lot more---256, a hit rate of 0! Here are the first few actions in
% the case of Version II:
% 
% \begin{itemize}
% \item [1.] Cache is empty.
% \item [2.] Program tries to access X{[}0{]}{[}0{]}, resulting in a cache miss.
% \item [3.] CPU loads the whole block containing X{[}0{]}{[}0{]}, which will be X{[}0{]}{[}0{]},
% X{[}0{]}{[}1{]}, X{[}0{]}{[}2{]}, ..., X{[}0{]}{[}31{]}, into the first cache
% line.
% \item [4.] Program tries to access X{[}1{]}{[}0{]}, resulting in another cache miss.
% \item [5.] CPU loads the whole block containing X{[}1{]}{[}0{]}, which will be X{[}1{]}{[}0{]},
% X{[}1{]}{[}1{]}, X{[}1{]}{[}2{]}, ..., X{[}1{]}{[}31{]}, into the second cache
% line.
% \item [6.] Program tries to access X{[}2{]}{[}0{]}, resulting in another cache miss.
% \item [7.] CPU loads the whole block containing X{[}2{]}{[}0{]}, which will be X{[}2{]}{[}0{]},
% X{[}2{]}{[}1{]}, X{[}2{]}{[}2{]}, ..., X{[}2{]}{[}31{]}, into the first cache
% line, since it is least recently used.
% \item [8.] Et cetera. 
% \end{itemize}
% 
% So, two supposedly equivalent pieces of code might have very
% \textit{different} run times. Once again, it does pay to {}``look under
% the hood.{}''

\subsubsection{Why Caches Usually Work So Well}

Experience shows that most programs tend to reuse memory items
repeatedly within short periods of time; for example, instructions
within a program loop will be accessed repeatedly. Similarly, they tend
to use items which neighbor each other (as is true for the data accesses
in Version I above), and thus they stay in the same block for some time.
This is called the principle of {\bf locality of reference}, with the
word ``locality'' meaning ``nearness''---nearness in time or in space.

For these reasons, it turns out that cache misses tend to be rare.  A
typical hit rate is in the high 90s, say 97\%.  This is truly remarkable
in light of the fact that the size of the cache is tiny compared to the
size of memory.  Again, keep in mind that this high hit rate is due to
the locality, i.e. the fact that programs tend to NOT access memory at
randomly-dispersed locations.

% {\bf Multi-level Caches:}
% 
% Most modern CPU chips have internal caches. In addition, it is common to
% also include a larger, secondary cache.  The primary cache, on the CPU
% chip, is called the {\bf Level 1} (or L1) cache.  A secondary {\bf Level
% 2} (or L2) cache would be external to the CPU but between the CPU and
% the system bus.  Of course, we could add further levels too.
% 
% Say we have both an L1 and L2 cache.  The contents of L1 will be a
% subset of those of L2, which in turn will be some subset of memory.
% Whenever the CPU generates a memory request, it checks L1 first.  If a
% miss occurs there then L2 is checked; only if the item is missing there
% too will memory be accessed.  Similarly, if L1 does a write-back, it
% will be to L2, not to memory; later, if the corresponding line in L2 is
% evicted, {\it it} will be written back to memory.
% 
% The reason for the contents of L1 being a subset of that of L2 is as
% follows.  Say a block has been evicted from L1.  It still may be the
% case that we need that block again in the near future, so by having it
% in L2 it will be still quickly accessible.
% 
% Thus L2 will be larger than L1.  Since that means it will also be more
% expensive, it is generally made of cheaper, i.e. slower memory than that
% of L1.
% 
% Ironically, even though L2's larger size might seem to imply a higher
% hit rate than for L1, L2's hit rate will probably be lower.  The reason
% for this is that the memory accesses which are not caught by L1 are
% likely to be much more randomly dispersed in memory; since L2 handles
% these accesses, it faces a lower degree of locality, and thus a lower
% hit rate.


\subsection{Disk Caches}

We note also that one of the services an OS might provide is that of a
\textbf{disk cache}, which is a software analog of the memory caches
mentioned earlier. A disk is divided into \textbf{sectors}, which on PCs
are 512 bytes each in size.  The disk cache keeps copies of some sectors
in main memory. If the program requests access to a certain disk sector,
the disk cache is checked first. If that particular sector is currently
present in the cache, then a time-consuming trip to the disk, which is
much slower than accessing main memory, can be avoided.\footnote{Note,
though, that the access to the cache will still be somewhat slower than
the access to a register.  Here is why:

Suppose we have 16 registers.  Then the register number would be
expressed as a 4-bit string, since $16 = 2^4$.  By contrast, say our
cache contains 1024 lines, which would mean that we need 10 bits to
specify the line number.  Digital circuity to decode a 10-bit ID is
slower than that for a 4-bit ID.  Moreover, there would be considerable
internal circuitry delays within the cache itself.}  Again this is
transparent to the programmer. The program makes its request to the
disk-access service of the OS, and the OS checks the cache and if
necessary performs an actual disk access. (Note therefore that the disk
cache is implemented in software, as opposed to the memory cache, which
is hardware, though hardware disk caches exist too.) 

\subsection{Web Caches}   

The same principle is used in the software for Web servers.  In order to
reduce network traffic, an ISP might cache some of the more popular Web
pages.  Then when a request for a given page reaches the ISP, instead of
relaying it to the real server for that Web page, the ISP merely sends
the user a copy.  Of course, that again gives rise to an update problem;
the ISP must have some way of knowing when the real Web page has been
changed.