

\documentstyle{article}

\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\topmargin}{0.0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9.1in}
\setlength{\parindent}{0in}
\setlength{\parskip}{0.12in}

\begin{document}

\newcommand{\hsp}{\hspace{0.035in}}

\begin{center}
{\LARGE\bf The Problem of Memory Access Speed

\copyright N.S. Matloff, 1993}
\end{center}

\bigskip

\section{Overview}

Processor speeds have been increasing rapidly, while memory-access
speeds have been increasing less rapidly.  Problems such as the
following arise:

\begin{itemize}
\item it is difficult/expensive to have memory chip response time
as low as a single CPU clock cycle (e.g. the latter is 20 ns for a
50 MHz processor, and 70 ns memories are typical affordable ones today

\item off-chip signal propagation is slower than for the on-chip case;
bus signal propagation in particular is slow---the more devices which are
attached to the bus, the more time it takes to produce a change in a bit

\item in a {\bf synchronous} bus, coordination between CPU, memory and I/O 
components is done by having a bus clock and relying on certain events 
(e.g. when a memory read is finished) occurring at specific times; this
may slow things down, as the clock may have to be set for the slowest
device attached to the bus

\item in an {\bf asynchronous} bus, coordination between devices attached to
the bus is done by asserting special {\bf handshaking} lines---``I want to
read,'' ``OK, here it is,'' ``OK, got it,'' etc.---which add to access time
\end{itemize}

{\bf Solutions:}  There are two general approaches to these problems.
One is to implement at least some memory in some very fast-access form.
The other approach is to ``hide'' the problem, by performing memory access
activities {\it in parallel with} other activities.  Some examples 
follow.

\section{Registers}

Registers, being right there in the CPU and therefore fast-access, are
a form of fast memory.  It is thus desirable to have as many registers
as possible (everything else being equal). 

\section{Instruction/Data Access in Parallel with Execution of Other
Instructions}

\subsection{Instruction Prefetch}

This involves fetching the next instruction(s) while executing the
current one.  Most modern CPUs do this; for example, the Intel 386
has a 16-byte prefetch buffer.

{\bf Problem:}  If there is a (taken) branch, the prefetch is wasted.

\subsection{Asynchronous Load/Store Operations}

For example, consider the following generic assembly language (in which
R2 is a register and X is a location in memory):

\begin{verbatim}
LOAD R2,X     (R2 <-- X)
ADD R3,R4
ADD R3,R2
\end{verbatim}

We can design the CPU so that execution of the instructions following the
LOAD proceed while the LOAD is still pending.  Until the LOAD operation
completes, these instructions will be executed if they are ``independent,''
i.e. do not depend on R2.  Thus ADD R3,R4 would be executed but the next
ADD would not be.

{\bf Problems:}  First, the CPU design process becomes much more complex.
Each register must get an extra bit to indicate that it is ``busy,'' and
logic must be installed to set such bits, and to check them before executing
any given instruction.  

Second, there may not be very many, if any,
``independent'' instructions which follow the LOAD or STORE.  This problem
can be ameliorated somewhat, though, by the compiler, which can search for
``independent'' instructions which can safely moved (i.e. moved without
affecting correct operation of the program) to a position following the
LOAD or STORE.

For example, if the original code had been

\begin{verbatim}
LOAD R2,X    
ADD R3,R4
ADD R3,R2
ADD R3,R6
\end{verbatim}

the compiler\footnote{Or more precisely, the optimizing portion of the
compiler, which usually does its work as postprocessing after the ordinary
compilation is done.} could change the code to

\begin{verbatim}
LOAD R2,X
ADD R3,R4
ADD R3,R6
ADD R3,R2
\end{verbatim}

Note that a similar change could have been done if ADD R3,R6 had 
originally been at the line just above the LOAD.

\section{Cache Memory}

\subsection{Overview}

A memory cache is a small, fast memory which contains copies of various
portions of main memory (different portions at different times).\footnote{This
should not be confused with a disk cache, which contains copies of some of
the disk sectors, not copies of portions of main memory, and which is
handled in software, not hardware.}  Whenever the CPU wants to access
memory, it checks the cache first.  If the item is in the cache, then
the CPU has saved a time-consuming trip to memory; if not, though, the
CPU must go to memory for the item, and bring the item into the cache,
at the same time evicting some other item in the cache in order to make
room for the new one.

If the CPU finds what it wants in the cache, this is termed a {\bf cache
hit}, otherwise a {\bf cache miss}.

The cache could either be a part of the CPU itself, or a separate component
in between the CPU and the system bus.  Many PCs today, for example, have
both:  The Intel 486 CPU has a small cache on the processor chip, and
many vendors add a larger cache externally.

Note that caches derive their speed from two factors:

\begin{itemize}
\item Proximity.  An internal cache is obviously ``close'' to the CPU,
since it is \underline{inside} the CPU, but even an external cache has
a proximity advantage, since we at least avoid the bus delays mentioned
previously (though we still incur an off-chip delay penalty).

\item Faster electronics technology.  Since the cache is smaller, we can
afford to use expensive, fast memory (say 20 ns) for its implementation.
\end{itemize}

Note that the cache can contain anything which is in memory, i.e. not only
data but also instructions.  Some machines even have separate caches for
instructions and data.

\subsection{Blocks and Lines}

Think of the memory as being divided into {\bf blocks}, say of 512 bytes
each.  Bytes 0-511 would be block 0, 512-1023 would be block 1, and so
on.  In general, location i of memory would be in block i div 512.

The cache is divided into {\bf lines}.  Each line holds one block of
memory at any given time (again, different blocks at different times).
Each line has a {\bf tag} which indicates which block copy is currently
in the line.

\subsection{Read and Write Operations}

If the CPU wants to do a memory read of location i, it first looks for
i in the cache.  To do so, it first computes the number b of the block to
which i belongs, by finding b = i div 512.\footnote{Note that no actual
division is done here, since 512 is a power of 2.  For a 32-bit-address
machine, for example, the most significant 23 bits of i would be b, so
the digital logic would merely copy those bits, obviating the need for
an actual division.}  It will then search the cache for a copy of block
b.  If this search is successful, then the copy of location i will be
obtained from the line in which the block copy was found.\footnote{In
this case, the lower 9 bits of i would be used.  In other words, a cache
line looks just like ordinary RAM (albeit faster), but with fewer
address bits, in this case only 9 instead of 32.} 
On the other hand, if
the search is unsuccessful, then a copy of block b will be read from
memory---the entire block, not just i---and placed in some cache line,
displacing whatever block copy had been there previously. 

Writes, though, are more complicated.  The first few steps are the same
as those for reads:  The cache will be searched for the given block, and
if not found, the block will be read in from memory.  However, once the
write is made to i, there will be a discrepancy between the copy of i
and its ``real'' version in memory, with the latter being incorrect.
Two design choices for handling this problem are common:  

\begin{itemize}
\item

Write-Through Policy:  Under this policy, the copy of i in memory is
immediately updated.

\item 

Write-Back Policy:  Under this policy, we merely make a ``promise'' to update
i in memory later on---specifically, at the time this block copy in the cache
is evicted; at that time, the entire line will be written to that block
in memory (even though most of the block will probably have not been
changed).

The ``promise'' mentioned above takes the form of a {\bf dirty bit},
which will be set to 1 whenever a write is done.  Eventually this
block copy will be evicted, at which time the dirty bit will be checked;
if that bit is 1, then the block will be written back to memory (and
of course, at that time the bit will be cleared to 0).
\end{itemize}
Which policy is better?  It will depend on which program we are running.
Consider a loop which contains the instruction

\begin{verbatim}
STORE X,R2   (X <-- R2)
\end{verbatim}

Suppose the loop has a very large number of iterations.  In this case the
write-through method will be wasteful in its writes of X to main memory;
at the time this block is finally evicted, we care about only the 
\underline{last} value written for X, so the previous writes were
unnecessary.\footnote{This assumes that there is only one CPU.  In a
multiprocessor system, the situation would change completely, because
other processors may need the up-to-date value of X.  This leads to the
issue of {\bf cache coherency}, i.e. consistency of values between a
cache and main memory, and between one cache and other caches which
contain a value for that block.}  In this case, the write-through
method might be better.

On the other hand, suppose the loop iterates only once or twice.  Then
the write-back method is wasteful since, at the time the block is evicted,
it may be the case that only X has changed, so it is wasteful to write
the entire block back to memory. 

\subsection{How the Search for a Block Copy is Done}

We will look at three design alternatives.  

Let l denote the number of lines in the cache.  Suppose the CPU wants to
access address i in memory, which is in block b.

\subsubsection{Direct-Mapped Search}

Under this scheme, block b will either have a copy in line b mod l or
will have no copy anywhere in the cache.  So, the ``search'' in the
cache for block b will be extremely simple:  Look at the tag for
line b mod l, and see whether the tag is equal to b.  

Note that it is equally simple to decide which block to evict when a
cache miss occurs:  The new block will go into line c mod l, where c
is the number of the new block; whatever is in line c mod l currently
will get replaced.

The simplicity of the direct-mapped approach is quite attractive, as
it makes the design small, cheap and fast.

\subsubsection{Associative-Mapped Search}

In this scheme, the search for block b is done at \underline{all} lines!
And, since a sequential search of all lines would be intolerably long,
all lines are searched simultaneously.  Thus, the digital logic involved
is much more complex (so the cache takes up much more space, has more
gate delay, is more expensive, etc.).

The term {\bf associative memory} means memory which is accessed by
content rather than by address.  Here in the cache setting, for instance,
we are looking at all tags (they're the memory here), saying, ``Get me
the number (``address'') of the tag whose content is b.''

If a cache miss occurs, what block is evicted?  There are many policies
for this in the associative-mapped setting, but most of them are variations
on the Least Recently Used (LRU) policy; the name is self-explanatory, i.e.
whichever currently-resident block has not been accessed for the longest
time will be evicted.

The associative-mapped policy is more flexible than the direct-mapped one.
Consider the case l = 8, for instance.  Then we could not simultaneously
have two blocks in the cache whose numbers differed by a multiple of 8, since
they would be mapped to the same line.

\subsubsection{Set-Associative-Mapped Search}

This is a compromise between the previous two.  We have {\bf sets} of
lines, say k lines per set.  If block b is in the cache, it will be in
set s = (b mod l/k).  Then, the entire set s is searched associatively.

\subsection{Who Does What?}

Consider a system whose only cache is one external to the CPU, between the
CPU and the bus.  Note carefully that all the cache operations we are
referring to here---search for a block copy, actions taken upon a cache
miss, etc.---are completely {\bf transparent} to the CPU.  The CPU doesn't
``know'' what is happening here, and indeed thinks that it is accessing
main memory.  Any handshaking signals which normally would be sent by
the main memory to the CPU are sent by the cache instead, and the CPU
will not be able to tell the difference (except of course, that they
come more quickly).

If the cache is internal to the CPU, then of course the above description
doesn't apply, since the cache is part of the CPU.  Thus for example,
cache operations are certainly not transparent to the CPU.

In any case, though, the cache operations are transparent to the programmer.
The programmer does not include instructions in the program to do a search
for a block copy in the cache, etc.; it's all done by the hardware.  So
if the programmer, for example, includes an instruction

\begin{verbatim}
LOAD R2,X
\end{verbatim}

in which a memory read is done, the programmer has no way of knowing whether 
this read request was done at the cache or at main memory.\footnote{Some CPUs,
such as the 486, do allow the programmer some very limited control of the
cache.  The programmer can tell the cache \underline{not} to cache a
certain block, for example; e.g. this would be useful if the programmer
knew that only one word from that block would be read, and would be read
only once, so that caching the block would actually be harmful.  

And even without such control, the programmer may wish to write the program
in such a way as to make the best use of the cache.  For example, suppose
we are writing a program which will do some matrix operation, requiring
a nested {\bf for} loop.  We may have a choice of having the outer loop
control matrix row and the inner one matrix column, or vice versa.  One
of these orders may produce a much higher cache hit rate than the other, so 
if we know about the details of the cache (block size, write policy, etc.),
we may wish to choose one over the other when we write the program.}

\subsection{Scale and Performance}

\subsubsection{Cache Size}

This can run from hundreds of bytes (e.g. in the early models of the
Motorola 68000 series, in the Macintosh line), to thousands of bytes
(Intel 486 built-in cache), to tens or hundreds of thousands (external
caches in many higher-end PCs, workstations).

Yet even hundreds of thousands of bytes comprise a small fraction of
main memory size.  In that light, how can caches do so well, with
hit rates well over 90\%?  The answer lies in the Principle of
Locality of Reference, which actually is two principles:

\begin{itemize} 
\item {\bf temporal locality}:  once an item is referenced, it is likely to be
referenced again a short time later

\item {\bf spatial locality}:  once an item is referenced, items near it in
memory are likely to be referenced too
\end{itemize}

Example:

\begin{verbatim}
for (i = 0; i < 1000; i++) 
   sum += x[i];
\end{verbatim}

The machine instructions compiled from this loop will each be executed
1,000 times, thus fetched 1,000 times; this illustrates temporal locality.

Also, x[i+1] will be referenced right after x[i], and will be stored
right after x[i] in memory; this illustrates spatial locality.

Because of these principles, the hit rate tends to be high, but for some
particular programs, or some particular parts of particular programs,
the hit rate could be low.

\subsubsection{Tradeoffs in Block Size}

Suppose we have, for instance, an on-chip cache, and have decided on
a fixed amount of ``real estate'' for it.  Given that, what block size
should we use?

If the block size is too small, then we would need lots of lines to
get a given amount of data in the cache.  Each line has overhead---the
tag, dirty bit, LRU records, etc.---and this overhead would then be
multiplied by the number of lines, with the result that we would not
be able to fit as much data into our allocated real estate.

Larger blocks mean fewer lines, and thus more cache misses for programs 
which jump around in memory in a regular-but-hitting-many-places kind of
pattern.

\subsection{Basis for Design Decisions}

Typically sets of various {\bf benchmark} programs are used to decide
factors such as cache size, block size search/replacement policy, and
so on.  A simulator for a given architecture is written, and then the
various benchmark programs will be run on the simulator under different
conditions, e.g. different block sizes, and runtimes (in simulated clock
cycles) will be recorded.  Then the decisions will be made
according performance/cost tradeoff criteria.

\section{Memory Interleaving}

The idea of memory {\bf interleaving} is to split main memory into a
number of physically separate components, called {\bf modules} or
{\bf banks}.  The goal is to deal with the slow memory problem by
making several memory requests simultaneously, each to different
modules.

\subsection{Types of Interleaving}

For concreteness, let us suppose that our machine has 32-bit addresses.
Note that this produces an address space of 4 gigabytes, or over 4 billion
bytes ($2 ^ {32}$, to be precise).  Let's suppose that we will have 16 memory
modules, called M0, M1, ..., and M15. 

Under {\bf high-order interleaving}, consecutive memory addresses will
(except at boundary points) be stored within the same physical module,
while under {\bf low-order interleaving}, consecutive memory addresses
will be in consecutive modules (with M0 being considered ``consecutive''
to M15).

With high-order interleaving, the i-th module would contain system
addresses $i 2 ^ {28}$ through $(i+1) 2 ^ {28} - 1$.  An address in this
range is distinguished by the higher 28 bits having the value i.

For low-order interleaving, the lower 4 bits determine the module number;
if the lower 4 bits evaluate to i, then this address is stored in module i.

\subsection{Uses of Interleaving}

\subsubsection{Vector Processors}

A {\bf vector processor} has this name because its instruction set includes
instructions which operate on vector quantities.  For example, such a CPU
may have both an ordinary scalar addition instruction ADD, in which say

\begin{verbatim}
ADD X,Y
\end{verbatim}

would mean to add the word Y to the word X in memory, \underline{and} also
a vector-addition counterpart VADD, in which say

\begin{verbatim}
VADD X,Y,N
\end{verbatim}

would mean to add the N-component vector starting at Y to the N-component
vector starting at X; in other words, the word at Y would be added to the 
word at X, the word at Y+1 would be added to the word at X+1,..., and the
word at Y+N-1 would be added to the word at X+N-1.

This kind of machine is very useful in fields like physics and mechanical
engineering, in which various operations are performed on huge matrices.
The most famous example is the Cray line of computers.

Obviously the VADD operation could be done using a loop, inside of which
there is an ADD instruction, and thus the VADD instruction isn't needed.
But the VADD can do the operation must faster, essentially because of
pipelining:\footnote{Plus of course, the time saved to having one only
instruction fetch and decode.}

First of all, the basic (scalar) addition operation is pipelined.  Addition
of two floating-point numbers is done in several stages:

\begin{itemize}
\item [(a)] separate the mantissa and exponent fields in each of the two
numbers\footnote{E.g. in the number 2.3 x $10 ^ {-12}$, the 2.3 is called
the {\bf mantissa} and the -12 the {\bf exponent}.  For a 32-bit word
size, typically 23 bits are devoted to the mantissa and 9 to the exponent.}
 
\item [(b)] if the two exponents are not equal, rearrange the mantissa
and exponent\footnote{E.g. 2.3 x $10 ^ {-12}$ could be rearranged to
0.23 x $10 ^ {-11}$.}

\item [(c)] add the two mantissas

\item [(d)] if the sum of the mantissas is longer than 23 bits, do another
rearrangement of this sum and the common exponent
\end{itemize}

Secondly, and of our main interest here, the memory can also be ``pipelined,''
by using lower-order interleaving.  For example, the bus can be of the
{\bf split-transaction} type, which means that several memory requests
can be outstanding at once.  

If a vector, or at least part of a vector, lies in consecutive memory
locations, this scheme will work quite well.  Suppose for instance that
we wish to read locations 32, 33, 34, etc. in memory.  The CPU sends out a 
read request for 32, then for 33 (without waiting for 32 to complete), then
for 34 (without waiting for 32 and 33 to complete), and so on.  They will 
return in the same order.

The point is that this way we will be doing several things at once.  The
memory fetch has become the fifth stage of the pipeline described above.
(And if the sum is stored back to memory, we would have another stage,
etc.)

Of course, this only works well when different components of the vector
are in different modules.  If the vector were a column in a 16-column
matrix, for instance, and we were using the C language (which uses
row-major order for two-dimensional arrays), all of the components of
the vector would be in the same module, and we would get no benefit. 

\subsubsection{Shared-Memory Multiprocessors}

Here a number of processors share a common memory, i.e. they are all running
the same program, whose global variables are accessible by all the processors.

In this case, memory can be a real bottleneck.  Interleaving---either
high-order or low-order---can be very helpful.  Usually at a given instant,
different processors are accessing different locations in memory, and with
the interleaving, this will often mean different modules.  This means less
queuing for memory by the processors.

On a bus-based system, for example,\footnote{The larger multiprocessor
systems use elaborate interconnection networks instead of buses.} we can
again use the split-transaction method (though in this case the memory
module must also return the processor number when it completes a request,
so that the processor which originated the request will know that this
data is for it).

\end{document}


