Computer Organization and Design: The Hardware/Software Interface
David Patterson and John Hennessy
Morgan Kaufman Publishers
Second Edition, 1998

Supplementary Explanations by Professor Norm Matloff

Last updated December 15, 1999

p.249:

Here is how to verify the 5-cycle claim made by the authors. First, the authors must be using fan-in larger than 2, because otherwise the time for C4 would be much more than 5. Upon closer inspection (trial and error, trying to match the authors' figure of 5 cycles for C4), one finds that they are always assuming as much fan-in as possible. The pi and gi are ready after Cycle 1. Then if we use AND gates with fan-in of 4, the Pi are ready after Cycle 2 and the Gi are ready after Cycle 3. Then using AND and OR gates of fan-in 5, C4 will be ready 2 cycles after the Gi, i.e. after Cycle 5.
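
For concreteness, here is the expression the circuit is computing (this is the standard two-level carry-lookahead form, written in the chapter's notation; compare it with the equations in the text):

C4 = G3 + (P3 G2) + (P3 P2 G1) + (P3 P2 P1 G0) + (P3 P2 P1 P0 c0)

Each Pi is an AND of four pi (fan-in 4), each Gi is an OR of four product terms built from the gi and pi of its group (fan-in at most 4), and C4 itself needs AND gates of fan-in up to 5 followed by an OR of fan-in 5, which is where "2 cycles after the Gi" comes from.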

p.540:

Throughout this chapter, keep in mind the meaning of the term word in the computer context. Suppose for example that we have a 32-bit CPU. That means that registers are 32 bits wide, and the ALU operates on 32-bit quantities. For example, the ALU will add together two 32-bit quantities, producing a 32-bit sum. Our memory is then made up of words too, of this same size.

Each word in the system has an "ID number" called an address. Most computers also give addresses to individual bytes within a word. This is needed for byte operations, as will be seen below.

Suppose again that word size is 32 bits, i.e. 4 bytes. The bytes in memory will be numbered 0, 1, 2, 3, 4, 5, ... The first word in memory, called word 0, will then consist of bytes 0-3; the second word is word 4, consisting of bytes 4-7; and so on. In other words, the address of a word is taken to be the address of the lowest-numbered byte in that word.

Most computers have both word and byte operations. On Intel-based machines, for instance, consider the instruction

MOV AX,[200]

The MOV ("move") operation copies one operand to the other. Here the word 200, consisting of bytes 200-203, would be copied to the AX register. By contrast,

MOVB AX,[200]

would copy byte 200 to AX.

Suppose the program had been compiled from C source code. Then the MOV instruction would have been generated if location 200 had contained a variable of type int, while the MOVB would be used for type char.

p.540ff:

The reader may wonder what the difference is between a register and a cache, especially since the latter may be part of the CPU, as a register is. Both are intended as fast-access storage, but the difference lies in the programmer's view of things.

The programmer "sees" a register, in the sense that he/she can explicitly access a register by specifying it within an instruction. By contrast, the programmer cannot access the cache via an instruction. (Some CPUs do allow some types of very limited cache access via instructions, but these are highly specialized.)

For instance, consider the Intel instruction

ADD AX,[200]

This adds the contents of location 200 in memory to the CPU register AX. The programmer has explicitly referred to AX in this instruction, but the programmer does not know whether memory location 200 has a copy in the cache; from the programmer's point of view, he/she is accessing memory.

Registers are definitely inside the CPU; a cache might be inside the CPU or might be just very near the CPU. Access to a register is much faster than access to a cache, even if the cache is inside the CPU. A cache has many more gates and thus a longer cumulative gate delay. Thus, even if the programmer had the capability of storing an item in the cache---again, keep in mind that he/she does not have this capability---it still would be better to store the item in a register.

Note that the C language offers the register storage class for this purpose. It is a request by the programmer that the compiler try to assign a register for storing this variable instead of storing it in memory. The compiler does not have to honor this request, and indeed might not be able to, since the number of registers in a CPU is typically much smaller than the number of variables we have in a program.
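
Here is a minimal example (the function and variable names are arbitrary, just for illustration):

/* Ask the compiler to keep the loop index in a CPU register.
   The compiler is free to ignore the request.                */
int sum_array(int *x, int n)
{
    register int i;
    int sum = 0;
    for (i = 0; i < n; i++)
        sum += x[i];
    return sum;
}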

p.546:

In systems with caches, memory is considered to be divided into chunks called blocks. The number of words per block is called the block size.

In the notation used in this book, blocks are numbered consecutively, 0, 1, 2, 3, ... Suppose for example the block size is 2 words and word size is 32 bits, i.e. 4 bytes. Then you should confirm for yourself that block 2 will consist of words 16 and 20.

You should also confirm something else: Suppose (just for example) that the address size is also 32 bits, and within an address let bit 0 be the least significant, bit 1 the next-least significant, and so on. Then confirm that:

- bits 1-0 of an address give the position of a byte within its word (so they are 0 in the address of a word);
- bit 2 tells you which of the block's two words is being addressed;
- the remaining bits, 31-3, give the block number.

You should think about the effects of changing these parameters. For instance, how do the above statements change if the block size is 4 words?

The authors refer to both a cell in the cache and a same-sized portion of memory as "blocks." However, it is more common, especially in industry, to speak of the first as a "line" and the second as a "block." So we say, for example, that block 9 in memory maps to line 1 in the cache. The word "line" stems from the fact that in pictorial form it looks like a row, i.e. a line, in a table, as in Fig. 7.10.

The expression near the top of p.546 would be better written as

Line number = (Block number) mod (Number of lines in the cache)

It is easiest to understand pp.546-548 by interpreting them as assuming one word per block. Note that this also assumes that consecutive words have consecutive addresses (i.e. no addressability of individual bytes within a word).

The mapping function used on p.546, in which the line number is defined to be the word address mod 8, is an example of the "hash function" concept you learned in ECS 110. Just as a hash function can be used to determine where to look for/put a record in a database, by plugging a key into the function, it is used here to decide where to look in the cache. Our hash function on p.546 is equal to the key mod 8, where our "key" is the address of the word we wish to access.

But just as there is no single hash function which would work for a database application, there is no single hash function which would work here in the cache setting. Any hash function would work as long as it has the property that for any cache line, some word address maps to it. (Otherwise some lines of the cache would never be used, and would go to waste.)

p.549:

Why are there no wires leading from the "Byte offset" part of the address in Figure 7.7? To understand this, let us first recall the example of p.546, especially the remark that our hash function must have the property that every cache line must get mapped to by at least one address.

The important point for this new cache, on p.549, is that now we are dealing with a byte-addressable machine. Each word consists of 4 bytes, and each byte has its own address. Word 200, for example, consists of bytes 200, 201, 202 and 203.

Our cache, though, does not deal with individual bytes. It handles only complete words. So, even if we had the Intel instruction

MOVB AX,[205]

which copies byte 205 to the AX register (note that this instruction is MOVB, which is a byte form of MOV), if byte 205 does not have a copy in the cache, then the cache will fetch the entire word 204 (bytes 204, 205, 206 and 207) from memory.

In other words, if we were to design this so that our "key" were the address, then the key would always be a multiple of 4. You should check that this would mean that only 256 lines in the cache would get mapped to, thus wasting 768 lines.

So, the designers of the cache here have decided to use the word address div 4---the word address without its last 2 bits---as the key.

Also, think of how the other subfields in the address were chosen. The designer wanted the cache to consist of a total of 1K words (dictated by how much space is available for the cache), with a line size of 1 word. That implies 1K lines, which in turn implies that the Index field in the address consists of log2(1K) = 10 bits. The remaining upper bits then form the Tag.
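
To make the field boundaries concrete, here is a minimal C sketch (mine, not the book's) of how the hardware would split a 32-bit byte address for this cache: a 2-bit byte offset, a 10-bit Index and a 20-bit Tag.

#include <stdio.h>

int main(void)
{
    unsigned int addr = 0x12a8;                   /* an arbitrary example address */
    unsigned int byte_offset = addr & 0x3;        /* bits 1-0                     */
    unsigned int index = (addr >> 2) & 0x3ff;     /* bits 11-2: 10 bits, 1K lines */
    unsigned int tag = addr >> 12;                /* bits 31-12                   */

    printf("tag = %u, index = %u, byte offset = %u\n", tag, index, byte_offset);
    return 0;
}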

p.557:

Note that the caches in Figures 7.8 and 7.10 are both the same size, in the sense that both contain 16K words, so we are devoting the same space to each. But the latter cache stores its words in a more concentrated manner, in clumps (4-word blocks), whereas the former can store widely-scattered individual words. The latter cache is designed this way because we anticipate that programs run on this machine will typically have a large degree of spatial locality, though it may not perform well on programs having a low degree of spatial locality.

Since the cache here will have 16K words with 4-word lines, that means that we will have 16K/4 = 4K lines. That in turn implies that the Index field in the address will consist of log2(4K) = 12 bits. Bits 1-0 specify byte position within word and thus are irrelevant to the cache (see above); bits 3-2 specify word position within a block/line. Thus for our Index bits we take the next 12 bits, 15-4. The remaining upper bits then form the Tag.
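
As a similar sketch for this cache (again the function and variable names are mine), the address would be split as follows:

/* Field extraction for the Fig. 7.10 cache: 4K lines, 4 words (16 bytes) per line. */
void split_address(unsigned int addr,
                   unsigned int *tag, unsigned int *index,
                   unsigned int *block_offset, unsigned int *byte_offset)
{
    *byte_offset  = addr & 0x3;           /* bits 1-0: byte within word      */
    *block_offset = (addr >> 2) & 0x3;    /* bits 3-2: word within the block */
    *index        = (addr >> 4) & 0xfff;  /* bits 15-4: 12 bits, 4K lines    */
    *tag          = addr >> 16;           /* bits 31-16                      */
}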

p.555:

Note that these data on cache misses come from running gcc and spice through a simulator of the DEC machine. A real cache does not tabulate data on misses for us, so we must resort to a simulator.

pp.560-561:

First, what about the words "latency" and "bandwidth"? These are very important words in the computer architecture and networks fields.

If I am reading a set of bits sent from some source (in this case memory), the time until I receive the first bit is called the latency, and after that point the number of bits sent per unit time is called the bandwidth. The latency is the time taken for a single bit to go from source to destination, while the bandwidth is the rate at which I can get bits to enter the data path.

Note that the bandwidth also depends on the width of the data path. If for example on each data line in a bus I can put 1 million bits per second, then a 16-line data bus gives a bandwidth of 16 million bits per second, while a 64-line data bus gives 64 million bits per second.

DRAMs have a long latency, but due to page-mode access (p.B-33) or similar modes, the bandwidth may be fairly good. For an analogy, imagine a very long but very high-pressure garden hose. If you turn on the faucet, it will be a long time before the first drop of water comes out of the nozzle (long latency), but once the water starts coming, it will come out at a high rate (high bandwidth).

(Review the memory-system example in our digital logic handout before reading further.)

On p.560, it says to assume a one-word-wide bank of DRAMs. By this the authors merely mean that the width of the data bus portion of the system bus is the same as the word size used by the CPU. Suppose that the word size is 16 bits (and there is no individual byte addressability within a word), and that we wish to have 8M words of system memory, comprised of DRAM chips which are 4Mx1 in size. Then you should check that we will need 32 chips, 16 lines in our data bus and 23 lines in our address bus. The authors apparently assume in picture (a) that we are using high-order interleaving.

Here is the reasoning behind the figure of 1 + 4 x 15 + 4 x 1 = 65 cycles to process a cache miss:

First, note that the four words in the requested memory block all have the upper 23-2 = 21 bits of their addresses in common. The cache will send those 21 bits to a memory controller between the bus and the DRAMs, taking one clock cycle.

At that point, the memory controller requests the four consecutive words of the block from the DRAMs, each request taking 15+1 cycles (15 cycles for delays within a DRAM, 1 cycle to get back to the cache via the data bus). Hence a total of 65 cycles.

So far, we have been discussing picture (a) on p.561. What about picture (b)? To use the water hose analogy, imagine that we have four faucets, and four garden hoses, all working at once. This would be like having a data bus of 64 lines instead of 16. Here we would still need 32 chips, with somewhat different connections to the bus lines. (You should think about what they would be.)

The memory controller now makes four simultaneous requests to the DRAMs. Thus the term 4 x 15 we had above now is only 1 x 15, since we now would pick up 4 words in just one operation, and the 4 x 1 becomes just 1. Thus our cache miss penalty would be 1 + 15 + 1 = 17 cycles.

Picture (c) involves a low-order interleaving arrangement, now using a 16-bit data bus again. The same address is sent to all four DRAMs, which will again result in reading the four consecutive words in a block simultaneously. The four chips all get their words at the same time, so the term 4 x 15 in (a) again reduces to 1 x 15. But due to the one-word width of the data bus, the four words must be sent to the cache in four separate (but consecutive) data cycles. So, the overall cache miss penalty is 1 + 1 x 15 + 4 x 1 = 20.
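
As a small sanity check on this arithmetic, here is a hedged C sketch; the function and parameter names are mine, chosen to match the assumptions above (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per bus transfer).

#include <stdio.h>

/* miss penalty = address cycles + sequential DRAM accesses * 15 + bus transfers * 1 */
static int penalty(int addr_cycles, int dram_accesses, int dram_cycles,
                   int bus_transfers, int bus_cycles)
{
    return addr_cycles + dram_accesses * dram_cycles + bus_transfers * bus_cycles;
}

int main(void)
{
    printf("(a) one-word bus, no interleaving: %d\n", penalty(1, 4, 15, 4, 1));  /* 65 */
    printf("(b) four-word-wide bus:            %d\n", penalty(1, 1, 15, 1, 1));  /* 17 */
    printf("(c) low-order interleaving:        %d\n", penalty(1, 1, 15, 4, 1));  /* 20 */
    return 0;
}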

pp.564-565:

This material is very poorly written, a shame since the examples are so good and relevant. It's best to proceed straight to the example at the bottom of p.565.

p.565ff. ("Calculating Cache Performance"):

The term "CPI" stands for "cycles per instruction," meaning the average number of clock cycles instructions need on a given machine. It is a measure of speed of a CPU (though many other aspects need to be examined in assessing CPU speed).

We run a program for a period of time (maybe to the end), and let I denote the count of the number of instructions executed. (The quantity I will cancel out in the end.)

"Loads and stores" refers to memory reads and writes. Thus 36% of gcc instructions access memory.

pp.568-569:

The problem here is that the direct-mapped approach to cache design is too restrictive. Consider the following C code, for instance:

int I, Sum = 0, X[1000];

for (I = 0; I < 1000; I++)
   Sum += X[I];

Suppose that the memory locations storing the variables Sum and I map to the same cache line. Then as the loop progresses, each of these two variables will repeatedly take turns evicting the other from the cache.

The other extreme is the fully-associative scheme, in which a memory block can have a copy in any line in the cache. This entails extra overhead (e.g. many more comparators), which takes up precious space in the cache and thus allows fewer lines, and also makes the cache slower.

Thus a compromise was developed, called set-associative. The lines are partitioned into sets (in Fig. 7.15, 2 lines per set). A memory block is mapped to a set, but it could have a copy in any of the lines within that set. The mapping is given by

set number = block number mod number of sets

Note carefully: A line can still consist of multiple words.

The figure is pretty good, except that the green parts are misleading, seeming to imply that the given block MUST be placed in the indicated line. This is true in the direct-mapped example, but not true for the set-associative one (the block could have a copy in either line 0 or line 1, not just the latter as shown) or the fully-associative one (the block could have a copy in any of the 8 lines).

(Note that pictorially a line no longer corresponds to a row in a table.)

p.574:

Figure 7.19 is a nice picture, but it needs to be explained more fully: The "vertical axis" is sets, while the "horizontal axis" is blocks/lines within sets. Block/line size is 1 word. The least significant two bits are a byte offset as on, say, p.557.

The remark in the caption about Output Enable can best be explained in terms of a RAM's Chip Select pin. Suppose we designed the data portion of our cache to consist of 4 SRAM chips. (For simplicity, suppose each chip is wide enough to hold a system word.) Setting it up using low-order interleaving, the i-th chip would contain all words whose addresses equal i mod 4. That would mean that one set, which is 4 words, is spread across the 4 chips. So, what the caption is saying is that we can feed the output of the i-th comparator in the figure into the CS pin of the i-th chip. All of the chips would be connected to a bus (a "private" bus, set up specifically for the cache). Then we would not need a MUX.

p.575:

There is no byte addressability here, i.e. words are numbered 0, 1, 2, ... instead of 0, 4, 8, ... Thus there is no byte offset. Also, the block size is 1 word, so there is no block offset either.

p.576:

In the replacement scheme described here for 2-way associative caches, each block/line within a set has a bit, called a reference bit, which will record "recent use," in the following way:

Both reference bits start out at 0. When a block/line is accessed, its reference bit is set to 1, and the reference bit for the other block/line in that set is cleared to 0. So, the block/line whose reference bit is 0 is the least-recently used one.

Suppose a memory block which is mapped to this set is requested but found not to be in the set (i.e. a miss). Then one of the two lines has to be replaced. Under the LRU scheme, the one to be replaced is the one whose reference bit is 0 (or either line if they are both 0).

Let's change this a little, designing things so that both bits start out at 1, and an access to a line clears the reference bit for that line and sets the reference bit for the other line. When a cache miss occurs for this set, the line whose reference bit is 1 is the one which will be replaced.

This new scheme is not different at all. It is merely a change of labels; in the first scheme 1 meant "accessed" and 0 meant "not accessed," and now we have merely reversed those labels. But the advantage of the new scheme is that it generalizes better for degrees of associativity greater than 2:

Give each line within a set an "age register," indicating how long it has been since this line was accessed (measured in overall accesses to that set). Whenever an access occurs, all age registers in that set are incremented by 1 (the lines have "aged" by 1 time unit), except for the register of the line which was accessed, which is cleared to 0 (meaning it was accessed 0 time ago). If, upon incrementing, some age register reaches the maximum value storable in that register (e.g. 3 for a 4-way associative cache), all age registers are cleared to 0 (or you could design it so that an age simply stays at the maximum value).
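
Here is a minimal C sketch of the age-register idea for one set (the names, and the choice to let ages saturate at the maximum rather than clearing everything, are mine):

#define WAYS 4
#define MAX_AGE (WAYS - 1)      /* largest value an age register can hold */

static int age[WAYS];           /* age[i] = accesses since line i was last used */

/* Call this whenever line `used` within the set is accessed. */
void touch(int used)
{
    for (int i = 0; i < WAYS; i++) {
        if (i == used)
            age[i] = 0;                  /* accessed 0 time units ago           */
        else if (age[i] < MAX_AGE)
            age[i]++;                    /* everyone else ages by 1, saturating */
    }
}

/* On a miss, replace the line with the largest age (approximately LRU). */
int victim(void)
{
    int v = 0;
    for (int i = 1; i < WAYS; i++)
        if (age[i] > age[v])
            v = i;
    return v;
}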

p.579ff.:

Motivations for virtual memory (VM):

- relocation: the ability to load a program anywhere in memory (possibly in several noncontiguous pieces), wherever space happens to be free;
- the ability to run a program even if it is too large to fit in physical memory, by keeping parts of it out on disk;
- protection: preventing one program from reading or writing another program's memory.

Virtual memory can provide all of these things.

The word "virtual" means "apparent." If, say, we have an int variable in our program named X and &X is 200, it will appear that X is stored in location 200, but it might be actually stored in (say) location 5000 instead. The VM hardware provides a translation from the virtual address (200) to the physical address (5000).

Thus the OS can load a program anywhere in memory, wherever there is space (including putting different parts of the program in noncontiguous spaces). This solves the relocation problem.

Furthermore, the VM hardware will keep track of whether a given part of a program is in memory or out on disk, and in the latter case, bring it in to memory from disk. This solves the problem of not having enough memory, all while being transparent to the programmer; it will appear that the entire program is in memory, even though it isn't.

Finally, for each chunk of a program, the VM hardware not only keeps track of where the chunk is, but also maintains access permissions, similar to Unix file permissions. If a program has, say, an errant pointer which causes it to try to access another program's memory without permission, the virtual memory hardware will catch this and abort the program.

Note that, in spite of the huge emphasis on CPU speed, adding VM actually slows things down. The VM hardware needs time to do the address translation, for example. But VM is so convenient that we tolerate the slowdown.

p.582:

Look at the first bulleted item here, which discusses the implications of relative speeds of various disk operations. To explain this, let's first review how a disk works.

A disk consists of a rotating platter having many concentric rings, called "tracks." There is also a read/write head, which moves along a bar radiating from the center of the disk. If we wish to read or write a section of the disk, we move the read/write head along the bar until it gets to the proper track---this is called a "seek"---and then wait for the section to rotate around to the head---this is called the "rotational delay"---and pass under the read/write head, when the read or write ("data transfer") occurs.

The seek time and rotational delay are very slow relative to CPU speeds, but the data transfer time is relatively fast once it gets started. (This should remind you of the "garden hose" analogy above.) The point the authors are making in the first bullet on p.582 is that the seek time is so long that, in order to make the whole operation worthwhile, we should grab a lot of data when we go to disk. Thus the page size should be set to be large.

Similarly, since a page fault has such a catastrophic effect on system speed---note that they say a page fault takes millions of cycles to process---the system designer should expend a lot of effort on making a really efficient policy for choosing which page to replace (like LRU for caches). Certainly page placement should be fully associative, to minimize the page fault rate.

p.583:

The authors say that "Each program has its own page table," but this is misleading for a couple of reasons:

First, the phrasing makes it sound like every time I run a given program, say gcc, then gcc will always have the same page table. That isn't true. Instead, when we run a program, the operating system must find places to load it which are currently unused; these will vary from one run of a program to another. (Note that I say "places" instead of "a place" here because the pages of the program could be loaded in lots of places scattered around the memory.)

Second, the authors' phrasing does not take into account the distinction between programs and processes. For example, if three users are currently running gcc on a certain machine, that is only one program but it is three separate processes, each of which will have its own different page table. (The page tables will at least be different with regard to gcc's data sections, though they might all share the same pages for gcc's code.)

p.584:

Note that the Page Table Register is in the CPU, with all the other registers.

Note that the virtual and physical addresses in this picture are byte addresses. If a whole word is being read from or written to (the usual case), then the address given will be that of the lowest-numbered byte within that word, as was the case for caches. (Review the material concerning p.540 above.)

Note too that the picture is somewhat of an oversimplification, since it implicitly assumes that all pages of the program are resident in physical memory, when in fact some are probably out on disk.

p.586:

Here is a complete (though simple) example:

Suppose our C source file for a program contains the statements

int x;

x += 3;

The compiler might assign x to location 1200, and produce the instructions

MOV CX,[1200]
ADD CX,3
MOV [1200],CX

and might assign the first of those instructions to location 3024. Both the 1200 and the 3024 are virtual addresses. (Recall that we don't want the compiler writer to be burdened with having to know where in memory the instructions and data will actually be.)

Say that the item at virtual address 3024 is actually stored at physical address 9024. Suppose also that the page offset is given by the last two digits, in this case 24; then virtual page number 30 corresponds to physical page 90.

What actions will occur?

Suppose the page table register points to 1000, i.e. this process' page table begins there. (This will be a physical address.) Assume 4-byte words, so that the entries to the page table are at 1000, 1004, 1008, and so on. Suppose also that the page-fault handler in the OS is at 22000 (again a physical address). Suppose there is no TLB (a cache used expressly for the page table).

First, the instruction fetch: The CPU's PC register will contain 3024. The virtual memory (VM) hardware will separate out the page number, 30, and look at entry 30 in the page table. The latter will be at location 1000 + 4*30 = 1120. Thus the hardware will put 1120 on our address bus and do a memory read.

Memory will return the requested word, which is the page table entry for virtual page 30. That entry will consist of a Valid bit and another number. Virtual page 30 is currently in memory, so the Valid bit will be 1, and the rest of the entry will be 90. So now the hardware knows that the instruction is actually at 9024, not 3024, and it will place 9024 on the address bus and fetch the instruction.

Next the CPU notices that the source operand of the instruction is memory location 1200. That is virtual page 12. The hardware looks at entry 12 in the page table as above. Suppose the Valid bit turns out to be 0. Then virtual page 12 is not currently resident in memory.

The hardware causes an internal interrupt, placing 22000 in the PC to run the OS. The OS looks at the rest of the page table entry for virtual page 12 to see where on disk this page is. The OS decides which currently-resident page to evict to make room for virtual page 12, say virtual page 509. If the Dirty bit for that page is set, the OS must write the entire page back to disk.

Then the OS reads virtual page 12 from disk, and updates TWO entries in the page table---the one for virtual page 12 (its Valid bit will now be 1, and the rest of the entry will be the old physical page for virtual page 509), AND the entry for virtual page 509 (whose Valid bit will now be 0, and the rest of the entry will now be the disk location for this page).
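
Here is a minimal C sketch of the translation step in this example. It follows the example's decimal convention (page offset = the last two decimal digits); the structure and function names are mine, not the book's, and the page table is represented as an array standing in for the words starting at physical address 1000.

#include <stdio.h>

struct pte {                /* one page-table entry                              */
    int valid;              /* 1 if the page is currently in memory              */
    int frame_or_disk;      /* physical page number if valid, else disk location */
};

static struct pte page_table[100];

/* Translate a virtual address; return 0 on a page fault. */
int translate(int vaddr, int *paddr)
{
    int vpage  = vaddr / 100;      /* e.g. 3024 -> virtual page 30 */
    int offset = vaddr % 100;      /* e.g. 3024 -> offset 24       */

    if (!page_table[vpage].valid)
        return 0;                  /* page fault: the OS must bring the page in */

    *paddr = page_table[vpage].frame_or_disk * 100 + offset;   /* 90*100 + 24 = 9024 */
    return 1;
}

int main(void)
{
    int paddr;
    page_table[30].valid = 1;
    page_table[30].frame_or_disk = 90;
    if (translate(3024, &paddr))
        printf("virtual 3024 -> physical %d\n", paddr);
    return 0;
}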

p.587:

The Reference Bits work just like those described earlier for caches. The question, though, is whether these are provided by the hardware; on some machines they are not, and the OS must somehow simulate them, as follows.

Recall that processes take turns, each running for a short time then relinquishing its turn. The OS runs in between turns, saving the registers of the process whose turn just ended, choosing the next process to run, and then restoring the registers of that new process (including the PC register, which makes the new process run); this activity is called a "context switch."

Every n-th context switch, the OS will choose a cluster of pages of main memory (this is done in a rotating fashion, so that eventually the OS looks at all pages in each cycle around the memory), and clear all their Valid bits in the page table. Actually, those pages are valid, but the OS deliberately marks them as invalid. When such a page is referenced by a running process, this causes a page fault, so the hardware transfers control to the OS as usual, but the OS remembers that this was a special page that really is valid. So the OS now sets the Valid bit for that page.

The OS then treats the Valid Bits as Reference Bits! (Here the convention is 1 means recently used, 0 means not.) Of course they are not really Reference Bits, but they do serve as reasonable approximations to them. To see this, let us suppose for simplicity that the OS cleared all of the Valid Bits to 0, and then think about the situation which would occur, say, four memory accesses later (say all four accesses are to different pages). Then four of the Valid Bits will be 1s, and indeed these are the four most-recently accessed pages, whereas the ones with Valid Bits equal to 0 have not been accessed for a while.

p.590:

Note that the authors say that a tag in the TLB consists of only "a portion of" the virtual page number. They are referring here to the degree of associativity. If for example the TLB is fully associative, then the tag will consist of the entire virtual page number. If on the other hand the TLB is direct-mapped and consists of 2m lines, then the lower m bits of the virtual page number will be used as the index into the TLB, with the remaining upper bits being the tag.
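
As a tiny sketch of the direct-mapped case (the function names are mine): for a TLB with 2^m entries,

/* Split a virtual page number into a TLB index (low m bits) and a tag. */
unsigned int tlb_index(unsigned int vpn, unsigned int m)
{
    return vpn & ((1u << m) - 1);    /* low m bits           */
}

unsigned int tlb_tag(unsigned int vpn, unsigned int m)
{
    return vpn >> m;                 /* remaining upper bits */
}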

p.591:

The TLB is in (or very near) the CPU. The page table not only has its entries pointing to physical memory; the table itself is also stored in physical memory.

pp.640-641:

Here again they are referring to latency vs. bandwidth issues. For example, on p.640, Item 1 refers to bandwidth while Item 2 is basically about latency. (The latter might be better phrased in terms of how many I/O operations we can initiate per unit time.)

Let's illustrate this with the idea of "RAID"---Redundant Arrays of Inexpensive Disks. RAID systems come in various degrees of data replication.

On the one extreme, we can have "mirrored" disks, in which all disks contain exactly the same data. In this case the latency for reads can be very good, because we can assign a given read request to whichever disk would have the shortest seek time. (But since a write must be performed on all disks, the write latency is worse for a mirrored system than for a single-disk system.)

On the other extreme, we can have fully "striped" RAID systems. Suppose for instance we have 16 disks. Then for any given logical sector of a file, we can store 1/16 of the sector on each disk, and read/write all the pieces simultaneously.

p.642:

These various I/O benchmarks are very important in assessing the speed of an I/O system. They are featured heavily in the sales advertising of large I/O systems.

The term "supercomputer" generally refers to "vector" computers, which as the name implies, include array operations in their basic instruction set. In other words, such a computer would have not only an ADD instruction to add two integers but also a VADD instruction to add two arrays of integers. Of course, even an ordinary computer can add two arrays together by putting the ADD instruction inside a loop, but with VADD we can do it faster. (Faster because VADD would have only one instruction fetch, it would be heavily pipelined and so on.)

The speedy array computation capability is useful in scientific applications, e.g. large computations involving linear algebra or numerical differential equations. This leads to large I/O needs for the reasons given in the text. If for instance we have a matrix with 1,000 rows and 1,000 columns then to record a snapshot at a given point in the computation would mean writing 4MB of data to disk.

The term "transaction processing" refers to applications like those performed by banks, billing departments of major companies, and so on. For example, each check processed by a bank would be considered one transaction. Since the amount of data associated with processing one check is rather small, latency is a much bigger factor than bandwidth in purchasing or designing such a sytem.

p.647:

As the text points out, the average seek time plays a big role in marketing a disk drive, but it is based on assumptions and thus can be very misleading.

The computation of "average" seek time assumes (a) all tracks are equally likely to be requested and (b) consecutive requests are statistically independent. Suppose for example we have a 100-track disk and that it takes 1 ms to go from one track to the next one. Then the average seek time would be computed as follows. Say the one request is to track X and the next request is to track Y, so that the seek time is |X-Y| * 1 ms. The probability of any particular (X,Y) pair, say X = 34 and Y = 89, is 0.01 * 0.01 = 0.0001. Then the average seek time will be

\sum_{i=1}^{100} \sum_{j=1}^{100} 0.0001 |i-j| ms

If you compute the sum, you find that it turns out to be about 33, so the average seek time would be about 33 ms.
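
Here is a quick C check of that double sum (a sketch of mine, not from the book):

#include <stdio.h>

int main(void)
{
    double avg = 0.0;
    for (int i = 1; i <= 100; i++)
        for (int j = 1; j <= 100; j++)
            avg += 0.0001 * (i > j ? i - j : j - i);   /* 0.0001 * |i-j| */
    printf("average seek time = %.2f ms\n", avg);      /* about 33 ms    */
    return 0;
}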

But as mentioned in the book, due to locality of reference, assumption (b) above does not hold; X and Y may not be independent. Locality of reference would give X and Y a fairly strong positive correlation. (On the other hand, most of the locality of reference might be caught by the disk cache, in which case the independence assumption might be pretty accurate after all.)

p.661:

The key point to understanding the figure is that it assumes that there are not separate data and address buses; the data bus is used for both address and data.

Note that although this figure assumes that an I/O device is communicating with memory (through Direct Memory Access, as discussed on p.682), you can understand the figure just as well by thinking of a CPU communicating with memory.

The "split" (hexagonal-shaped) lines in the portion of the figure representing the data bus symbolize 1s and 0s being sent.

pp.665-666:

Note that the authors' use of the term "latency" here is not consistent with their earlier usage or generally accepted usage of the term.

Access time (for a read) means the duration measured from the time an address becomes valid on a memory chip's address pins to the time the chip places a valid response on its data pins.

Characteristic 4 ("A memory access time for the first four...") is referring to the block transfer capability of this memory system. In this mode, we send the memory the address of the first word of the block we want to read, and then memory gives us the rest. The access time for reading 4 words is 200 ns, 220 ns for 8 words, 240 ns for 12 words, etc.

In steps 3 to 4 on p.666, recall that due to block transfer mode, the 4-word read will take only 20 ns, i.e. 4 cycles.

The "split-transaction" approach described in the Elaboration is very useful for shared-memory multiprocessor systems, in which many CPUs accessed shared memory via one shared bus. The bus is the main bottleneck in such a system, so good utilization of the bus is crucial. The figure of 272 cycles comes from 16*(1+4*(2+2)).

p.667ff:

Each I/O device (keyboard, mouse, etc.) in a computer system has one or more "port" numbers. In a system which uses "memory mapped I/O," these ports appear to the CPU as ordinary memory (which they aren't), which allows the CPU to communicate with them using ordinary MOV instructions.

Say our keyboard data port is number 50. To read a character from the keyboard and place it in a memory buffer at 1000, we might write the assembly code

MOV R2,[50]
MOV [1000],R2

where R2 is a register in the CPU.

Machines which have non-memory-mapped I/O have special instructions like IN and OUT. The above code might look like

IN R2,[50]
MOV [1000],R2

I/O is done using one of the following three methods:

- polling, in which the CPU repeatedly checks the device's status until the device is ready;
- interrupt-driven I/O, in which the device interrupts the CPU when it needs attention;
- DMA (direct memory access), in which a DMA controller moves the data between the device and memory without involving the CPU.

Polling is the simplest but the least efficient of the three methods. We are tying up the CPU in wasted wait time. For a timeshare OS, the last two options are the best; when one process is "sleeping" while waiting for an I/O operation to complete, the OS runs some other process in the meantime.

DMA is best for large, high-bandwidth data transfers. As the book points out, we would like for the CPU to not be involved, because we want to allow the CPU to work on something else during I/O. Moreover, note that in the other two methods we have two bus transactions for each I/O action, compared to just one with DMA.

p.669:

Bus arbiters may consist of special chips, such as the Intel 8259A chip used in PCs.

p.671:

The term "...just a level" means "just a 1." In other words, we watch for the 0-to-1 transition, not a level of 1. A lower-priority device might currently be using the bus, with a 1 on the grant line, when suddenly a higher-priority device wants to use the bus. If it were to merely check for a 1 on the grant line, it would mistakenly usurp the other device's possession of the bus.

p.672:

The SCSI controller, in transferring data from the I/O bus to the processor/memory bus, temporarily becomes a bus master on the latter.

p.673ff:

Note the crucial role that interrupts (hardware!) play in making an OS work:

Recall that application programs do not ordinarily do their own I/O; they ask the OS to do it for them. If for example our program calls printf(), that in turn calls write(), a function in the OS, which does the actual output operation for us.

In a timeshare OS, we want to make sure that application programs cannot do I/O directly. How can we accomplish this?

Recall that one of the uses of virtual memory is to protect a process' memory from accidentally being written to by another process. If our system uses memory-mapped I/O, then the OS simply sets things up so that the "memory" occupied by a port is not accessible to any non-OS program.

In a non-memory-mapped I/O machine, we can design the CPU to work in two modes, user mode and system (OS) mode. A 1-bit register in the CPU indicates which mode we are in. We can design the CPU so that it is only possible to execute IN and OUT in system mode. We can design the OS so that system calls like write() are implemented as TRAP instructions. These are internal interrupts, acting mostly like ordinary CALL instructions except that they also change the mode to system mode. Then after the system call, we can return to the calling program using IRET (return from interrupt) instead of an ordinary RET; the former resets the mode register to user mode.