Supplement to Dandamudi Text
By Norm Matloff, UC Davis

Important note: In order to represent the NOT operation in this ASCII file, a "prime" symbol ' will be used in lieu of an overbar.

p.45:

The notion of complete sets is of more than just theoretical interest. The completeness of {NAND}, for instance, means that for some simple circuits, a single 7400 chip (p.50) can be used to implement the entire circuit. This is noted later, e.g. p.75.

p.50:

Note that building circuits from these 74xx chips is feasible only for small circuits, and these chips are infeasible even some small circuits, as we often need to put a circuit on just one chip (in order to save space in the application, say a digital camera, or to save manufacturing cost). Alternatives include for example gate arrays, to be covered in Chapter 3.

p.57:

Each term in a sum-of-products expression is called a minterm. If the expression is in n variables (n = 3 in the examples on this page), then each minterm is a product of n factors.

Here is a bit more detail on how the product-of-sums form is derived. Look at the majority-function truth table on p.51. Based on the rows in which F1 = 0, we see that

```F1' = A'B'C' + A'B'C + A'BC' + AB'C'
```

Then, using the involution and de Morgan properties, we have

```F1 = (F1')'
= (A'B'C' + A'B'C + A'BC' + AB'C')'
= (A'B'C')'(A'B'C)'(A'BC')'(AB'C')'
= (A+B+C)(A+B+C')(A+B'+C)(A'BC)
```

which is what is shown for the majority function at the bottom of p.57.

p.62:

It's important to understand WHY the Karnaugh method works. Toward that end, let's look at the 2-cell oval labeled "AB" in Fig. 2.16(a). The 1 in the left cell represents that the minterm ABC, and the 1 in the right cell represents ABC'. Back on p.60, the sum corresponding to those minterms was reduced by algebraic manipulation:

```ABC+ABC' = AB(C+C') = AB
```

The reason that a Gray Code was used in the Karnaugh map (00, 01, 11, 10 for the horizontal labels in Fig. 2.16(a)) was so that these two combinable terms, ABC and ABC', would be adjacent to each other in the Karnaugh map. This is why we can combine two adjacent 1s. All the K-map is doing for us is making it convenient to find combinable terms.

After combining, it may be the case that some of the results of combining can then themselves be combined. For example, look at Fig. 2.19(a). First, the two 01/01,11 cells can be combined, corresponding to

```A'BC'D + A'BCD = A'BD
```

ABC'D + ABCD = ABD

But then, if we were doing it algebraically, we would do

```A'BD + ABD = BD
```

and in the K-map this corresponds to grouping that entire 2x2 square in the middle of the map.

This now makes it clear why K-map groupings must be of size 2k for some k. We get groupings of size 2 by combining two minterms; we then get groups of size 4 by combining two of the combinations of size 2; then get groups of size 8 by combining two of the combinations of size 4; etc.

p.74:

The author should have mentioned here that we may be able to exploit commonality in multiple-output situations like this. In the example here, note that AB must be computed in both F1 and F2. We could then split the wire coming out of that AND gate, for use in both functions.

p.75:

Near the beginning of Sec. 2.10.1, the author notes that (A'B)' can be implemented using two NAND gates. In case you don't see why, here is why: By definition, a NAND gate takes inputs x and y and outputs (xy)'. So, by setting x = A' and y = B, we see that a NAND gate will computer (A'B')' from inputs A' and B. Then the only remaining question is how to produce A' from A using a NAND gate. But that was explained in Fig. 2.3.

p.85:

Look at Fig. 3.3. We know that the function takes on the values 0,0,0,1,0,1,1,1 when (A,B,C) = (0,0,0), (0,0,1), (0,1,0), ..., (1,1,1), respectively. So, for example, when (A,B,C) = (0,1,0), we want to select the value 0 for F1. That fits right in to what a MUX does--select. So, we feed (A,B,C) into the control pins of the MUX, and feed 0,0,0,1,0,1,1,1 into the data pins.

p.86:

The discussion on the E' input to the 74151 is confusing. Basically E' is acting like a chip select. Set E' = 0 when you wish to use to use this chip. Setting E = 1 would just make the output O = 0 no matter what the Ij are.

To understand Sec. 3.2.2, review the material on p.85 first. Here on p.86, we will do something similar, except that (a) there will be one fewer variable used in the control pins of the MUX, and (b) the data inputs to the MUX will now consist of that deleted variable and its negation, and 0 and 1. This is exemplified in the MUX part of Fig. 3.5. The control inputs are now just A and B, and C has been deleted. The data inputs are now C, C', 0 and 1.

You can see how this works by examining the two truth tables in that Fig. 3.6. In the original truth table, whenever (A,B) = (0,1), we find that the function is the opposite of C, i.e. C'. So, in the new truth table, the entry for A = 0, B = 1 is C'.

p.89

The 74138 chip discussed here is the same one you have been using in Chipmunk. It is easiest to understand this chip first as a 3-to-8 decoder, and then later as a demultiplexer.

Used as a decoder, the pins I0, I1 and I2 take the three-bit data input, and one of the output eight pins O0', O1',...,O7' is 0 and the others are all 1s. (Recall that the output is active-low.)

The one confusing part is that this chip has not just one, but three (3) enable pins. The reason for this is to facilitate using several of these chips in concert in a larger system.

For example, here is how we could construct a 4-to-16 decoder using two '138s, which I will refer to as A and B. We would tie A's E2 pin to 1, and B's E0' pin to 0. The E0' pin in A and the E1' pin in B would be tied to a common line which would serve as the enable (active-low) for our 4-to-16 decoder package. Finally, and most importantly, our fourth input to that package, I3, would be fed into the E1' pin in A and the E2 pin in B. This last sentence means that if I3 is 0, we use chip A, while if I3 is 1 we use chip B. And we would (in our minds) relabel the output pins of B to O8', O9',...,O15'.

p.93:

Here is an example of the use of a priority encoder. An interrupt controller chip might have one Interrupt Request (IRQ) pin for each I/O device in the system. and then tell the CPU which one of several requesting devices should be granted service.

p.96:

Note that in computing overall gate delay, what matters is the number of stages, not the number of gates. There are three stages in Fig. 3.18b, even though there are five chips. For example, the two ANDs (and the second XOR) operate in parallel, and thus only contribute one gate delay, not two.

p.97:

Note that the key to the speed of a carry lookahead adder is the use of AND gates with fan-in greater than 2.

p.99:

Note that the AND array produces the minterms, and the OR array ORs them together.

p.114:

Another misuse of the word finite, very common in engineering.

p.117:

This is a little misleading. What is described here is a master-slave J-K flip-flop. The author makes it sound like any J-K flip-flop is master-slave, which is not true.

The real definition of J-K is that it operates like S-R (J=S, K=R), except that the combination J=K=1 is allowed. The latter situation makes Q toggle, i.e. switch from 1 to 0 or vice versa.

A master-slave flip-flop (whether J-K or not) has the structure in Fig. 4.10a. The key point is that the master changes only when the clock is on, and the slave changes only when the clock is off.

The point of this is that it avoids feedback problems. For example, consider the Intel instruction

```addl %eax, %ebx
```

This adds the contents of the EAX register to the EBX register, so EBX is both a source and destination. Say there is 1 in EAX and 8 in EBX. The new value in EBX should be 9. But if we let the inputs to EBX open too long, that 9 will flow out onto the bus, get added to 1, producing 10, and then the 10 will enter EBX, a disaster.

p.120:

Note how important it is to have edge triggering or master/slave. Otheriwse, for example, the bit originally in the leftmost J-K could be copied not only to its immediate neighbor on the right, but also to the J-K two positions to its right, etc.

p.121:

Dandamudi mentions that the flip-flops, e.g. in Fig. 4.12, would have to be edge-triggered (or master-slave). Make sure you understand why. For example, name the four J-Ks in that figure D, C, B and A. When the clock pulse comes in, what's in D will now go into C, which is what we want. But if we're not careful in the design, the new value in C will be copied to B. Again, to avoid this, we need to make sure that each flip-flop will accept data only during a very narrow window of time.

pp.128-129:

There's a lot of stuff here, much of it superfluous, so it's confusing, but here is the essence:

Our goal is to simplify the circuit in Fig. 4.15. The end results turns out to be Fig. 4.19. The latter is indeed a simplification of the former; two ANDs have been reduced to one (and only a two-input AND, at that). We will achieve this simplification via the application of Karnaugh maps.

Note (in Fig. 4.19) that Dandamudi has named the stored values (i.e. the Qs) in the three bits C, B and A. Note also that there are three Js and three Ks, which he names JC, JB and so on in his charts on pp.128-129.

The central point is that, for any current (A,B,C) pattern, the Js and Ks must take on certain values in order that the new (A,B,C) is what it should be. In that sense, the Js and Ks are functions of A, B and C. That leads to truth tables which we simplify using Karnaugh (or other similar method).

For example, look at the row in Table 4.4 in which the present value of (A,B,C) is 011, i.e. the ripple counter has a present count of 3. The next clock pulse makes the count 4, i.e. 100 (as shown in the "Next state" column). The question is, what values do the Js and Ks need to have in order to make that transition to 100 happen?

Let's answer that last question for the A bit. This bit should change to 0, so the question is, what should JA and KA be to make this happen? Well, recall that in a J-K flip-flop, setting K = 1 will make the bit go to 0 when the clock pulse comes. But on the other hand, this transition of a from 1 to 0 is also a toggle, and we know that setting J = K = 1 will make that happen. So, we definitely need KA to be 1, and it won't matter what JA is. Accordingly, we see (1,d) in the (JA,KA) column for this row. The other entries are computed similarly.

So, we have six functions of A, B and C, shown in the last six columns of Table 4.4. In order to simplify them, Dandamudi has drawn the six Karnaugh maps on p.129. In the map for JA, he has found that a minimal set is JA = BC. That is then reflected in the AND feeding into JA and KA in Fig. 4.19. Note that he has exploited the "don't care" entries in that first map on p.129, treating one of them as a 1 and the rest as 0s.

pp.669-671:

Of course, we do not concern ourselves with the electronics here. The essence of having open collector outputs at chip pins is that it enables wired-AND, as shown in Fig.16.3a. By connecting the wires together from pins 2, 4 and 6, we get the AND of the three pins without actually using an AND gate. (We save a gate, but we do pay a price, in that wired-AND tends to have slower performance.)

So, even though the 7405 chip consists solely of NOTs, and thus would not seem able to give us NOR, wired-AND does allow us to use this chip to produce NOR, because NOR(I0,I1,I2) = (I0+I1+I2)' = I0'I1'I2', which we get from three NOTs (from the 7405) and one (wired) AND.

Here is how pairing the two 74170s would work. Consider the most significant bit (MSB) of a 4-bit word to be read from this 8x4 system. The bus will have 4 data lines, and the 4-bit word to be read will be copied to those lines, so for instance the MSB of the word to be read will be copied to the MSB line in the bus.

There would be 3 address lines in the bus, to indicate which of the 8 4-bit words. The address has an MSB too, of course, and that would determine which of the two 74170s is selected, as follows. The MSB address line would connect to the RE' pins of the two 74170s, one with a NOT gate in between and the other without such a gate. That means only one of the two 74170s would be read-enabled at a given time. The one not currently read-enabled would have its output in the Z, i.e. disconnected, state, as seen in Fig. 16.4d. So, we could actually connect wires from the MSBs of the two 74170s in a wired-AND manner; the one in Z state would be disconnected, so the other one would be copied to the MSB data line in the bus, as desired.

The point on signal drivers is that an output device like a 7-segment display may have different voltage requirements than the chip does. This is no problem with open-collector chips, for electronic reasons which don't concern us.

p.672:

By tri-state inverters the author means an ordinary tri-state buffer which has a NOT operation on its output.

p.673:

The 74373 is an 8-bit register (non-clocked) with built-in tri-state devices to control output.

p.674:

The first difference between Fig. 16.8 and Fig. 16.1 is that data inputs and outputs come through the same pins (D2, D1, D0). The use of tri-state devices ensure that we won't do input and output at the same time. The second difference is the addition of the CS' pin, to allow selection among several such chips.

p.677:

In Fig. 16.10, the top row is address 0 (A1 = 0), the bottom row address 1 (A1 = 1). Within each row, the left chip holds the high byte of the 16-bit word, and the right chip holds the low byte. Note the bubbles on the outputs of the decoder.

p.680:

Our system memory consists of 64Mx4 = 256M bytes, and the machine (like most of them) is assumed to be byte-addressable, i.e. individual bytes within a word have separate addresses. Since 256M = 228, we see that we need 28 system addresses lines, i.e. A0-A27.

This is a high-order interleaved system, so that two high bits of the address, i.e. A26-A27, will determine in which of the four rows of chips the given address is stored.

Now that we have selected the row, we must select which one of the 16M words we want within that row. Since 16M = 224 we need to devote 24 of our address bits to specifying which of those words we want. Address lines A2-A25 will be used for this purpose.

Once we get a word from a row of chips, the CPU will chose which of the four bytes of that word we want to access. That will be done using A0-A1.

p.683:

Up to this point, we have been concerned only with how to build a system of the desired size, which in the case on p.680 was 256M bytes. But from this point onward (not just in this chapter but also for instance in the cache chapter), we will be concerned with how to make the memory system fast. The memory is the weak link in our quest to make fast CPUs. No matter how fast the CPU is, it is useless if it must be delayed by waiting for memory. That delay comes from two sources (and more in some situations, as we will see later):

• It takes a signal some amount of time to travel from the CPU to memory and back. Even though the signal is traveling extremely fast, about 2/3 the speed of light, things in the CPU are happening so fast that the signal propagation delay does matter.
• Inside the CPU, the wires can be extremely thin, because there is no danger of them buckling. But the wires between the CPU and memory cannot be made so delicate, and are much thicker. In a thick wire, it takes longer to change a current level from 0 to 1 or vice versa.

On this page, we see why one contiguous set of four bytes might take longer to access than another.

I believe that the Pentium is actually configurable, so that it will perform an internal interrupt for certain unaligned data accesses. This causes a bus error in Unix systems.

pp.686:

In what Dandamudi calls "synchronized access," the four addresses share the same upper bits (all but the lowest two bits), so the addresses are consecutive. Those shared address bits come from the address bus as shown. Note that this means that if the CPU gives the memory four consecutive addresses but the first is not in the first module, then the set of addresses will "wrap around," and we won't be able to do the accesses simultaneously. This is like the alignment problem discussed above.

p.687:

In Dandamudi's "independent access," the four addresses presented to the memory could be nonconsecutive. It would be a little slower, in that the four addresses would have to be sent one at a time along the address bus.

p.694:

In the figure, note that the disk cache is typically software, i.e. in the OS.

The example of why memory is the weakest link is very good here. Spend extra time to understand it.

p.695:

Using SRAMs for the cache makes sense. SRAMs are faster than DRAMs, and in a cache speed is of the essence. The extra cost of SRAMs is not a problem, since caches are small.

Note that a DMA device is a special processor that handles I/O. The author makes an important point here about freeing up the system bus for DMA while the CPU relies on the cache, obviating the need to use the bus. Note, though, that that means we must arrange the OS so that the DMA will never be reading from or writing to a region of memory that is currently cached. One way to do this is to make DMA regions noncacheable; many CPUs, including the Pentium, do allow the programmer to declare a memory region noncacheable. This is brought up later, on p.723.

Pay special attention to the fact that cache size is only a very tiny percentage of memory size. Then think about how remarkable it is that cache miss rates are so low, due to locality of reference. Here are some measurements on a DEC 3100, for two popular programs:

 program instr. miss rate data miss rate overall miss rate gcc 6.1% 2.1% 5.4% spice 1.2% 1.3% 1.2%

(The data were collected on a simulator. Do you see why direct data collection would be impossible?)

p.698:

The notion of predictive loading mentioned by the author here does not make sense as it stands. Here is some elaboration:

When a cache miss occurs, without any special hardware the words in the incoming block won't come in any faster than if we got them one at a time. But if we have a low-order interleaved memory system, we can do a block transfer faster than if we read the words one at a time. Or, we can use page mode DRAM chips for our system memory, which essentially does low-order interleaving within the chip.

Some CPUs include instructions to do predictive loading independent of cache operations. In other words, the programmer can put an instruction in the program saying, "Please start fetching X, as I will need it a few instructions from now." By prefetching X while doing other work, we can mask the slow latency of the memory.

p.700:

Here in Sec. 17.5, we are addressing the first of three Big Questions concerning caches: When the CPU searches the cache for a certain memory item, where in the cache should it look? Note that "where" means "in which cache lines."

There will be three answers to this question. In the design discussed in Sec. 17.5.1, the answer will that for any given memory item, the CPU will look in only one cache line for that item. That particular line will be determined by applying a certain mod operation to the address of the item the CPU is looking for. On the other extreme, presented in Sec. 17.5.2, a given item could be anywhere in the cache, so the CPU must search everywhere. Finally, Sec. 17.5.3 shows the compromise design, in which a given item could be in any of a specified set of cache lines; that set is also determined by a mod operation.

p.703:

Spend extra time here, making sure you understand how the various bit fields work, and understand that they make sense. Here are some points that may help you in this understanding process.

Note that the upper t+c bits of the address form the block number, while the lower b bits form the byte number within block. We are determining the cache line number by taking the block number mod C, and that mod value is found by taking the lower c = log2C bits of the block number. That is the lower c bits of the t+c bits that form the block number, so it all works out.

In searching for an item in the cache, we compare the tag of the entry in the cache line with the upper t bits of the address of the item we are searching for.

p.709:

The key point is that the CPU needs to search all the cache lines simultaneously, so we need 2c comparators. If it didn't have to be simultaneous, we could feed all the tags into a MUX, and then select different tags using the MUX at different clock cycles, but this would be far too slow.

You can also see why a fully associative cache scheme is out of the question except for small caches. The hardware overhead would be far too great.

p.710:

If we have a k-way associative cache, we only need k comparators. All the sets of tags would feed into a MUX. The MUX would select one set (using the s set bits of the address), and the output of the MUX, consisting of k tags, would be compared to the address of the item we are trying to find.

p.711:

Here is the second Big Question: If the CPU does not find the item it wants in the cache, it will bring in that item from memory, so which cache line must be evicted to make room for the new one?

p.713:

Here is the third Big Question: When the CPU does a write to an item in the cache, should the CPU update the item in memory right away, or wait until later?

Note that a write-back policy works well for some applications, while write-through is better for others. For example, consider the code

```for (i = 0; i < 1000; i++) sum += x[i];
```

Under a write-through policy, the variable sum would be written to memory repeatedly, and yet the only update to memory that would really matter would be the last one. So, this code would run faster under a write-back policy. On the other hand, consider this code:

```for (i = 0; i < 250; i++) x[4*i] = 9;
```

Suppose block size is 4 words, and suppose x is at the beginning of some block. Then only the first word in each block is being written to. Yet if our cache were of the write-back type, we would waste a lot of time, since we would write back the entire block each time, even though only one word in the block had changed value.

p.715:

The author's 15% figure for writes seems low. It used to be that a mix of 2/3 reads, 1/3 writes was the figure everyone used, and Prof. Farrens tells me that this is still true.

Make sure you understand the need for write buffers if we use a write-through policy for our cache. When we write to one word, it would be quite a slowdown if we had to wait until that write reached memory. So, we set up a "waiting area" for pending writes, and these gradually make their way to memory.

p.718:

Here is a paraphrasing of the various miss types. Imagine a cache-designing leprechaun who is watching his cache work. Each time there is a miss, he takes it personally. Here is what he might say:

• Compulsory Miss: "Well, you can't blame me for that one, can you now?"
• Capacity Miss: "If only I had designed the cache to be larger."
• Conflict Miss: "If only I had made the cache fully associative."

p.719:

For a fixed cache size, if you make the block size bigger, you have fewer blocks, and thus cannot cover as wide a range of memory. Since one never has full spatial locality in a program, the smaller number of blocks may result in more misses. Here are data for a VAX, with a 1K cache size:

 block size miss rate 4 38% 16 22% 64 19% 256 28%

The author says that increasing the cache size increases access time. Why? The reason is that more gates means more gate delay.

p.723:

Recall that in C/C++, &x is the address of x. Say that address is 200. So, if we need to read x, the CPU will look for address 200 in the cache, right? Well, maybe so, maybe not. If we have a virtual memory system, the 200 is a virtual address. This presents a design decision. Should we design the cache to search based on the 200, or should we have the VM hardware convert the 200 to the physical address first, and then look in the cache for that address? Different CPUs do it different ways.

p.724:

The Pentium instruction set includes instructions which change the contents of the CR0 register, so as to set up various cache control patterns discussed here. Note, though, that only cache-level control is possible here, not block-level control. For the latter, see p.753.

The PowerPC chip was developed jointly by IBM and Motorola. It is used in Apple Macs and IBM RS/6000 workstations.

pp.725-726

Here is the intuitive interpretation for Table 17.7. Bit Bi tells us which group (within this set) has been (pseudo) less-recently used in the subtree this bit monitors. B2, for instance, monitors lines L4-L7. B2 = 0 means that the group L4-L5 is (pseudo) less-recently used than L6-L7, whereas B2 = 1 means that L6-L7 is (pseudo) less-recently used than L4-L5. As another example, B4 monitors L2-L3, with B4 = 0 meaning that L2 is (pseudo) less-recently used than L3.

So for instance the entry labeled L1 means, "If line L1 is accessed, update the PLRU bits as follows..." Let's consider each entry:

• The table says to change B0 to 1. The interpretation is, "The (pseudo) least-recently used line among L0-L7 is among L4-L7."
• The table says to change B1 to 1. The interpretation is, "Among lines L0-L3, the (pseudo) least-recently used line is among L2-L3."
• The table says to change B3 to 0. The interpretation is, "Among lines L0-L1, the (pseudo) less-recently used line is L0."
• The table says to not change B2. The interpretation is, "Among lines L4-L7, the (pseudo) least-recently used line and subtrees are still whatever they were before."
• Etc.

So for example suppose we have accesses to a certain set in the sequence L1, L0, L4, L0, L1, L5, L6. Then the Bi would change as follows:

 access B0 B1 B2 B3 B4 B5 B6 L1 1 1 - 0 - - - L0 1 1 - 1 - - - L4 0 1 1 1 - 1 - L0 1 1 1 1 - 1 - L1 1 1 1 0 - 1 - L5 0 1 1 0 - 0 - L6 0 1 0 0 - 0 1

For example, look at the entry in this table labeled L4. Since line L4 is among L4-L7, the lines L0-L3 now become the less-recently used group, so B0 becomes 0.

p.737:

The word virtual means "apparent." So, if x is an int variable in a C program, and if &x is 200, that address 200 will be fake. The actual location of x might be, say, 5000. We say that 200 is the virtual address of x, and 5000 is its physical address. A table is stored in memory which maps, i.e. translates, each virtual address to the corresponding physical address.

Suppose for the moment there is no cache. Then each time the CPU needs to access an item in memory, it checks the mapping table to find the physical address of the item, then accesses the item. Note that this means that what would have been just one memory reference now has become two.

What if the CPU does have a cache? Well, first of all, recall that the programmer can specify some memory items to be noncacheable, so accessing any memory item would follow the pattern described above. But for the cacheable items, the CPU will still look in the cache first. If the item is there, then fine, but if not, the CPU will need to go to memory in response to this cache miss, in which case again the mapping table will be involved.

You can now see what the author was referring to on p.723. When the CPU looks in the cache for our variable x, should it look under address 200 or 5000? You could design the CPU either way. The former way would be quicker, since it would eliminate a check to the mapping table. But on the other hand, it would mean that we couldn't have data from two different processes (which may both have items with virtual address 200) in the cache at the same time.

Another issue is that the item might not even be in memory at all. At any given times, some memory items will be stored on disk, since we don't have enough memory for everything. An item's entry in the mapping table will indicate whether the item is in memory, and if not, its location on disk.

Since each byte in memory has an address, the naive approach to mapping virtual-to-physical addresses would be to have a separate table entry for each byte, i.e. an entry for byte 0, an entry for byte 1, an entry for byte 2, an entry for byte 3, an entry for byte 4, an entry for byte 5, etc. But this would result in a truly gigantic table, in fact, one which would be larger than the address space itself.

Instead, we do the mapping on groups of bytes, called pages. On Pentium machines, for instance, the page size is 4096 bytes. So, we have a table entry for bytes 0-4095, then a table entry for bytes 4096-8191, etc. If we want to find the physical address for virtual addess i, we compute the virtual page number v by finding i/4096, then go to line v in the table. The entry there will tell us the physical page number p for the item we want.

p.739:

True LRU would be out of the question. If there are p pages, then we'd have to keep track of p! permutations.

The author has a good point about write-through being infeasible. Note that disk access time is on the order of milliseconds, whereas memory access time is on the order of nanoseconds. That's a huge gap, one to be avoided as much as possible.

p.740:

The term exception refers to an execution error, such as a divide-by-0 operation. The hardware, upon sensing an exception, will cause an internal interrupt, thus transferring execution to the OS.

p.741:

The term working set means the set of pages which have been recently accessed by the currently-running program.

The problem of having too large a page size is like the problem of having too large a block size in a cache. Larger page size means that fewer pages can be in memory at a given time, so if the program's memory acceses are dispersed in lots of widespread locations, we'll have a lot of page faults.

Here's a quick tutorial on disk structure. Each bit on a disk is stored as a magnetized spot on the disk. The spots are arranged in concentric rings called tracks. Within each track, the spots are grouped into sectors, say 512 bytes each. Each time we do a disk access (which we do a sector at a time, i.e. one access is one sector), there are three components of the access time: First, the disk's read/write head must be moved to the proper track, an action called the seek. Then we must wait for the rotational delay, i.e. for the desired sector to rotate around to the read/write head. Finally, there is the transfer time, i.e. the time while the sector is actually being read/written.

Note the phrase at the bottom of the page, "we can use the virtual page number as an index into the page table." This means that the virtual page number tells us where in the page table to find the entry of interest to us, so that we avoid having to search every line in the table. This is very common phrasing, especially in the remainder of this chapter, so keep it in mind.

p.743:

Here's what's going on with the Referenced Bit: Each time the CPU accesses a memory item (again, assume no cache for the time being, this bit will be set by the hardware in the corresponding page table entry. The OS could periodically, say every 20 page faults, clear all such bits. That way, at any given time, the pages for which the bit is set are pages which have been accessed during the time spanned by the last 20 page faults (definitely including the accesses which didn't cause a fault). So, the OS would choose a page to be evicted among those for which this bit was 0.

The second sentence of the first paragraph of Sec. 18.4 is rather misleading, and the third sentence is confused. The key point is that for all the benefits virtual memory gives us (programmer convenience, security, etc.), we do pay a price for it--it slows down the machine. This is because VM doubles the number of memory references we need to make, due to the extra memory reference needed for the page table. The obvious solution is to have a special cache dedicated to the page table. That cache is called a TLB.

pp.746-747:

With the "ordinary" type of page table studied so far (forward type), the virtual page number tells us exactly which entry of the page table to look at. Here, though, we use a hash function to guess where the desired entry is, and then search for it from there.

In the design shown in the picture, the VPN is input to a hash function, which yields an entry in the HAT, which in turn yields our first guess as to the location of the page table entry for our desired page. If the VPN field there matches the VPN of our desired page (and if the Pid matches, as explained below), then we have found the correct entry and proceed as usual. If it is not a match, then the pointer field in this entry shows us which entry to check next. Note that the VPN field in the entry is like a tag field in a cache.

Note that process ID is in the page table entry. With the ordinary type of hashing, we would need a different page table for each process. After all, many processes might have a virtual address 200, for instance, and so if they all shared the same page table, the page table entry for 200 would have to give a list of all physical addresses for virtual address 200, one for each process. That would mean page table entries would be of variable length, and thus we could not use the virtual page number as a direct index into the table. (Do you see why?) But with an inverted page table, we could have several different entries, one for each process which had a virtual address 200.

p.753:

Recall that virtual memory is a collaborative effort between hardware and software. When the CPU generates a virtual address for a memory item to be accessed, the CPU will do the lookup of the page table, and if the desired page is in memory, the CPU will then access the desired item. But if the page is not in memory, the CPU will execute an internal interrupt, which causes the OS to run. The OS decides which currently-resident page to evict, goes to disk to bring in the new page, and then updates the page table accordingly.

That means that the same program, running on the same hardware but under two different OSs, might have many fewer page faults under one OS than the other. Say for example we have a straightforward C program, which we will run on the same PC, once under Windows and then under Linux. Though the two compilers will produce slightly different code (even if we use Cygwin, a version of gcc for Windows, there still will be different system calls), the two versions of the code will be pretty similar. However, they may have very different numbers of page faults, since Windows and Linux will have different policies for page eviction.

The Intel engineers has such OS flexibility in mind when they designed the Pentium. For example, they left open the bit fields labeled "Avail" in the picture. The OS designer can use these to stored information which the OS will make use of regarding this page.

The software is also allowed to control what parts of memory are cacheable and which are not. (It need not be the OS doing this. An application program could do it too, as long as it was given supervisor privilege, or the OS accepted requests from the application program.) The software would either set or clear the PCD bit to do this. Note that this will affect cacheability in both L1 and L2.

p.754:

Let's see how the Pentium instruction TLB works. It is 4-way set associative, i.e. 4 page table entries per set. There are 32 entries stored in the whole TLB, so there are 32/4 = 8 sets. That means we need 3 bits to specify the set. Those 3 bits are the lower 3 bits of the virtual address. The remaining 17 bits form the tag; if the upper 17 bits of the requested VPN match the 17 bits in any of the 4 entries in the set, we have then found the proper entry. If not, we must get the proper entry from the page table, and bring it into the TLB, evicting a currently-resident entry.

pp.756-757:

Though not mentioned explicitly either in the text or the picture, a page table entry for a given VPN is guaranteed to be in one of the 16 slots in the picture. The CPU first looks at the upper group of eight, checking the VPN fields of all eight againt the given VPN. If no match there, then it checks the second group of eight. If no match there, then it is a page fault.

p.777:

The author mentions that without DMA, the CPU might spend a substantial chunk of its time performing I/O transfer. (And a slower CPU might not be able to keep up at all.) What is bad about that? Well, recall that on a general-purpose machine, there are typically many processes running concurrently. The more time the CPU spends on doing I/O, the less time the other processes get.

Note that the term slave refers to the fact that, even though the DMA controller (DMAC) acts like a CPU--it transfers data along the bus to and from memory--it won't start its work unless told to do so by the CPU, the master. Note that it is not really the CPU that does this, but the program running on the CPU, typically part of the OS. Note also that the DMAC has its own various ports for this purpose, e.g. a Command Register at port 0x10 in the original PC.

As explained by the author, the DMAC must wait for the CPU to relinquish the bus (i.e. float its connections to the address and data lines) before beginning the I/O transfer. The CPU will do this when it doesn't need the bus itself. Fortunately, this should be fairly often, if the cache is working well. Note also that overall usage of the bus will be reduced if we use DMA; going through the CPU for I/O would require two bus cycles for each byte, whereas using DMA would need only one (hence the "D" in "DMA").

p.778:

When the author says that the DMAC "requests the bus," this is a request to the CPU to relinquish the bus, and is done by the DMAC asserting the HOLD line in the bus, and the CPU signalling relinquishment by asserting HLDA. This is an example of handshaking, which refers to an exchange of signals between two entities sharing a common communications medium, in this case the bus.

p.779:

The DREQ/DACK exchange is another example of handshaking. This is overhead, which is why the block-transfer mode in DMA is better--having only one handshaking exchange for a whole block of bytes, rather than having the exchange repeatedly for each single byte. p.784:

The terms code and codeword appears frequently in any discussion on error detection/correction. The former means the coding scheme as a whole, i.e. the set of rules it uses. The latter refers to the original data combined with bits for error detection/correction. In the parity example here, a codeword consists of 7 bits of original data plus 1 bit for error detection, so that codewords are 8-bit quantities. If our original data was 0110101, then the parity bit will be 0, and the codeword will be 01101010. If an error occurs, say on the first bit, the receiver will see 11101010, and declare that to be an illegal codeword, thus detecting that an error occurred (though not knowing which bit was in error).

Note that the transmitter is guaranteed to send out only legal codewords, but the receiver could receive either legal or illegal ones. Note also that if the receiver receives an illegal codework, then the receiver definitely knows at least one error occurred, but not conversely; if the receiver receives a legal codeword, there still could be an error.

p.785:

In the sentence, "What this implies..." the phrase "every other codeword" means "every second codeword," not "all the other codewords." This is in the same sense that the numbers 1, 3, 5, 7 comprise every second number in the sequence 0, 1, 2, 3, 4, 5, 6, 7.

p.786:

Concerning the first paragraph: The author is saying, "Suppose all legal codewords in this code differ from each other by at least a distance 3 in the Hamming sense. Then if a single-bit error occurs, the resulting codeword will still be closer to the original, i.e. correct, codeword word sent by the transmitter than to any other legal codeword." That's because the single-bit error corresponds to a Hamming distance of 1, while the distance to any other legal codeword would have to be more. Thus if there really is just one bit in error, we can correctly determine what the original codeword was, and thus the original correct message.

Here is another way to understand it. Suppose in reading an e-mail message you see the word "taht." You immediately know that the writer meant "that." How come? Because no other English word comes close to this one. That is the same point that Dandamudi is making here. If the legal codewords in a code are rather far apart from each other (distance 3 in the example), then if a 1-bit error occurs, it's easy to see what the original codeword was, because the corrupt codeword will be closer to the original than to any other legal codeword.

The reason for concentrating on single-bit errors is that multi-bit errors should be extremely rare in the present context. See the next page.

When the author says that, e.g., P1, checks itself, that is kind of confusing. Instead, it would have been better to say, "Check bit P1 looks at bit positions 3, 5, 7, 9 and 11, and then is set accordingly," just like an ordinary parity bit is set according to what the other bits do.

p.787:

"Independent" here means "statistically independent." The probability calculation is incorrect. While the probability that two specified bit positions are in error would be 10-16, the overall probability of having two bits in error somewhere in the codeword would be considerably more. The correct probability would still be much smaller than 10-8, though.

p.788:

The basic procedure is as follows. D is our message to be sent, with a variable number of bits; G is our CRC divisor, a fixed bit string hardwired into the CRC chip. We "divide" D by G (using weird arithmetic), getting a remainder of R, and then append R to D. D+R is then the codeword we send. The receiver divides the codeword by G, and should get a 0 remainder; if not, the receiver knows there was an error.

The "weird arithmetic" is bitwise mod 2. Here 0 - 1 = 1, and due to being done bitwise, there are no carries or borrows. In doing the "long division," x will "go into" y if y has at least as many bits as x, not counting leading 0s.

The author calls the 16 or 32 bits in R a "small overhead." This is true because typically the message D is far longer than that.

Why use CRC? Because various theoreticians have proven that CRC has nice properties, including considerably protection against burst errors, the type the author brings up here. In fact, if the CRC divisor used ("G") has n+1 bits, then it can detect all burst errors up to length n bits. One can show that various other large classes of errors are detected too.

p.790:

The author's quotient in Fig. 19.8 looks wrong. I get 1011011. However, the author's remainder matches mine, and his receiver action in Fig. 19.9 does work.

p.790:

The 74F401 chip gives the user a choice of using three different G's. To use the polynomial x16+x15+x2+1, i.e. G = 1100000000000011, we set the Si to 000.

p.791:

Concerning Fig. 19.10, here is a general formulation of how to build a circuit for any n-bit CRC divisor G: Set up a shift register n-1 bits long, numbered s0, s1,..., sn-2. Have si feed into si+1, i = 0, 1, ..., n-3. For any i between 0 and n-2 inclusive, if the polynomial for G has a 1 coefficient for xi, then place an XOR gate at the input to si, and have the inputs to the XOR be si-1 and sn-2. For the case i = 0, the other input is from the data string, which consists of the original data with n-1 0s appended to the less-significant end. The most-significant bit is input first. After the least-significant bit is input, the remainder string is now contained in the shift register.

p.794:

One thing the author doesn't mention here is that the Start Bit also resets the receiver clock. The sender and receiver clocks are nominally the same speed, but there is some small variation in manufacturing, so the two clocks will gradually drift apart. The author notes that the receiver will "sample in the middle of a bit [period]," but actually that means it will "TRY to sample where it THINKS the middle of the bit period is." If the two clocks have developed a discrepancy of more than half a bit period, the receiver will look at the wrong bit!

By seeing a 1-0 transition the receiver knows that this is the start of a bit time, and thus can reset its clock to 0 accordingly, and the sender and receiver will now be back in sync, for a while.

pp.801ff:

About the term bandwidth: In the computer world, the meaning of this term is simply the number of bits per second that can be transmitted. Note that even though in the USB case we are just transmitting on a single wire, in contexts with multiple wires, e.g. a system bus, the bandwidth is the total number of bits transmitted per second collectively, among all the wires. (The term itself comes from the width of whatever band one is transmitting within a frequency spectrum. This of course is not part of our course.)

In the discussion here, you will see that the author talks about sharing of the bandwidth of the USB. This simply means that multiple I/O devices will use the USB at different times. However, to fully understand this, we need to understand how the hubs work. Look at Fig. 19.22.

The first thing you notice about the figure is that it doesn't look like one long single line--it looks like a tree, and that is what it is. The figure shows nine I/O devices in a tree structure, with the root of the tree connected to the "Host," meaning your PC. The OS software gives commands to the root hub--once again, this is done via outb instructions to the port number corresponding to the root hub--and the OS either reads data coming in from the tree or writes data to go out to some node in the tree.

In that light, what does it actually mean for the bandwidth to be shared? Look at the second and third I/O devices on the right side of the figure, and suppose they are both trying to transmit to the PC (i.e. the OS is reading from them). They are both connected to a hub, and the data they send will be read by the hub and sent toward the PC out the wire coming out of the top of the hub. But these two I/O devices can't physically send through the hub at the same time; they must take turns. So they are using that top wire at different times, thus are sharing the bandwidth of that wire.

And those two I/O devices are also sharing that top wire's bandwidth with the output out the top of the hub in the bottom right of the figure--which in turn is shared by the three I/O devices down there.

Each hub is very "intelligent," i.e. is a CPU in its own right. A given hub will poll the I/O devices connected to it, meaning that at regular intervals the hub does a check of each device to see if that device has any data to send. The hub can give some devices more turns than others, i.e. give priority. This is why the author talks about things like "allocating 90% of the bandwidth to a certain device," etc.

Even though each hub is really a CPU in its own right, remember that it is a slave device, in the same sense that a DMA controller is.

Data transmission along a serial path is always done in groups of bits. Fig. 19.16 shows this in the case of asynchronous transmission, with the group size being 10-11 bits. What about the case of USB?

Well, first of all, the I/O devices will form groups of transactions. It assembles several transactions into one packet, and sends the packet to the hub. The hub then takes transactions from several of the I/O devices connected to it, and combines them into a frame. The hub then sends out frames at a rate of 1 per ms.

Fig. 19.23 depicts this. Here two I/O devices feed into a hub. You can see how the hub is giving turns to the two devices. For example, the hub has built its first frame with the first transaction from one device and the first transaction from the other device.

The author uses another standard term here, saying that the hub multiplexes the input streams from the various devices connected to the hub. This simply means that it mixes the streams together, as you can see in the figure. Note that that also means that the root hub will demultiplexes the streams before presenting them to the PC.

p.803:

When the author says that signal transitions are used "to recover data," what he really means is recovering timing. Recall from my note on p.794 above that receivers use signal transitions to reset their clocks. That's what's going on here.

A note on bit stuffing: When the sender has six consecutive 1s to send in the data (not in the signal), it throws in a fake data bit which is a 0, for the purposes of getting a signal transition, as explained by the author. You can see, then, that the receiver has to be designed to match: When it receives six consecutive data 1s, it will expect a 0 next, and then simply discard that 0.