Binary Representation and Storage of Information

As you may know already, all information--program variables, disk files, etc.--is stored in binary form, i.e. as 0s and 1s. This document will give you a brief summary of this material.

Bit-String Terminology

A bit is a digit which is either 0 or 1. A byte is a string of 8 bits.

A more compact way for us humans to write down long bit strings is to use hex form (hex is just notation; the bit string still consists of 0s and 1s inside the machine). The bit string is partitioned into groups of 4 bits each. Each group is then treated as a 4-bit, base-2 number (even if the bit string itself is not representing a number). For example, the bit string 1101 would be treated as the number 13, since . Then each group is given a one-character name, one of the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, meaning 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, respectively.

For example, the bit string 1001110010101110 would be partitioned as 1001 1100 1010 1110, and thus written in compact form in the C language as 0x9cae (0x denotes hex numbers in C).

Note that one byte contains two hex digits.

You can view the contents of a file in hex form by using the -x option in the Unix od command. It will show the hex contents of each byte in the file.

For example, let us look at the first two lines of an executable file a.out:

footnote: Notice the use of a Unix pipe here, as indicated by the | symbol. Ordinarily, the output of the od command would go to our screen, but instead we are piping it to the standard input of the head command, which in this case is a Unix ``filter'' which shows us the first 2 lines of the input.

heather% od -x a.out | head -2
0000000  8103 010b 0000 2000 0000 2000 0000 0000
0000020  0000 012c 0000 2020 0000 0000 0000 0000

Of course, all of this is not very meaningful unless we know something about the machine language of this machine; all we can say is that the first byte in a.out is 10000001, the second byte is 00000011, etc.

Storing Program Variables in Bit Form

Variables of Type int

For variables of int type, the base-2 form of the number is stored. The number of bits used depends both on the machine and the compiler. For the Unix compiler on Sun SPARCstations, 32-bit strings are used; note that this is 8 hex digits.

For example, if a C program source file contains the lines

int G;

G = 25;

and is compiled by the Unix compiler on a Sun SPARCstation, then the internal representation of G in memory will be

00000000000000000000000000011001

or in hex notation 0x00000019.

For negative numbers, the two's complement convention is used on most modern machines

footnote: Again, IBM mainframes are the main exception.

. Under this system, the bit representation for the negative number -n is obtained by ``winding backwards'' n times from the bit representation for 0.

On a 32-bit machine, for example, to get the bit representation for -1, we wind backwards once from

00000000000000000000000000000000

yielding

11111111111111111111111111111111

To represent -2 we wind backwards twice from 0, yielding

11111111111111111111111111111110

Similarly, the representation for -3, -4 and -5 would be

11111111111111111111111111111101

11111111111111111111111111111100

11111111111111111111111111111011

and so on.

On a 32-bit machine, there are (about 4 billion) possible bit patterns. In the manner shown above, the convention is to use half of these for positive numbers and half for negative numbers; in this way, the range of an int variable will be roughly from -2 billion to +2 billion. Note that all the positive numbers will have their leftmost bits being 0, while all the negative numbers will have a 1 in that leftmost position.

Variables of Type char

Variables of type char are handled in the following manner. Each possible character is given a number. The number 65 (in hex, 0x41), for example, means the character `A'. This is under the convention called ASCII, used by the vast majority of modern computers

footnote: Some hardware (keyboards) and software (compilers, etc.) made for use on old machines such as IBM mainframes use another convention, called EBCDIC.

. You can get a table of all ASCII codes, in hex form, by typing man ascii; a summary of it is given below.

Here is an illustration:

heather% cat gy
abc
heather% od -x gy
0000000  6162 630a

0x61 is the ASCII code for `a', 0x62 for `b', and 0x63 for `c'; 0x0a is the ASCII code for the end-of-line character.

Note that ASCII codes range from 0x00 to 0x7f (i.e. 0 to 127). Any byte in the range 0x80 to 0xff--i.e. any byte whose leading bit is a 1--may have unpredictable effects when used with type char, depending on the compiler. Some values may actually get changed.

Note, for example, the mail program may alter bytes in this range, thus transmitting incorrectly, necessitating the use of uuencode and uudecode. As an illustration of this, I compiled the simple program

 
int i;

main()

{ i = 1; }

producing the file a.out. I then mailed the file to myself, saved the mailed copy as b.out, then removed the lines of mail header information (``From:'', ``Date:'', etc.), and then compared the first few bytes of each file as follows:

heather% od -x a.out | head -2
0000000  0162 0005 33a0 2b1c 2000 0000 0060 0000
0000020  0038 0007 010b 020a 1000 0000 1000 0000
heather% ^a^b
od -x b.out | head -2
0000000  0162 2005 1c33 202b 3860 0b07 0a01 1002
0000020  2010 4001 1040 1010 7f7e 3307 0920 2e10

As you can see, the version I received in the mail was garbled. I got an execution error when I tried to run it.

In printf, scanf, etc., we can get hex format by using the %x format, just as we use %c for character format and %d for integer format.

Memory Locations

Each memory location in the computer has an identification number, called an address. In most current machines, this extends down to the byte level; each byte has its own address. The first byte in memory is Byte 0, then Byte 1, then Byte 2, and so on.

When the compiler compiles your program, it will decide at what addresses to put each of your program's variables, and how many bytes each variable occupies. If you have to know this information (and in C applications we often do need to know), we can use the & operator to find the address of a variable, and the sizeof operator to find how many bytes the variable (and all variables of that same type) takes up.

Suppose, for example, that we have an int variable Y. The statement

printf("%x %d\n",&Y,sizeof(int));

will print out for us the address of Y, and the number of bytes taken up by any int variable (for this compiler-machine combination). Say the output is

40b0 4

Then we know that Y occupies Bytes 40b0, 40b1, 40b2 and 04b3.

Components of a complex data type are guaranteed in C to be stored contiguously. Thus for example, on a Sun, if we have the declaration

int Z[5];

and if Z starts at address 5000, then Z[0] will be at 5000, Z[1] at 5004, Z[2] at 5008, Z[3] at 500c, and Z[4] at 5010.

Similarly, elements within a struct are stored contiguously, and in the order declared.

ASCII Codes

     Hexadecimal - Character

     | 00 NUL| 01 SOH| 02 STX| 03 ETX| 04 EOT| 05 ENQ| 06 ACK| 07 BEL|
     | 08 BS | 09 HT | 0A NL | 0B VT | 0C NP | 0D CR | 0E SO | 0F SI |
     | 10 DLE| 11 DC1| 12 DC2| 13 DC3| 14 DC4| 15 NAK| 16 SYN| 17 ETB|
     | 18 CAN| 19 EM | 1A SUB| 1B ESC| 1C FS | 1D GS | 1E RS | 1F US |
     | 20 SP | 21  ! | 22  " | 23  # | 24  $ | 25  % | 26  & | 27  ' |
     | 28  ( | 29  ) | 2A  * | 2B  + | 2C  , | 2D  - | 2E  . | 2F  / |
     | 30  0 | 31  1 | 32  2 | 33  3 | 34  4 | 35  5 | 36  6 | 37  7 |
     | 38  8 | 39  9 | 3A  : | 3B  ; | 3C  < | 3D  = | 3E  > | 3F  ? |
     | 40  @ | 41  A | 42  B | 43  C | 44  D | 45  E | 46  F | 47  G |
     | 48  H | 49  I | 4A  J | 4B  K | 4C  L | 4D  M | 4E  N | 4F  O |
     | 50  P | 51  Q | 52  R | 53  S | 54  T | 55  U | 56  V | 57  W |
     | 58  X | 59  Y | 5A  Z | 5B  [ | 5C  \ | 5D  ] | 5E  ^ | 5F  _ |
     | 60  ` | 61  a | 62  b | 63  c | 64  d | 65  e | 66  f | 67  g |
     | 68  h | 69  i | 6A  j | 6B  k | 6C  l | 6D  m | 6E  n | 6F  o |
     | 70  p | 71  q | 72  r | 73  s | 74  t | 75  u | 76  v | 77  w |
     | 78  x | 79  y | 7A  z | 7B  { | 7C  | | 7D  } | 7E  ~ | 7F DEL|

Norm Matloff
Wed Nov 8 17:29:18 PST 1995