As you may know already, all information--program variables, disk files, etc.--is stored in binary form, i.e. as 0s and 1s. This document will give you a brief summary of this material.
A bit is a digit which is either 0 or 1. A byte is a string of 8 bits.
A more compact way for us humans to write down long bit strings is to use hex form (hex is just notation; the bit string still consists of 0s and 1s inside the machine). The bit string is partitioned into groups of 4 bits each. Each group is then treated as a 4-bit, base-2 number (even if the bit string itself is not representing a number). For example, the bit string 1101 would be treated as the number 13, since . Then each group is given a one-character name, one of the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, meaning 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, respectively.
For example, the bit string 1001110010101110 would be partitioned as 1001 1100 1010 1110, and thus written in compact form in the C language as 0x9cae (0x denotes hex numbers in C).
Note that one byte contains two hex digits.
You can view the contents of a file in hex form by using the -x option in the Unix od command. It will show the hex contents of each byte in the file.
For example, let us look at the first two lines of an executable file a.out:
heather% od -x a.out | head -2 0000000 8103 010b 0000 2000 0000 2000 0000 0000 0000020 0000 012c 0000 2020 0000 0000 0000 0000
Of course, all of this is not very meaningful unless we know something about the machine language of this machine; all we can say is that the first byte in a.out is 10000001, the second byte is 00000011, etc.
For variables of int type, the base-2 form of the number is stored. The number of bits used depends both on the machine and the compiler. For the Unix compiler on Sun SPARCstations, 32-bit strings are used; note that this is 8 hex digits.
For example, if a C program source file contains the lines
int G; G = 25;
and is compiled by the Unix compiler on a Sun SPARCstation, then the internal representation of G in memory will be
00000000000000000000000000011001
or in hex notation 0x00000019.
For negative numbers, the two's complement convention is used on most modern machines
On a 32-bit machine, for example, to get the bit representation for -1, we wind backwards once from
00000000000000000000000000000000
yielding
11111111111111111111111111111111
To represent -2 we wind backwards twice from 0, yielding
11111111111111111111111111111110
Similarly, the representation for -3, -4 and -5 would be
11111111111111111111111111111101
11111111111111111111111111111100
11111111111111111111111111111011
and so on.
On a 32-bit machine, there are (about 4 billion) possible bit patterns. In the manner shown above, the convention is to use half of these for positive numbers and half for negative numbers; in this way, the range of an int variable will be roughly from -2 billion to +2 billion. Note that all the positive numbers will have their leftmost bits being 0, while all the negative numbers will have a 1 in that leftmost position.
Variables of type char are handled in the following manner. Each possible character is given a number. The number 65 (in hex, 0x41), for example, means the character `A'. This is under the convention called ASCII, used by the vast majority of modern computers
Here is an illustration:
heather% cat gy abc heather% od -x gy 0000000 6162 630a
0x61 is the ASCII code for `a', 0x62 for `b', and 0x63 for `c'; 0x0a is the ASCII code for the end-of-line character.
Note that ASCII codes range from 0x00 to 0x7f (i.e. 0 to 127). Any byte in the range 0x80 to 0xff--i.e. any byte whose leading bit is a 1--may have unpredictable effects when used with type char, depending on the compiler. Some values may actually get changed.
Note, for example, the mail program may alter bytes in this range, thus transmitting incorrectly, necessitating the use of uuencode and uudecode. As an illustration of this, I compiled the simple program
int i; main() { i = 1; }
producing the file a.out. I then mailed the file to myself, saved the mailed copy as b.out, then removed the lines of mail header information (``From:'', ``Date:'', etc.), and then compared the first few bytes of each file as follows:
heather% od -x a.out | head -2 0000000 0162 0005 33a0 2b1c 2000 0000 0060 0000 0000020 0038 0007 010b 020a 1000 0000 1000 0000 heather% ^a^b od -x b.out | head -2 0000000 0162 2005 1c33 202b 3860 0b07 0a01 1002 0000020 2010 4001 1040 1010 7f7e 3307 0920 2e10
As you can see, the version I received in the mail was garbled. I got an execution error when I tried to run it.
In printf, scanf, etc., we can get hex format by using the %x format, just as we use %c for character format and %d for integer format.
Each memory location in the computer has an identification number, called an address. In most current machines, this extends down to the byte level; each byte has its own address. The first byte in memory is Byte 0, then Byte 1, then Byte 2, and so on.
When the compiler compiles your program, it will decide at what addresses to put each of your program's variables, and how many bytes each variable occupies. If you have to know this information (and in C applications we often do need to know), we can use the & operator to find the address of a variable, and the sizeof operator to find how many bytes the variable (and all variables of that same type) takes up.
Suppose, for example, that we have an int variable Y. The statement
printf("%x %d\n",&Y,sizeof(int));
will print out for us the address of Y, and the number of bytes taken up by any int variable (for this compiler-machine combination). Say the output is
40b0 4
Then we know that Y occupies Bytes 40b0, 40b1, 40b2 and 04b3.
Components of a complex data type are guaranteed in C to be stored contiguously. Thus for example, on a Sun, if we have the declaration
int Z[5];
and if Z starts at address 5000, then Z[0] will be at 5000, Z[1] at 5004, Z[2] at 5008, Z[3] at 500c, and Z[4] at 5010.
Similarly, elements within a struct are stored contiguously, and in the order declared.
ASCII Codes
Hexadecimal - Character | 00 NUL| 01 SOH| 02 STX| 03 ETX| 04 EOT| 05 ENQ| 06 ACK| 07 BEL| | 08 BS | 09 HT | 0A NL | 0B VT | 0C NP | 0D CR | 0E SO | 0F SI | | 10 DLE| 11 DC1| 12 DC2| 13 DC3| 14 DC4| 15 NAK| 16 SYN| 17 ETB| | 18 CAN| 19 EM | 1A SUB| 1B ESC| 1C FS | 1D GS | 1E RS | 1F US | | 20 SP | 21 ! | 22 " | 23 # | 24 $ | 25 % | 26 & | 27 ' | | 28 ( | 29 ) | 2A * | 2B + | 2C , | 2D - | 2E . | 2F / | | 30 0 | 31 1 | 32 2 | 33 3 | 34 4 | 35 5 | 36 6 | 37 7 | | 38 8 | 39 9 | 3A : | 3B ; | 3C < | 3D = | 3E > | 3F ? | | 40 @ | 41 A | 42 B | 43 C | 44 D | 45 E | 46 F | 47 G | | 48 H | 49 I | 4A J | 4B K | 4C L | 4D M | 4E N | 4F O | | 50 P | 51 Q | 52 R | 53 S | 54 T | 55 U | 56 V | 57 W | | 58 X | 59 Y | 5A Z | 5B [ | 5C \ | 5D ] | 5E ^ | 5F _ | | 60 ` | 61 a | 62 b | 63 c | 64 d | 65 e | 66 f | 67 g | | 68 h | 69 i | 6A j | 6B k | 6C l | 6D m | 6E n | 6F o | | 70 p | 71 q | 72 r | 73 s | 74 t | 75 u | 76 v | 77 w | | 78 x | 79 y | 7A z | 7B { | 7C | | 7D } | 7E ~ | 7F DEL|