Norman Matloff
Department of Computer Science
University of California at Davis
April 12, 1998
All programs described here are in the public domain, copyable free of charge. The Web site has links to the sites at which you can obtain them.
Unix has become the standard operating system in academia and in many parts of business, government and industry. Its public-domain version for PCs, Linux, has become quite popular in that realm. At the same time, the number of Chinese-speaking users of Unix has also grown tremendously since the introduction of Unix in the 1970s.
In addition to Unix's elegant functionality, one of its nicest features is that there is so much public-domain software available free of charge for Unix platforms. Here we will introduce some of the Chinese-language software which is available in public-domain form, both for native speakers and also for learners of Chinese.
Computer-phobes need not worry; Chinese computing is easy, and this document has been written so as to facilitate the learning process. It does take a little patience, though.
We assume here only a very rudimentary familiarity with Unix (e.g. the ls command to list file names in a directory), and familiarity with the vi text editor. For those who are unfamiliar with Unix (or who want to learn more), a later section in this document will point the reader to online tutorial material.
It is also assumed that the reader has been, or will be, using Unix in an X11-windows environment.
Though most computers today use the same encoding system, ASCII, for English characters (not only letters but also punctuation and numerals), there are at least three major encoding systems for Chinese: Big5, Guobiao and Hz.
In doing any work in Chinese on a computer, the user must first know what encoding system is used for files he/she is reading, or if he/she is creating a file, which system to use. Usually the user will choose one particular system, and then stick with it. If he/she is sent a file in another encoding, then he/she will simply use a conversion program (described later) to convert the file into the desired encoding.
A good rule of thumb for people who are new to Chinese computing would be to use Big5 if they prefer traditional Chinese characters or Guobiao if they use the simplified characters. Note, though, that files in Big5 and Guobiao cannot be sent through e-mail; for that purpose they should first be converted to Hz or run through uuencode, as described in a later section.
The remainder of this section will be devoted to some technical detail on encodings. Though the material is understandable by those without extensive computing backgrounds, such readers may safely skip to the next section.
Inside a computer, a character (of any language) is represented as one or more bytes, a byte being a string of eight 1s and 0s. For example, under the ASCII system which is standard for most computers today, the letter `A' is coded as 01000001.
Chinese characters are of course much more numerous than English ones, so they need longer encodings, two bytes each. For instance, under the Big5 encoding system popular in Hong Kong and Taiwan, the coding for 天 (tian1, ``sky'') is 1010010011010001.
The Guobiao encoding for tian1 is 1100110011101100.
Note that the Big5 code for tian1 above begins with a 1, while the ASCII code for `A' began with a 0. This pattern holds generally as well: All Big5 codes begin with 1, and all ASCII codes begin with 0. In this manner, in a file which contains both English and Chinese text, the software can detect whether a given byte is intended as English or Chinese. All the Guobiao encodings begin with 1 too.
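For readers who would like to inspect these bit patterns themselves, here is a minimal Python sketch. It assumes Python 3 and its standard codecs named big5 and gb2312, and it takes 天 to be the tian1 character of the example. The helper name to_bits is made up for this illustration.

```python
def to_bits(data: bytes) -> str:
    """Return the bytes as a string of 1s and 0s, eight per byte."""
    return "".join(f"{byte:08b}" for byte in data)


# One English letter and one Chinese character under two encodings.
ascii_a = "A".encode("ascii")
big5_tian = "天".encode("big5")
gb_tian = "天".encode("gb2312")

for label, raw in [("ASCII A", ascii_a),
                   ("Big5 tian1", big5_tian),
                   ("Guobiao tian1", gb_tian)]:
    bits = to_bits(raw)
    # The leading bit is what separates English from Chinese.
    print(f"{label}: {bits} (leading bit {bits[0]})")
```

The bits are printed in the order in which the bytes are stored in a file, first byte first.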
On the other hand, a problem arises in that many e-mail systems cannot properly transmit any byte which begins with a 1. Thus messages sent in Big5 or Guobiao will be received as garbled. It is for this reason that Fung-Fung Lee, a student at Stanford University, developed the Hz encoding system, which works as follows: Suppose, for example, we have a mixed Chinese/English sentence, such as
The Chinese pictograph for the word woman, 女, was recently criticized as having sexist connotations, as it is derived from a picture of a person kneeling, allegedly in subservience.
Under Hz encoding, for each Chinese character in the file, the leading 1 in each of the two bytes of the character's Guobiao code is replaced by a 0. In other words, we now have ASCII codings! But in order to distinguish between the Chinese characters and the English characters, Hz encoding inserts ~{ at the start of Chinese text and ~} at the end.
In the example above, Hz will insert ~{ just before 女 and ~} just after it. If there had been several consecutive Chinese characters, there would only be one instance each of ~{ and ~}, before the first Chinese character and after the last one, respectively. In other words, ~{ and ~} mean ``start Chinese text'' and ``end Chinese text.''
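The Hz scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of any actual Hz program: the function name hz_encode is made up, Python's standard gb2312 codec is assumed to supply the Guobiao bytes, and the sketch ignores the question of a literal ~ occurring in the English text, which the full Hz specification handles separately.

```python
def hz_encode(text: str) -> bytes:
    """Encode mixed English/Chinese text in the manner described above."""
    out = bytearray()
    in_chinese = False
    for ch in text:
        if ord(ch) < 0x80:
            # English character: close any open run of Chinese text first.
            if in_chinese:
                out += b"~}"
                in_chinese = False
            out += ch.encode("ascii")
        else:
            # Chinese character: open a run of Chinese text if needed.
            if not in_chinese:
                out += b"~{"
                in_chinese = True
            # Replace the leading 1 of each Guobiao byte by a 0.
            out += bytes(b & 0x7F for b in ch.encode("gb2312"))
    if in_chinese:
        out += b"~}"
    return bytes(out)


sentence = "The pictograph for woman, 女, was criticized."
encoded = hz_encode(sentence)
print(encoded)
```

Every byte of the result begins with a 0, so it should survive e-mail systems that mishandle bytes beginning with a 1. Python also ships a codec named hz, and encoded.decode("hz") can be used to check that the original sentence comes back.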
The cxterm program is a modified version of the xterm program, the shell window through which most X11 users issue Unix commands. The modifications enable cxterm to display Chinese characters coded in Big5 or Guobiao.
When xterm sees a byte to be displayed on the screen, it will assume the byte is ASCII (and if not, the byte won't be displayed), but cxterm will display either Chinese or ASCII, depending on whether the byte (more precisely, pair of bytes) begins with a 1. In this way cxterm will properly display text which is English, mixed English-Chinese or completely Chinese.
If you have sufficient disk space, the simplest way to install cxterm is to get the whole package cxterm-5.0.tar.gz (current version is 5.0), and take advantage of the automated installation which comes with the package. (If you do have disk-space problems, check the UC Davis ChineseWare World Wide Web site for an already-assembled executable version of cxterm for your machine.)
First, acquire the package cxterm-5.0.tar.gz from an Internet site (described in a later section) and put it in your home directory. (It will be assumed throughout this document that you do all this from your home directory, and that you are using the C shell.)
Then type the following:
gunzip cxterm-5.0.tar.gz
tar xf cxterm-5.0.tar
rm cxterm-5.0.tar
cd cxterm-5.0
./config.sh
The last step will take about 20 minutes. It is completely automated, except that it will ask you to specify the directories in which you want certain files placed, such as the font files. It does not matter which directories you specify; it is only a matter of your personal preference. However, to avoid complications, make sure that these directories are outside of the directory tree whose root is ~/cxterm-5.0.
After you have installed cxterm and have used it for a while, trying its various features, you will probably want to remove the original files, to save disk space:
rm -r ~/cxterm-5.0
I have Sun, HP, SGI, DEC and Linux binaries for an extension of cxterm, which I call kxterm, on my KX Web page.
Here is a sample method for running cxterm, used by the author:
cxterm -fh hku16et -hz BIG5 -hid ~/ChineseLanguage/CXTerm5.0/Dict/big5 -name cxtermb5 &
Here we are asking for the hku16et traditional-character font set,
stating that we will be viewing Chinese text which uses Big5 encoding,
and that the directory where we have stored the cxterm Chinese
input dictionaries is ~/ChineseLanguage/CXTerm5.0/Dict/big5.
If the author anticipated viewing Guobiao-encoded files and wished to use gb16st simplified-character fonts, he would type
cxterm -fh hanzigb16st -hz GB -hid ~/ChineseLanguage/CXTerm5.0/Dict/gb -name cxterm &
The cxterm package includes a number of methods you can use to input Chinese characters. Suppose, for instance, that you are editing a Chinese-language file and you wish to type the Chinese character 天. One of the ways you can do so is to first get into Chinese-input mode (by typing a special key, described below), and then type
tian1
The bottom of the cxterm screen will display a set of choices of characters having the pronunciation tian1, and then you can select 天 by clicking the mouse on it. The character 天 will then appear on the screen!
The files for the various input methods are in the subdirectories of
the directory ~/cxterm-5.0/dict. General documentation is in the
file ~/cxterm-5.0/Doc/input.doc. (That file uses Guobiao encoding.
If you use Big5, use a conversion program first, as described below.)
The examples below assume BIG5 encoding, using the command-line structure
as given in the example above. The function key mappings set up for GB
are different. Check the mappings in the file
~/cxterm-5.0/cxterm/CXterm.ad.
Probably the most commonly used input method is PY, which uses Mandarin pinyin, as in the example with 天 above. Let us look at that example in more detail.
The default mode is ASCII. To switch to PY mode, we hit the F4 key.
(There is a list of these keys in the file
~/cxterm-5.0/cxterm/CXterm.ad.)
We then type the pinyin representation of the sound of the character
(but not the tone), after which cxterm will display a
number of characters having that sound, of various tones. If the
character is among those displayed, we can select it by clicking on
it with the mouse.
(We can also simply type the number which appears next to the character we want, providing that number is greater than 5; otherwise the number would be interpreted as a tone.)
If the character we want is not among those displayed, we can scroll to the right by hitting the `.' key to see more characters; the `>' mark at the right end of the line indicates that there are more characters to be displayed if we do scroll. Or instead of scrolling we can then type the tone number to narrow down the list. (Even with the narrowed-down list, though, we may have to scroll.)
Another input method is English! To get the two-character compound meaning ``car'', for instance, we could hit the F8 key to get into the ``English'' method of Chinese input (sounds like a contradiction, doesn't it?), and then type ``car''. Then cxterm would give us a numbered menu of choices.
Some of the choices look odd, at first, but actually make sense when one takes into account the fact that cxterm does not know whether you have finished typing yet. For instance, Choice 3 is there because it means ``card''; so far we have only typed ``car'', but cxterm must allow for the possibility that we will next type a `d', yielding `card', so this choice is also included. We would hit the 2 key, though, since that is the choice we want.
Another feature common to many cxterm input methods, which can save you typing, is the automatic display of followup characters. If for example you type a character in PY mode, cxterm will then automatically display a menu of characters which commonly follow it in compounds. If the followup character you want is displayed, just type its number; otherwise, use the backspace key to wipe clear the display.
A number of other cxterm features can save you keystrokes. Please check the file ~/cxterm-5.0/Doc/input.doc for details.
Suppose you are a Big5 user but wish to use an input method found in the subdirectory ~/cxterm-5.0/dict/gb, designed for Guobiao.
For example, consider the file in ~/cxterm-5.0/dict/gb/CTLau.cit, for romanized-Cantonese input. You would like to have a corresponding file in your Big5 dictionary directory. To achieve this goal, go to the latter directory, and type
cp ~/cxterm-5.0/dict/gb/CTLau.tit zzzz
vi zzzz
~/cxterm-5.0/utils/tit2cit < zzzz > CTLau.cit
The vi editing step in this sequence consists only of changing the sixth line of zzzz to read
ENCODE: BIG5
Note that in such a case you will probably have to change an entry in the CXterm.ad file, so that a function key, say F6, will point to the file for this new input method.
Theoretically you can use vi or any other text editor on files which include Chinese characters. However, this may cause problems, since each Chinese character consists of two bytes, compared to one byte for English characters. The vi command x, for instance, is designed to delete one English character, so if you use it on Chinese text you would end up deleting only ``one-half'' of a Chinese character.
Thus, Chinese versions of popular Unix editors have been developed, such as celvis, a Chinese version of vi; cemacs, a Chinese version of emacs (actually just a simple emacs function which is loaded into regular emacs); a Chinese version of the joe editor (complete with online help in Chinese!); and so on. If you type the x command in celvis, for instance, celvis will first sense whether the current cursor position on the screen is at an English or Chinese character; it will then delete one byte in the former case but two bytes in the latter case.
You may obtain the source code (and for Linux, the binary executables) for these editors from the IFCSS files.
Note, though, that if you like vi, the celvis source code is now rather old, and I have found that it does not compile on many systems. (Your greatest chance of success, I find, is to use the makefile.pos makefile, and comment out the line in unix.c dealing with ospeed if you get an error message. No guarantees!)
An alternative, simple solution is as follows: First, find an "8-bit clean" version of vi, such as vim, elvis (with the -g termcap option) or xvi (with mchars set); this is necessary in order to make sure the Chinese characters appear correctly on the screen. (If they are not available on your system, you can get them from the vi home page.) Then, make use of the vi "map" feature, typing the following when you first start up vi:
:map X 2x
:map L 2l
:map H 2h
Then, whenever you wish to delete a Chinese character at the cursor, type X instead of x; to move left one Chinese character, type H instead of h; and to move right one Chinese character, type L instead of l.
On the other hand, if your document is solely or mostly Chinese instead of English, you probably don't need a powerful editor like vi anyway, in which case I would suggest the Chinese version of the joe editor mentioned above.
Suppose you are viewing a Chinese file using a Chinese editor, and wish to search for a Chinese phrase. In celvis, we would use the usual vi search command, hitting the / key, and then enter a Chinese mode such as PY and input the phrase. We would then hit the return key as usual, and celvis would perform the search for us.
A newer Chinese (and Unicode) editor is mined.
Many programs to convert from one encoding to another are available. If for example you have a file coded in Big5 but wish to convert it to Hz (say, to send to someone via e-mail), programs exist to do this.
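To give a feeling for what such a conversion involves, here is a minimal Python sketch of a Big5-to-Hz conversion. It is not one of the conversion programs referred to in this document. It assumes Python 3 with its standard big5 and hz codecs, and the function name big5_to_hz is made up for the illustration. Since Hz is built on Guobiao, a traditional character which has no Guobiao code cannot be represented, and this sketch simply replaces it by a question mark, whereas a real converter would presumably map it to its simplified form.

```python
def big5_to_hz(data: bytes) -> bytes:
    """Convert Big5-encoded bytes to Hz by way of Python's internal text form."""
    text = data.decode("big5")
    # Characters with no Guobiao code become "?" instead of stopping the run.
    return text.encode("hz", errors="replace")


big5_bytes = "天 means sky".encode("big5")
hz_bytes = big5_to_hz(big5_bytes)
print(hz_bytes)
```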
Another important conversion program is hztty, which gives you more flexibility and convenience when viewing Chinese text under cxterm. Suppose, for example, you are running cxterm under Guobiao encoding and wish to view (and even write response messages to) the Chinese-language Usenet newsgroup alt.chinese.text, which uses Hz coding. You would then type
hztty -I gb2hz -O hz2gb
The easiest way to print hard copies of Chinese files from Unix is to use the hz2ps program. If you are using Big5 encoding, I suggest that you run hz2ps with the command-line options
-big -hf kck24.hbf 12 1
So, for instance, if you have the Big5 file x.b5, you can type
hz2ps -big -hf kck24.hbf 12 1 x.b5 > x.ps
The file x.ps is a PostScript file which you can then print on any PostScript printer using the lpr command.