\documentclass[11pt]{article}

\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\topmargin}{0.0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9.0in}
\setlength{\parindent}{0in}
\setlength{\parskip}{0.1in}

\usepackage{times}
\usepackage{hyperref}
\usepackage{fancyvrb}
\usepackage{relsize}

\begin{document}

\title{A Quick, Painless Introduction to the Perl Scripting Language}

\author{Norman Matloff\\
University of California, Davis\\
\copyright{}2002-2005, N. Matloff}

\date{May 23, 2004}

\maketitle

\tableofcontents{}

\newpage

\section{What Are Scripting Languages?}

Languages like C and C++ allow a programmer to write code at a very
detailed level, yielding good execution speed. But in many applications
one would prefer to write at a higher level. For example, for
text-manipulation applications, the basic unit in C/C++ is a character,
while for languages like Perl and Python the basic units are lines of
text and words within lines. One can work with lines and words in C/C++,
but one must go to greater effort to accomplish the same thing. C/C++
might give better speed, but if speed is not an issue, the convenience
of a scripting language is very attractive.

The term {\it scripting language} has never been formally defined, but
here are the typical characteristics:

\begin{itemize}

\item Used often for system administration and ``rapid prototyping.''

\item Very casual with regard to typing of variables, e.g. no
distinction between integer, floating-point or string variables.
Functions can return nonscalars, e.g. arrays; nonscalars can be used as
loop indexes; and so on.

\item Lots of high-level operations intrinsic to the language, e.g.
stack push/pop.

\item Interpreted, rather than being compiled to the instruction set of
the host machine.

\end{itemize}

Today the most popular scripting language is probably Perl.
However, many people, including me, strongly prefer Python, as it is
much cleaner and more elegant.

Our introduction here assumes knowledge of C/C++ programming. There
will be a couple of places in which we describe things briefly in a UNIX
context, so some UNIX knowledge would be helpful.\footnote{But certainly
not required. Perl is used on Windows and Macintosh platforms too, not
just UNIX.}

\section{Goals of This Tutorial}

Perl is a very feature-rich language, which clearly cannot be discussed
in full detail here. Instead, our goals here are to (a) enable the
reader to quickly become proficient at writing simple Perl programs and
(b) prepare the reader to consult full Perl books (or Perl tutorials on
the Web) for further details of whatever Perl constructs he/she needs
for a particular application.

Our approach here is different from that of most Perl books, or even
most Perl Web tutorials. The usual approach is to painfully go over all
details from the beginning. For example, the usual approach would be to
state all possible forms that a Perl literal can take on.

We avoid this here. Again, the aim is to enable the reader to quickly
acquire a Perl foundation. He/she should then be able to delve directly
into some special topic, with little or no further learning of
foundations.

\section{A 1-Minute Introductory Example}

This program reads a text file and prints out the number of lines and
words in the file:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
# comments begin with the sharp sign

# open the file whose name is given in the first argument on the command
# line, assigning to a file handle INFILE (it is customary to choose
# all-caps names for file handles in Perl)
open(INFILE,$ARGV[0]);

# names of scalar variables must begin with $
$line_count = 0;
$word_count = 0;

# <> construct means read one line; undefined response signals EOF
while ($line = <INFILE>) {
   $line_count++;
   # break $line into an array of tokens separated by " ", using split()
   # (array names must begin with @)
   @words_on_this_line = split(" ",$line);
   # scalar() gives the length of any array
   $word_count += scalar(@words_on_this_line);
}

print "the file contains ",$line_count," lines and ",
   $word_count, " words\n";
\end{Verbatim}

Note that as in C, statements in Perl end in semicolons, and blocks are
defined via braces.

\section{Variables}

\subsection{Types}

Type is not declared in Perl, but rather is inferred from a variable's
name (see below), and is only loosely adhered to.

Note that a possible value of a variable is {\bf undef} (i.e.
undefined), which may be tested for, using a call to {\bf defined()}.

Here are the main types:

\subsubsection{Scalars}

Names of {\bf scalar} variables begin with \$.

Scalars are integers, floating-point numbers and strings. For the most
part, no distinction is made between these.

There are various exceptions, though. One class of exceptions involves
tests of equality or inequality. For example, use {\bf eq} to test
equality of strings but use {\bf ==} for numbers.

\subsubsection{Arrays}

Array names begin with @. Indices are integers beginning at 0, and the
array elements are scalars (and thus references to individual array
elements begin with \$, not @).

Arrays are referenced for the most part as in C, but in a more flexible
manner.
Their lengths are not declared, and they grow or shrink dynamically,
without ``warning,'' i.e. the programmer does not ``ask for permission''
in growing an array. For example, if the array {\bf @x} currently has 7
elements, i.e. ends at {\bf \$x[6]}, then the statement

\begin{Verbatim}[fontsize=\relsize{-2}]
$x[7] = 12;
\end{Verbatim}

changes the array length to 8. For that matter, we could have assigned
to element 99 instead of to element 7, resulting in an array length of
100.

The programmer can treat an array as a queue data structure, using the
Perl operations {\bf push} and {\bf shift} (usage of the latter is
especially common in the Perl idiom), or treat it as a stack by using
{\bf push} and {\bf pop}.

An array without a name is called a {\bf list}. For example, in

\begin{Verbatim}[fontsize=\relsize{-2}]
@x = (88,12,"abc");
\end{Verbatim}

we assign the list (88,12,"abc") to the array {\bf @x}. We will then
have {\bf \$x[0]} = 88, etc.

One of the big uses of lists and arrays is in loops, e.g.:\footnote{C-style
{\bf for} loops can be done too.}

\begin{Verbatim}[fontsize=\relsize{-2}]
# prints out 1, 2 and 4
for $i ((1,2,4)) {
   print $i, "\n";
}
\end{Verbatim}

The length of an array or list is obtained by calling {\bf scalar()}, or
by simply using the array name in a scalar context.

\begin{Verbatim}[fontsize=\relsize{-2}]
$x[0] = 15;
$x[1] = 16;
$y = shift @x;  # "output" of shift is the element shifted out
print $y, "\n";  # prints 15
print $x[0], "\n";  # prints 16
push(@x,9);  # sets $x[1] to 9
print scalar(@x), "\n";  # prints 2
print @x, "\n";  # prints 169 (16 and 9 with no space)
$k = @x;
print $k, "\n";  # prints 2
@x = ();  # @x will now be empty
print scalar(@x), "\n";  # prints 0
\end{Verbatim}

\subsubsection{Hashes}

As a first look, you can think of {\bf hashes} or {\bf associative
arrays} as arrays indexed by strings instead of by integers.
Their names begin with \%, and their elements are indexed using braces,
as in

\begin{Verbatim}[fontsize=\relsize{-2}]
$h{"abc"} = 12;
$h{"defg"} = "San Francisco";
print $h{"abc"}, "\n";  # prints 12
print $h{"defg"}, "\n";  # prints "San Francisco"
\end{Verbatim}

However, a closer look at hashes reveals them to be essentially like C
{\bf struct}s. In the above example, for instance, we have set up a
hash named \%h which is analogous to a C {\bf struct} with {\bf int} and
{\bf char []} fields, whose values here are 12 and ``San Francisco'',
respectively.

This correspondence is clearer in the equivalent (and more commonly
used) alternative code

\begin{Verbatim}[fontsize=\relsize{-2}]
%h = (abc => 12,
      defg => "San Francisco");
print $h{"abc"}, "\n";  # prints 12
print $h{"defg"}, "\n";  # prints "San Francisco"
\end{Verbatim}

Here the first two lines look rather like the declaration of a C
{\bf struct}, as in

\begin{Verbatim}[fontsize=\relsize{-2}]
struct ht {
   int abc;
   char defg[20];
};

struct ht h;

h.abc = 12;
strcpy(h.defg,"San Francisco");
\end{Verbatim}

Note, however, that there is no analog of {\bf ht} in our Perl example
above.

In fact, there are lots of other differences. For example, unlike C
{\bf struct}s, hashes actually store their field names. In the C example
above, the number 12 and the string ``San Francisco'' are stored, but
not the field names abc and defg. By contrast, Perl stores both! In the
Perl code above, if we add the line

\begin{Verbatim}[fontsize=\relsize{-2}]
print %h, "\n";
\end{Verbatim}

the output of that statement will be something like

\begin{Verbatim}[fontsize=\relsize{-2}]
abc12defgSan Francisco
\end{Verbatim}

though the order in which the name-value pairs appear is not guaranteed.

\subsubsection{References}

{\bf References} are like C pointers. They are considered scalar
variables, and thus have names beginning with \$. They are dereferenced
by prepending the symbol for the variable type, e.g.
prepending a \$ for a scalar, a @ for an array, etc.:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
# set up a reference to a scalar
$r = \3;  # \ means "reference to," like & means "pointer to" in C
# now print it; $r is a reference to a scalar, so $$r denotes that scalar
print $$r, "\n";  # prints 3
@x = (1,2,4,8,16);
$s = \@x;
# an array element is a scalar, so prepend a $
print $$s[3], "\n";  # prints 8
# for the whole array, prepend a @
print scalar(@$s), "\n";  # prints 5
\end{Verbatim}

In Line 4, for example, you should view {\bf \$\$r} as {\bf \$(\$r)},
meaning take the reference {\bf \$r} and dereference it. Since the
result of dereferencing is a scalar, we get another dollar sign on the
left.

\subsection{Anonymous Data}

{\bf Anonymous} data is somewhat analogous to data set up using
{\bf malloc()} in C. One sets up a data structure without a name, and
then points a reference variable to it.

A major use of anonymous data is to set up object-oriented programming,
if you wish to use OOP. (Covered in Section \ref{oop}.)

Anonymous arrays use brackets, and anonymous hashes use braces, instead
of parentheses. The $->$ operator is used for dereferencing. Example:

\begin{Verbatim}[fontsize=\relsize{-2}]
# $x will be a reference to an anonymous array
$x = [5, 12, 13];
print $x->[1], "\n";  # prints 12
# $y will be a reference to an anonymous hash (due to braces)
$y = {name => "penelope", age=>105};
print $y->{age}, "\n";  # prints 105
\end{Verbatim}

Note the difference between

\begin{Verbatim}[fontsize=\relsize{-2}]
$x = [5, 12, 13];
\end{Verbatim}

and

\begin{Verbatim}[fontsize=\relsize{-2}]
$x = (5, 12, 13);
\end{Verbatim}

The former sets {\bf \$x} as a reference to the anonymous array
[5,12,13], while the latter sets {\bf \$x} to 13, the last value in the
parentheses, because in a scalar context the comma operator yields its
rightmost operand. So the brackets or parentheses, as the case may be,
tell the Perl interpreter what we want.
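
Anonymous arrays and hashes can also be nested inside one another. The
following is only a minimal sketch of that idea; the variable names
{\bf \$rec} and {\bf scores} are our own, chosen purely for illustration.
It also uses the built-in function {\bf ref()} to report what kind of
thing a reference points to:

\begin{Verbatim}[fontsize=\relsize{-2}]
# a reference to an anonymous hash, one of whose fields is itself a
# reference to an anonymous array (all names here are illustrative only)
$rec = {name => "penelope", scores => [5, 12, 13]};
print $rec->{name}, "\n";  # prints penelope
print $rec->{scores}->[2], "\n";  # prints 13
# dereference the inner array as a whole to get its length
print scalar(@{$rec->{scores}}), "\n";  # prints 3
# ref() tells us what kind of data a reference points to
$x = [5, 12, 13];
print ref($x), "\n";  # prints ARRAY
print ref($rec), "\n";  # prints HASH
\end{Verbatim}

Here the braces around {\bf \$rec->\{scores\}} play the same grouping
role as the parentheses in {\bf \$(\$r)} in the discussion of references
above.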

\subsection{Declaration of Variables}

A variable need not be explicitly declared; its ``declaration'' consists
of its first usage. For example, if the statement

\begin{Verbatim}[fontsize=\relsize{-2}]
$x = 5;
\end{Verbatim}

were the first reference to {\bf \$x}, then this would both declare
{\bf \$x} and assign 5 to it.

If you wish to make a separate declaration, you can do so, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
$x;
...
$x = 5;
\end{Verbatim}

If you wish to have protection against accidentally using a variable
which has not been defined, say due to a misspelling, include a line

\begin{Verbatim}[fontsize=\relsize{-2}]
use strict;
\end{Verbatim}

at the top of your source code. Under {\bf use strict}, each variable
must be declared before it is used, typically with the {\bf my}
construct described below.

\subsection{Scope of Variables}

Variables in Perl are global by default. To make a variable local to a
subroutine or block,\footnote{This includes, for example, a block within
an {\bf if} statement.} the {\bf my} construct is used.\footnote{There
are many other scope possibilities, e.g. namespaces of packages.}

\section{Subroutines}

\subsection{Arguments, Return Values}

Arguments for a subroutine are passed via an array {\bf @\_}. Note once
again that the @ sign tells us this is an array; we can think of the
array name as being {\bf \_}, with the @ sign then telling us it is an
array.

Here are some examples:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
# read in two numbers from the command line (note: the duality of
# numbers and strings in Perl means no need for atoi()!)
$x = $ARGV[0];
$y = $ARGV[1];
# call subroutine which finds the minimum and print the latter
$z = min($x,$y);
print $z, "\n";

sub min {
   if ($_[0] < $_[1]) {return $_[0];}
   else {return $_[1];}
}
\end{Verbatim}

A common Perl idiom is to have a subroutine use {\bf shift} on
{\bf @\_} to get the arguments and assign them to local variables.

In this idiom the arguments are copied, so they are in effect passed by
value (the elements of {\bf @\_} themselves are aliases for the caller's
arguments, but we will not make use of that here). This small
restriction is more than compensated by the facts that (a) arguments
can be references, and (b) the return value can also be a list.
Here is an example illustrating all this:

\begin{Verbatim}[fontsize=\relsize{-2}]
$x = $ARGV[0];
$y = $ARGV[1];
($mn,$mx) = minmax($x,$y);
print $mn, " ", $mx, "\n";

sub minmax {
   $s = shift @_;  # get first argument
   $t = shift @_;  # get second argument
   if ($s < $t) {return ($s,$t);}  # return a list
   else {return ($t,$s);}
}
\end{Verbatim}

\subsection{Alternative Notation}
\label{altnot}

Instead of enclosing arguments within parentheses, as in C, one can
simply write them in ``command-line arguments'' fashion. For example,
the call

\begin{Verbatim}[fontsize=\relsize{-2}]
($mn,$mx) = minmax($x,$y);
\end{Verbatim}

can be written as

\begin{Verbatim}[fontsize=\relsize{-2}]
($mn,$mx) = minmax $x,$y;
\end{Verbatim}

as long as {\bf minmax()} has been defined earlier in the file than this
call, so that the interpreter already knows it is a subroutine name.

In fact, we've been doing this in all our previous examples, in our
calls to {\bf print()}. This style is often clearer.

On the other hand, if the subroutine, say {\bf x()}, has no arguments,
make sure to use the parentheses in your call:

\begin{Verbatim}[fontsize=\relsize{-2}]
x();
\end{Verbatim}

rather than

\begin{Verbatim}[fontsize=\relsize{-2}]
x;
\end{Verbatim}

In the latter case, unless {\bf x()} has already been defined earlier in
the file, the Perl interpreter will treat {\bf x} as a bare string
rather than as a call to {\bf x()}.

\subsection{Passing Subroutines As Arguments}

Older versions of Perl required that subroutines be referenced through
an ampersand preceding the name, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
($mn,$mx) = &minmax($x,$y);
\end{Verbatim}

(Note that the parentheses are needed when arguments are supplied in
this form.) In some cases we must still use the ampersand, such as when
we need to pass a subroutine name to a subroutine. The reason this need
arises is that we may write a packaged program which calls a
user-written subroutine.
Here is an example of how to do it:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
sub x {
   print "this is x\n";
}

sub y {
   print "this is y\n";
}

sub w {
   $r = shift;
   &$r();
}

w \&x;  # prints "this is x"
w \&y;  # prints "this is y"
\end{Verbatim}

\section{Confusing Defaults}

In many cases in Perl, the operands of operators have defaults if they
are not explicitly specified.

Within a subroutine, for example, the array of arguments {\bf @\_} can
be left implicit. The code

\begin{Verbatim}[fontsize=\relsize{-2}]
sub uuu {
   $a = shift;  # get first argument
   ...
}
\end{Verbatim}

will have the same effect as

\begin{Verbatim}[fontsize=\relsize{-2}]
sub uuu {
   $a = shift @_;  # get first argument
   ...
}
\end{Verbatim}

This is handy for experienced Perl programmers but a source of confusion
for beginners.

Similarly, assuming no file names were given on the command line,

\begin{Verbatim}[fontsize=\relsize{-2}]
$line = <>;
\end{Verbatim}

reads a line from the standard input (i.e. keyboard), just as the more
explicit

\begin{Verbatim}[fontsize=\relsize{-2}]
$line = <STDIN>;
\end{Verbatim}

would.

\section{String Manipulation in Perl}

One major category of Perl string constructs involves searching and
possibly replacing strings. For example, the following program acts
like the UNIX {\bf grep} command, reporting all lines found in a given
file which contain a given string (the file name and the string are
given on the command line):

\begin{Verbatim}[fontsize=\relsize{-2}]
open(INFILE,$ARGV[0]);
while ($line = <INFILE>) {
   if ($line =~ /$ARGV[1]/) {
      print $line;
   }
}
\end{Verbatim}

Here the Perl expression

\begin{Verbatim}[fontsize=\relsize{-2}]
($line =~ /$ARGV[1]/)
\end{Verbatim}

checks \$line for the given string, resulting in a {\bf true} value if
the string is found.
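
The same kind of test can be tried out without an input file. The
following is only a sketch; the subroutine name {\bf contains()} and the
sample strings are our own, chosen for illustration:

\begin{Verbatim}[fontsize=\relsize{-2}]
# returns 1 if $pattern is found anywhere in $text, 0 otherwise
# (contains() is a hypothetical helper, not a Perl built-in)
sub contains {
   my ($text, $pattern) = @_;  # copy the arguments into local variables
   if ($text =~ /$pattern/) {return 1;}
   else {return 0;}
}

print contains("the quick brown fox", "quick"), "\n";  # prints 1
print contains("the quick brown fox", "slow"), "\n";  # prints 0
\end{Verbatim}

Note that the second argument is interpreted as a pattern, not merely as
a literal string.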

In this string-matching operation Perl allows many different types of
{\bf regular expression} conditions.\footnote{If you are a UNIX user,
you may be used to this notion already.} For example,

\begin{Verbatim}[fontsize=\relsize{-2}]
open(INFILE,$ARGV[0]);
while ($line = <INFILE>) {
   if ($line =~ /us[ei]/) {
      print $line;
   }
}
\end{Verbatim}

would print out all the lines in the file which contain {\it either} the
string ``use'' {\it or} ``usi''.

Substitution is another common operation. For example, the code

\begin{Verbatim}[fontsize=\relsize{-2}]
open(INFILE,$ARGV[0]);
while ($line = <INFILE>) {
   if ($line =~ s/abc/xyz/) {
      print $line;
   }
}
\end{Verbatim}

would cull out all lines in the file which contain the string ``abc'',
replace the first instance of that string in the line by ``xyz'', and
then print out those changed lines.

There are many more string operations in the Perl repertoire.

As mentioned earlier, Perl uses {\bf eq} to test string equality; it
uses {\bf ne} to test string inequality.

\section{Perl Packages/Modules}
\label{packmod}

Object-oriented programming (OOP, see Section \ref{oop}) came late to
Perl, as an add-on, so many of the Perl programs being used in the world
do not make use of OOP. However, you are likely to encounter it anyway,
in the form of modules which you will call from your own code, even
though the latter may not be OOP in nature.

For example, if you do network programming (see Section \ref{net}), you
will probably need to include a line

\begin{Verbatim}[fontsize=\relsize{-2}]
use IO::Socket;
\end{Verbatim}

in your code. Let's look at this closely.

First, part of your Perl program's environment is the Perl search path,
in which the interpreter looks for packages that your code uses. This
path has a default value, but you can change it by using the {\bf -I}
option when you invoke Perl on the command line.

In the above example, the interpreter will look in your search path for
a directory {\bf IO}.
At that point, the interpreter will consider two
possibilities:\footnote{If you know Java, you may notice that this is
similar to the setup for Java packages.}

\begin{itemize}

\item there is a file {\bf IO/Socket.pm} where the package code resides,
or

\item there is a directory {\bf IO/Socket}, within which there are
various {\bf .pm} files which contain the package code

\end{itemize}

In our case here, it will be the latter situation. For example, on my
Linux machine, the directory {\bf /usr/lib/perl5/5.8.0/IO/Socket}
contains the files {\bf INET.pm} and {\bf UNIX.pm}, and the socket code
is in those files.

Each package is typically in a separate file. The {\bf package} keyword
begins the file. For use as a module, the file name should begin with a
capital letter. For instance, in our example above, the first
non-comment line of {\bf INET.pm} is

\begin{Verbatim}[fontsize=\relsize{-2}]
package IO::Socket::INET;
\end{Verbatim}

Any package which contains subroutines must return a value. Typically
one just includes a line

\begin{Verbatim}[fontsize=\relsize{-2}]
1;
\end{Verbatim}

at the very end, which produces a dummy return value of 1.

There are many public-domain Perl modules available in CPAN, the
Comprehensive Perl Archive Network, which is available at several sites
on the Web. Moreover, the process of downloading and installing them
has been automated!

For example, suppose you wish to write (or even just run) Perl programs
with Tk-based GUIs. If the Perl Tk module is not already on your
machine, just type

\begin{Verbatim}[fontsize=\relsize{-2}]
perl -MCPAN -e "install Tk"
\end{Verbatim}

You will be asked some questions, but just taking the defaults will
probably be enough.

\end{document}