The Common Gateway Interface (CGI)
CGI Scripts and PERL
Tommy Yip
The University of Calgary
Calgary, AB
ABSTRACT
This paper highlights the usage of CGI Scripts and PERL. The Common Gateway Interface (CGI) is a standard for interfacing external applications with information servers, such as HTTP or Web Servers. It provided the first means by which these information servers could be extended beyond HTML file serving to perform new or dynamic behaviors. One of the languages of choice for CGI processing is Practical Extraction and Report Language (PERL). PERL is used often because it is specifically designed to take apart multiple text files and format the results nicely, making it exceptional for writing HTML. Users benefit from a consistent, powerful, and usable interface environment that can do just about anything web browsers are able to handle. This illustrates the practical usefulness of CGI Scripts and PERL in the demanding interactive World-Wide Web.
INTRODUCTION
The World-Wide Web is a distributed hypermedia information network. Users navigate through the information in mainly static but context-sensitive ways with browsing tools. Browsers are client programs that run on the user's local machine, request information from server programs on remote machines, and display the information to the user. Documents for the World-Wide Web are usually written in the Hypertext Markup Language (HTML). Support for interactive applications on the World-Wide Web is provided by the Common Gateway Interface (CGI). This interface allows transferring information from the browser back to the server, processing the information by programs invoked by the server on the remote machine, and sending the results back to the browser. With this interface users have the choice of using a number of programming platforms/languages, depending on what is available on the system. A popular choice for CGI applications is the Practical Extraction and Report Language (PERL). PERL is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information.
This paper presents CGI and PERL together. It first goes over the basics and specifics of the Common Gateway Interface, then looks at some PERL basics, and finally describes issues of overhead and security.
COMMON GATEWAY INTERFACE
A DESCRIPTION
The meaning of the Common Gateway Interface can be read off its name, one word at a time. "Common" assures the user that CGI can be used from many languages and can interact with many different types of systems, so the user is not limited to doing things one way. "Gateway" suggests that CGI's strength lies not only in what it does by itself but in the potential access it offers to other systems such as databases and graphic generators. Finally, "Interface" means that CGI provides a well-defined way to call up its features; in other words, the user can write programs that use it.
WEB SERVER INTERACTION
All CGI Scripts interact with a web server the same way. Someone reading an HTML page with a web browser invokes a specific CGI script at a specific web server through its Uniform Resource Locator (URL). The HTML page may contain a form to gather some information from the reader. The web server at that location gets the web browser's request and relays user-defined inputs to the CGI Script. The CGI script starts up and reads any inputs provided to it by the web server. The CGI script sees to it that the request is carried out. A response, or the results of the request, is sent back to the web server that asked for the behavior. These results may be in the form of a new HTML file, a graphics file, a text file: basically anything that a web browser can read and parse. The CGI script can even write a custom HTML file on the spot, that is, dynamic HTML. The web server then takes the results from the CGI script and sends them back to the web browser. Finally, the user of the web browser sees (or hears) the results.
The manner (or Protocol) in which the web server passes information to the CGI script and the manner (or Protocol) in which the CGI script returns the results to the web server are fixed and fully described in the Common Gateway Interface standard. Anyone who builds a commercial, shareware or freeware HTTP/Web server supporting CGI follows this Common Gateway Interface standard.
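The round trip described above can be sketched in a few lines. The example below is only an illustration, written in Python rather than PERL for brevity: the function run_gateway and the tiny inline script are hypothetical stand-ins for a real web server and a real CGI script. It shows the server's part (set environment variables, start the script, collect its standard output) and the script's part (read its input, print a header, a blank line, and a body).

```python
import os
import subprocess
import sys

# Hypothetical stand-in for a CGI script: it reads QUERY_STRING from the
# environment and prints a header, a blank line, and then the body.
SCRIPT = (
    "import os\n"
    "print('Content-Type: text/plain')\n"
    "print()\n"
    "print('You asked for: ' + os.environ.get('QUERY_STRING', ''))\n"
)

def run_gateway(query_string):
    """Play the web server's role: set the environment, run the script,
    and return whatever the script wrote to standard output."""
    env = dict(
        os.environ,
        GATEWAY_INTERFACE="CGI/1.1",
        REQUEST_METHOD="GET",
        QUERY_STRING=query_string,
    )
    result = subprocess.run(
        [sys.executable, "-c", SCRIPT],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

output = run_gateway("zip=10003")
# Everything before the first blank line is header information.
headers, _, body = output.partition("\n\n")
print(headers)
print(body)
```

A real server would forward the headers and body to the browser instead of printing them.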
THE BIG PICTURE: WHERE DOES CGI FIT IN?
The web is composed of clients and servers. CGI is used on the server to provide additional services and functionality to the client. The following are other methods of accomplishing similar tasks:
Server Side Includes (SSI): An HTML page is parsed by the server for SSI commands before being sent to the client. Allows for limited dynamic components to be included in the web page.
Internet Server API (ISAPI): A DLL module loaded by the server. ISAPI (for Microsoft IIS), NSAPI (for Netscape's server), and other server-specific packages duplicate CGI functions but integrate more tightly with the server.
JAVA: Allows programs to be run on the client rather than the server. This means that animations and interactivity run more quickly, but some processes, most notably file access, are limited.
JavaScript/VBScript: Client side commands included in the HTML page
ActiveX: "subroutines" which are accessed via VBScript. Functionality is much like Java
ShockWave: Multimedia content including audio, video, animation and interactivity.
Although the methods mentioned above mirror many of CGI's functions, CGI is a common standard agreed on and supported by all major HTTP servers, which gives it greater portability.
EXAMPLES OF CGI USAGE
With CGI support on the web server, one can do just about anything a web browser can handle. CGI can be used to create forms on web sites that allow the user to enter information, which is processed by a CGI script and mailed to an administrator or logged. It can be used for on-the-fly pages, which are web pages created dynamically (as needed) with up-to-date information. Database interaction is an application of on-the-fly pages in which the page is built from information read from a database; a web site form can also allow a user to update database entries. Logging and counters are another common application for CGI: a log file can record traffic data on each visitor, and a counter can be included on the web page to advertise traffic. Further, CGI can be used for animation, in which "server-push" programs feed the client successive images in an animated sequence.
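As an illustration of the counter application, the following is a minimal sketch in Python (used here for brevity in place of PERL). The function name bump_counter and the file name hits.txt are hypothetical, and the sketch does no file locking, which a real counter serving simultaneous requests would need.

```python
import os
import tempfile

def bump_counter(path):
    """Read the stored hit count, add one, write it back, and return it.
    No locking is done, so this is only safe for one request at a time."""
    try:
        with open(path) as counter_file:
            count = int(counter_file.read().strip() or 0)
    except FileNotFoundError:
        count = 0  # first visit: no counter file exists yet
    count += 1
    with open(path, "w") as counter_file:
        counter_file.write(str(count))
    return count

# Simulate two visits against a counter file in a temporary directory.
with tempfile.TemporaryDirectory() as directory:
    counter_path = os.path.join(directory, "hits.txt")
    first = bump_counter(counter_path)
    second = bump_counter(counter_path)

print("Content-Type: text/html\n")
print("<p>This page has been visited %d times.</p>" % second)
```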
THE SPECIFICATION
Currently, the specification for CGI is version 1.1, or CGI/1.1. Further revisions of this protocol are guaranteed to be backward compatible.
The server and the CGI script communicate in four major ways - Environment Variables, the Command line, Standard Input, and Standard Output. In order to pass data about the information request from the server to the script, the server uses command line arguments as well as environment variables. These environment variables are set when the server executes the gateway program.
ENVIRONMENT VARIABLES
The following environment variables are not request-specific and are set for all requests:
- SERVER_SOFTWARE - The name and version of the information server software answering the request (and running the gateway). Format: name/version
- SERVER_NAME - The server's host name, DNS alias, or IP address as it would appear in self referencing URLs.
- GATEWAY_INTERFACE - The revision of the CGI specification to which this server complies. Format: CGI/revision
The following environment variables are specific to the request being fulfilled by the gateway program:
- SERVER_PROTOCOL - The name and revision of the information protocol this request came in with. Format: protocol/revision
- SERVER_PORT - The port number to which the request was sent.
- REQUEST_METHOD - The method with which the request was made. For HTTP, this is "GET", "HEAD", "POST", etc.
- PATH_INFO - The extra path information, as given by the client. In other words, scripts can be accessed by their virtual pathname, followed by extra information at the end of this path. The extra information is sent as PATH_INFO. This information should be decoded by the server if it comes from a URL before it is passed to the CGI script.
- PATH_TRANSLATED - The server provides a translated version of PATH_INFO, which takes the path and does virtual-to-physical mapping to it.
- SCRIPT_NAME - A virtual path to the script being executed, used for self referencing URLs.
- QUERY_STRING - The information which follows the ? in the URL which referenced this script. This is the query information. It should not be decoded in any fashion. This variable should always be set when there is query information, regardless of command line decoding.
- REMOTE_HOST - The hostname making the request. If the server does not have this information, it should set REMOTE_ADDR and leave this unset.
- REMOTE_ADDR - The IP address of the remote host making the request.
- AUTH_TYPE - If the server supports user authentication, and the script is protected, this is the protocol-specific authentication method used to validate the user.
- REMOTE_USER - If the server supports user authentication, and the script is protected, this is the username they have authenticated as.
- REMOTE_IDENT - If the HTTP server supports RFC 931 identification, then this variable will be set to the remote user name retrieved from the server. Usage of this variable should be limited to logging only.
- CONTENT_TYPE - For queries which have attached information, such as HTTP POST and PUT, this is the content type of the data.
- CONTENT_LENGTH - The length of the said content as given by the client.
In addition to these, the header lines received from the client, if any, are placed into the environment with the prefix HTTP_ followed by the header name. Any - characters in the header name are changed to _ characters. The server may exclude any headers which it has already processed, such as Authorization, Content-type, and Content-length. If necessary, the server may choose to exclude any or all of these headers if including them would exceed any system environment limits.
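The HTTP_ naming rule can be sketched as follows, in Python for illustration. The function names and the sample header values are hypothetical. Upper-casing the header name is an assumption beyond the text above, although it matches what servers commonly do.

```python
# Headers the server is assumed to have processed already and therefore omits.
ALREADY_PROCESSED = {"authorization", "content-type", "content-length"}

def header_to_env_name(header_name):
    """Map a client header name to its CGI environment variable name.
    Upper-casing is an assumption; the '-' to '_' change and the HTTP_
    prefix come from the rule described above."""
    return "HTTP_" + header_name.upper().replace("-", "_")

def build_request_environment(request_headers):
    """Build a small environment dictionary the way a server might."""
    env = {
        "GATEWAY_INTERFACE": "CGI/1.1",
        "REQUEST_METHOD": "GET",
    }
    for name, value in request_headers.items():
        if name.lower() in ALREADY_PROCESSED:
            continue  # the server may leave these out
        env[header_to_env_name(name)] = value
    return env

env = build_request_environment({
    "User-Agent": "ExampleBrowser/1.0",   # hypothetical client
    "Accept": "text/html",
    "Authorization": "Basic placeholder",
})
for key in sorted(env):
    print(key, "=", env[key])
```

A script then reads these values from its environment like any other variable.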
COMMAND LINE
The command line is only used in the case of an ISINDEX query. It is not used in the case of an HTML form or any as yet undefined query type. The server should search the query information (the QUERY_STRING environment variable) for a non-encoded = character to determine whether the command line is to be used. If it finds none, the server decodes the query and passes it to the script on the command line. This trusts the clients to encode the = sign in ISINDEX queries, a practice which was considered safe at the time of the design of this specification. If the server does find a "=" in the QUERY_STRING, then the command line will not be used, and no decoding will be performed. The query then remains intact for processing by an appropriate FORM submission decoder. For example, if a finger gateway script that expects an ISINDEX query receives a QUERY_STRING containing an unencoded "=", nothing is decoded, the script does not recognize a valid query, and it simply gives the user the default finger form. If the server finds that it cannot send the string due to internal limitations (such as exec() or /bin/sh command line restrictions), the server should include NO command line information and provide the non-decoded query information in the environment variable QUERY_STRING.
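The server's decision can be sketched as below, in Python for illustration. The function name command_line_args is hypothetical. Splitting the query on + signs and URL-decoding each word is an assumption about how the arguments are formed; the text above only states when the command line is used.

```python
from urllib.parse import unquote

def command_line_args(query_string):
    """Return the argument list for an ISINDEX query, or None when the
    command line is not to be used."""
    # An encoded '=' appears as %3D, so any literal '=' is a non-encoded one.
    if not query_string or "=" in query_string:
        return None
    # Assumption: words are separated by '+' and each is URL-decoded.
    return [unquote(word) for word in query_string.split("+")]

print(command_line_args("httpd"))        # ISINDEX query: command line used
print(command_line_args("httpd=name"))   # form-style query: not used
print(command_line_args("a%3Db+c"))      # encoded '=' is decoded afterwards
```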
STANDARD INPUT
For requests which have information attached after the header, such as HTTP POST or PUT, the information will be sent to the script on stdin. The server will send CONTENT_LENGTH bytes on this file descriptor. Remember that it will give the CONTENT_TYPE of the data as well. The server is in no way obligated to send end-of-file after the script reads CONTENT_LENGTH bytes.
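Because end-of-file is not guaranteed, a script should read exactly CONTENT_LENGTH bytes and no more. The sketch below is in Python for illustration. The function name read_request_body is hypothetical, and an in-memory stream stands in for the real standard input.

```python
import io

def read_request_body(environ, stdin):
    """Read exactly CONTENT_LENGTH bytes from the input stream.
    Reading until end-of-file could block, since the server need not send it."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    return stdin.read(length)

# Simulated request: nine bytes belong to the body; the rest must not be read.
fake_environ = {
    "REQUEST_METHOD": "POST",
    "CONTENT_TYPE": "application/x-www-form-urlencoded",
    "CONTENT_LENGTH": "9",
}
fake_stdin = io.BytesIO(b"zip=10003<bytes beyond the declared length>")

body = read_request_body(fake_environ, fake_stdin)
print(body)
```

In a real script the stream would be the process's binary standard input (sys.stdin.buffer in Python) and the environment would be os.environ.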
STANDARD OUTPUT
The script sends its output to stdout. This output can either be a document generated by the script, or the instructions to the server for retrieving the desired output.
LANGUAGE OF CHOICE: PERL
There are many choices when it comes to selecting a programming language for a CGI application, and the decision largely comes down to personal preference. PERL has become one of the most popular choices. Some other widely used languages are C, C++, TCL, BASIC and shell scripts. Reasons for choosing PERL include its powerful text manipulation capabilities (in particular regular expressions) and the fantastic WWW support modules available.
PRACTICAL EXTRACTION AND REPORT LANGUAGE
A DESCRIPTION
Practical Extraction and Report Language (PERL) is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. PERL is also a good language for many system management tasks. The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal). It combines some of the best features of C, SED, AWK, and SH, so people familiar with those languages should have little difficulty with it. Unlike most UNIX utilities, PERL does not arbitrarily limit the size of your data. PERL uses sophisticated pattern matching techniques to scan large amounts of data very quickly. Although optimized for scanning text, PERL can also deal with binary data, and can make dbm files look like associative arrays (where dbm is available).
PERL is not a platform-dependent language. PERL was originally designed for UNIX systems but it has been ported to a variety of platforms. A PERL program written for use on a UNIX box will run (more or less) perfectly on a PC box. There are incompatibilities between versions and platforms, but they are minor.
THE BASICS
As mentioned in the previous section, PERL has a mix of features from a variety of other programming languages. Basic syntax and semantics will not be covered here, since they are beyond the scope of this report. However, receiving user input from forms and sending information back to the user will be discussed.
RECEIVING USER INPUT FROM FORMS
Most interactive environments on the Web involve forms. The HTML for generating a form is built around the <FORM> tag, which requires two arguments: METHOD and ACTION. The ACTION is the URL representing the script which is to receive the form information. The METHOD (either GET or POST) represents the way in which the information will get passed to the script. GET is slightly more limited (mostly in the maximum length of information it can pass), but is slightly easier to deal with. If there are substantial text entry fields (esp. TEXTAREAs), the POST method should be used. The difference between these two methods is in the way information is passed.
Using METHOD="GET" we have:
- FORM elements' (INPUTs, TEXTAREAS, SELECTs, etc.) names are paired with their contents. As an example, suppose the following HTML is part of a form:
<input type="text" size="9" maxlength="9" name="zip">
into which the user entered 10003. These would be joined together with an = to make: zip=10003.
- All such name/value pairs are joined together with an &.
- The entire string is URL encoded. Assuming the form above also contains name and address fields, the resulting string is:
Name=Jane+Doe&address=35+W%27+4th+St%27&zip=10003
The string is then passed to the ACTION script in the environment variable QUERY_STRING.
With METHOD="POST", it's much the same except for how the string is delivered. For a POST, the encoded string is passed to the script's STDIN, and the length of the string in bytes is passed in the environment variable CONTENT_LENGTH.
The advantage of the "GET" method is that its input travels in the URL, so it can also be supplied as command line variables (as with ISINDEX queries). The disadvantage is that the input string is of limited length.
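The encoding steps, and the two ways the string reaches the script, can be sketched as follows. Python is used for illustration in place of PERL. The function name read_form is hypothetical, and the sketch keeps only the first value for each field name.

```python
import io
from urllib.parse import parse_qs, urlencode

# What the browser does: pair names with values, join with '&', URL encode.
fields = {"Name": "Jane Doe", "address": "35 W' 4th St'", "zip": "10003"}
encoded = urlencode(fields)
print(encoded)

def read_form(environ, stdin):
    """Return the submitted fields as a dictionary, for GET or POST."""
    if environ.get("REQUEST_METHOD") == "POST":
        # POST: the encoded string arrives on standard input.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length).decode("ascii")
    else:
        # GET: the encoded string arrives in QUERY_STRING.
        raw = environ.get("QUERY_STRING", "")
    return {name: values[0] for name, values in parse_qs(raw).items()}

via_get = read_form(
    {"REQUEST_METHOD": "GET", "QUERY_STRING": encoded},
    io.BytesIO(b""),
)
via_post = read_form(
    {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(encoded))},
    io.BytesIO(encoded.encode("ascii")),
)
print(via_get)
```

Both methods yield the same decoded fields; only the delivery channel differs.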
SENDING INFORMATION BACK TO THE USER
There are only a couple of basic things needed for sending appropriate information back to the user:
- Print Commands - Generally, print commands send information to STDOUT, which gets passed directly to the user's browser. In more advanced PERL applications, information can also be printed to a file, so the programmer needs to stay aware of where each print command is sending its output.
- Header Information - The HTTP standard includes header information which tells the browser what to do with the information it receives. The browser will interpret everything it receives, up until the first blank line, as header information. Providing outgoing header information with user output is required.
- CONTENT-TYPE - This is a borrowed element from the MIME standard. The browser at the receiving end doesn't know what sort of information it's going to get in response to the query it just sent, so it has to be told. Generally, the first thing that should be printed is "CONTENT-TYPE: text/html\n\n". Anything that is printed after that will be interpreted by the user's browser as HTML, just as if it had come from a regular HTML text file.
- LOCATION - Sometimes the owner doesn't want to print their own HTML to a user, but wants to send that user to some other URL. The location header can be used to accomplish this. If a "Location:" header naming a URL is printed with nothing else (no other HTML or "CONTENT-TYPE" or anything), the user's browser will send a request to the specified server for the page at that URL.
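The two kinds of output can be sketched as follows, in Python for illustration. The function names are hypothetical and the URL is a placeholder. In both cases the header information is separated from everything else by a blank line.

```python
def html_response(body_html):
    """Header, blank line, then the HTML the browser should render."""
    return "Content-Type: text/html\n\n" + body_html

def redirect_response(url):
    """A Location header alone, ended by the blank line that closes the headers."""
    return "Location: " + url + "\n\n"

page = html_response("<html><body><p>Thank you, Jane Doe.</p></body></html>")
redirect = redirect_response("http://www.example.com/thanks.html")  # placeholder URL

# A CGI script would write exactly one of these to standard output.
print(page)
```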
COMMON PROBLEMS