Multipart CGI forms in Perl

January 2011

I started Common Gateway Interface (CGI) server-side scripting of web pages before the advent of Perl packages for the job, such as the excellent CGI module. While the CGI module does most anything you'd want for Perl CGI scripting, I usually find that I want less between me and my HTML, and that includes forms. In this article we'll review HTML forms, and in particular look at scripting form parsers directly in Perl for POST submit methods, including file uploads.

Forms

We won't cover all aspects of form construction, as there are many useful resources on-line, such as the Tutorials Point articles. A form is included in your HTML via a <form> element. For example, the following simple form might allow you to enter a search into your search engine:

<form method="GET" action="mysearch.cgi">
   Enter a search term:
   <input type="text" name="search" size="20">
   <input type="submit" name="lookup" value="find">
</form>

When the user clicks on the find button, the form will be submitted to the action URL, in this case "mysearch.cgi", using the method specified (method GET in the example). As a rule of thumb, you should use POST methods only where the act of submitting the form would permanently alter the state of a system, such as updating a database or sending a comment. GET methods are typically preferred for pure queries, which can be repeated over and over. What happens next is dependent on the form's submission method.

For GET forms, the name=value pairs in the form are URL-encoded and appended to the action URL, following a ? separator. For the search form, entering "leopards" as the search term and clicking the find button might produce the URL http://my.domain.org/mysearch.cgi?search=leopards&lookup=find (the submit button has a name, so it is sent along with the text field). The web server passes the query part of the URL, everything after the ?, to the CGI script in its QUERY_STRING environment variable, and sets the REQUEST_METHOD variable to GET.
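As a minimal sketch of what a script sees for a GET form, the following splits a query string into its raw name=value pairs, without any decoding. The helper name raw_pairs and the fallback query string are assumptions made for illustration; in a real CGI script the string always comes from the web server.

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw query string into [name, value] pairs.
# No URL-decoding is done here; this only shows the raw structure.
sub raw_pairs {
    my ($query) = @_;
    $query = '' unless defined $query;
    # limit of 2 keeps any further "=" characters inside the value
    return map { [ split(/=/, $_, 2) ] } split(/&/, $query);
}

# Fall back to a sample string when run outside a web server.
my $query = defined $ENV{QUERY_STRING}
          ? $ENV{QUERY_STRING}
          : 'search=leopards&lookup=find';

foreach my $pair (raw_pairs($query)) {
    my ($name, $value) = @$pair;
    $value = '' unless defined $value;
    print "$name => $value\n";
}
```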

For POST forms, the name=value pairs in the form are encoded as before, but passed to the CGI script via its standard input (STDIN file descriptor), without adorning the URL. The CGI script will instead see environment variable REQUEST_METHOD with the value POST.
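Reading a simple POST body might be sketched as follows. The helper read_post_body is hypothetical, and an in-memory file stands in for STDIN so that the example runs outside a web server; a real script would pass \*STDIN and the value of CONTENT_LENGTH.

```perl
use strict;
use warnings;

# Hypothetical helper: read up to $length bytes of POST data from $in.
# read() may return fewer bytes than requested, so loop until done.
sub read_post_body {
    my ($in, $length) = @_;
    my $body = '';
    binmode $in;
    while (length($body) < $length) {
        # append at the current end of $body
        my $n = read($in, $body, $length - length($body), length($body));
        last unless $n;    # 0 on EOF, undef on error
    }
    return $body;
}

# Demonstration with an in-memory file standing in for STDIN.
my $posted = 'search=leopards&lookup=find';
open(my $in, '<', \$posted) or die "cannot open in-memory file: $!";
print read_post_body($in, length $posted), "\n";
close $in;
```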

Two Kinds of POST Method

Ordinarily, using GET or POST methods, as described above, is all you'll need. But things get complicated once you start trying to use forms for file upload. A specification adopted broadly for file upload is formally described in IETF RFC 1867. This adds a special input type, type="file", and requires the form to not only use a POST method, but also define the attribute enctype="multipart/form-data". An example of a file upload form might look like this:

<form enctype="multipart/form-data" method="POST" action="upload.cgi">
   Please choose a file to upload:
   <input type="file" size="30" name="uploadfile">
   <input type="submit" name="upload" value="Upload File">
</form>

While your CGI script still sees the environment variable REQUEST_METHOD set to POST, you won't be able to pull apart the form based on this information alone. Rather, to cater for file uploads, your script also needs to check what kind of POST has been submitted.

An ordinary POST form will be passed with the environment variable CONTENT_TYPE set to application/x-www-form-urlencoded (but make sure you check this case insensitively, using the i switch in Perl regular expression matches). In this case, parse stdin as a normal URL-encoded set of name=value pairs. A multipart form, however, will see the environment variable CONTENT_TYPE start with the string multipart/form-data; instead. This must be parsed in a very different fashion to URL-encoded data.
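The check on CONTENT_TYPE could be sketched like this. The function name post_kind and its return strings are assumptions chosen for the example, not part of any standard interface.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a POST by its CONTENT_TYPE value.
# Matching is case-insensitive, as media type names are not case-sensitive.
sub post_kind {
    my ($ctype) = @_;
    $ctype = '' unless defined $ctype;
    return 'urlencoded' if $ctype =~ m{^\s*application/x-www-form-urlencoded}i;
    return 'multipart'  if $ctype =~ m{^\s*multipart/form-data\s*;}i;
    return 'unknown';
}

foreach my $ctype ('application/x-www-form-urlencoded',
                   'Multipart/Form-Data; boundary=xyz',
                   'text/plain') {
    print "$ctype: ", post_kind($ctype), "\n";
}
```

A script would typically call post_kind($ENV{'CONTENT_TYPE'}) and refuse anything reported as unknown.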

Handling Multipart POST Forms

Officially, multipart forms are treated according to RFC 1867 and the MIME specification in RFC 1521. It's important to note that the entire form, not just any file being uploaded, is sent in the MIME format. This means that you will need to extract any other name=value pairs from the same data stream as the file data. It also means that you do not have to even upload a file to make use of multipart encoded POST forms!

Whereas line endings vary across operating systems, RFC 1521 specifies MIME line endings as strictly a CR+LF (carriage return, ASCII 13, and linefeed, ASCII 10) character pair. The basic approach to processing the multipart form involves reading it line by line (recognising only CR+LF as an end of line delimiter), looking out for specific tag strings that mark the elements of each form part. Each part of the form starts with a MIME boundary string: a unique and often random collection of characters that identifies the start of each part. The boundary string is supplied to your CGI script inside the CONTENT_TYPE environment variable, which might appear fully as: "multipart/form-data; boundary=boundary_string".
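Extracting the boundary string from CONTENT_TYPE might look like the sketch below. The function name boundary_from is hypothetical, and support for a quoted boundary value is an added precaution rather than something the browsers discussed here are known to need.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the boundary string out of a CONTENT_TYPE value.
# Returns undef if the value is not multipart/form-data or has no boundary.
sub boundary_from {
    my ($ctype) = @_;
    return undef unless defined $ctype;
    # accept either boundary="quoted value" or boundary=bare_token
    if ($ctype =~ m{^\s*multipart/form-data\s*;.*?\bboundary=(?:"([^"]+)"|([^\s;]+))}i) {
        return defined $1 ? $1 : $2;
    }
    return undef;
}

my $ctype = 'multipart/form-data; boundary=boundary_string';
my $boundary = boundary_from($ctype);
print "each part starts with: --$boundary\n";
```

Because the boundary is later matched against incoming lines, it is worth escaping it with quotemeta (or \Q...\E) before using it inside a regular expression.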

Suppose we have the following file upload form, which includes non-file input:

<form enctype="multipart/form-data" method="POST" action="upload.cgi">
   Filed by correspondent:
   <input type="text" name="user" size="20">
   <br />
   Please choose a file to upload:
   <input type="file" size="30" name="uploadfile">
   <input type="submit" name="upload" value="Upload File">
</form>

The raw text seen on stdin of your upload.cgi script might appear as follows when submitted from Firefox:

-----------------------------1137522503144128232716531729CRLF
Content-Disposition: form-data; name="user"CRLF
CRLF
BarneyCRLF
-----------------------------1137522503144128232716531729CRLF
Content-Disposition: form-data; name="uploadfile"; filename="file.txt"CRLF
Content-Type: text/plainCRLF
CRLF
Fred FlintstoneCRLF
-----------------------------1137522503144128232716531729CRLF
Content-Disposition: form-data; name="upload"CRLF
CRLF
Upload FileCRLF
-----------------------------1137522503144128232716531729--CRLF

The MIME line endings are shown as CRLF in the dump. In this case, the user entered "Barney" in the correspondent field, and uploaded a text file, "file.txt", with content "Fred Flintstone" (no end of line). The boundary string will vary from one browser to another, and is likely to change each time the form is submitted. However, each form part starts with --boundary_string, that is, the boundary string from CONTENT_TYPE prefixed with two dashes. The final part is ended with an extra two dashes, --boundary_string--. Note that the CR+LF pair immediately preceding each --boundary_string is, in fact, considered to belong to the delimiter, and is not part of the form's data. That's why the value "Barney" for input "user" is followed by CR+LF in the dump, even though the value itself contains no line ending.
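To illustrate that the CR+LF before each boundary belongs to the delimiter, here is a small sketch that splits a complete multipart body held in memory. The function split_parts and the short boundary XYZ are assumptions for the example, and holding the whole body in a string is only reasonable for small forms, not for large file uploads.

```perl
use strict;
use warnings;

# Hypothetical helper: split a small, complete multipart body into parts.
# Each part is a hash ref with the raw header text and the content.
sub split_parts {
    my ($body, $boundary) = @_;
    my $delim = "\r\n--$boundary";
    # prepend CRLF so the first boundary looks like all the others
    my @chunks = split(/\Q$delim\E/, "\r\n" . $body);
    shift @chunks;                       # discard text before the first boundary
    my @parts;
    foreach my $chunk (@chunks) {
        last if $chunk =~ /^--/;         # closing delimiter: --boundary--
        $chunk =~ s/^[ \t]*\r\n//;       # drop the rest of the boundary line
        my ($head, $content) = split(/\r\n\r\n/, $chunk, 2);
        push @parts, { head => $head, content => $content };
    }
    return @parts;
}

my $boundary = 'XYZ';
my $body = join("\r\n",
    "--$boundary",
    'Content-Disposition: form-data; name="user"',
    '',
    'Barney',
    "--$boundary--",
    '');

foreach my $part (split_parts($body, $boundary)) {
    print "header:  $part->{head}\n";
    print "content: [$part->{content}]\n";
}
```

Splitting on CR+LF plus the boundary, rather than on the boundary alone, is what leaves the value free of a trailing line ending.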

Getting Multipart Content Reliably

The easy bit is obtaining the values of regular form variables. Locate the next boundary delimiter, scan the content attributes (such as Content-Disposition:) to associate a name with each variable, and scan for the first empty line (i.e. containing nothing but CR+LF). The variable's value starts after the empty line, and continues until the next boundary delimiter. But remember not to include as part of the value the CR+LF pair just prior to the boundary.
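Scanning a Content-Disposition header for the variable name, and for a file name where present, could be sketched as follows. The function parse_disposition is a hypothetical helper, and it assumes that browsers quote both attribute values, as in the dump above.

```perl
use strict;
use warnings;

# Hypothetical helper: extract name and (optional) filename from a
# Content-Disposition header line. Returns an empty list for other lines.
sub parse_disposition {
    my ($line) = @_;
    return unless $line =~ /^content-disposition:\s*form-data\b/i;
    # the leading ";" stops name= from matching inside filename=
    my ($name)     = $line =~ /;\s*name="([^"]*)"/i;
    my ($filename) = $line =~ /;\s*filename="([^"]*)"/i;
    return ($name, $filename);
}

my $header = 'Content-Disposition: form-data; name="uploadfile"; filename="file.txt"' . "\r\n";
my ($name, $filename) = parse_disposition($header);
print "name: $name\n";
print "filename: ", (defined $filename ? $filename : '(none)'), "\n";
```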

Things get more interesting when it comes to extracting file upload content. This is because the file may be any old binary file, and not necessarily well-behaved text. In fact, the same can be said for variable values, but it's less likely to be an issue. In particular, a robust parser must cater for the very real possibility that no CR+LF pair will be seen until the end of, perhaps, many megabytes of file data!

Two bits of defensive programming will get you out of trouble. Firstly, ensure that you read stdin as a binary stream, and not as a text file as defined by your O/S and your Perl installation. Just in case, include binmode STDIN; before you read any form data. Secondly, deliberately limit the number of bytes that you buffer on each read of stdin. With a bounded buffer, it's more than likely that a read in the midst of file content will contain no CR+LF pair at all. However, eventually you'll see CR+LF either as part of the content, or when the boundary delimiter is reached.

Here's a Perl subroutine that returns the next MIME line of data from STDIN, but bounds the read where no CR+LF pair appears.

# Sub-Routine: read_bounded_line FILEHANDLE
# Reads the file referenced by FILEHANDLE to obtain the next
# MIME terminated line, or a partial line if no CRLF pair is seen
# in the buffer. read_bounded_line ends on EOF or an error.
sub read_bounded_line
{
   my $fh = shift;  # get file handle: reference to hash
   my ($rbuf, $nread, $res);
   my $toread = ($fh->{LENGTH} > 8000)? 8000 : $fh->{LENGTH};

   # if more to read and buffer below size, add some more to the buffer
   if ($toread && length($fh->{BUFFER}) < 8000) {
      $nread = read(STDIN, $rbuf, $toread);
      if ($nread > 0) {
         $fh->{LENGTH} -= $nread;
         $fh->{BUFFER} .= $rbuf;
      } else {
         # either EOF or error case!
         $fh->{LENGTH} = 0;
      }
   }
   # extract first line terminated by CRLF pair
   # note use of non-greedy *? quantifier on .
   if ($fh->{BUFFER} =~ s/^(.*?\r\n)//s) {
      $res = $1;
   } else {
      # no CRLF to be seen, but possible last CR could start a CRLF,
      # so push that back for later
      if ($fh->{LENGTH} > 0) {
         $res = substr($fh->{BUFFER}, 0, -1);
         $fh->{BUFFER} = substr($fh->{BUFFER}, -1, 1);
      } else {
         $res = $fh->{BUFFER};
         $fh->{BUFFER} = '';
      }
   }
   return $res;
}

We set up a hash reference for the file handle like this, assuming that variable $maxform specifies the largest POST form length you wish to support:

   my $fh = {
      LENGTH => ($ENV{'CONTENT_LENGTH'} > $maxform)?
                $maxform : $ENV{'CONTENT_LENGTH'},
      BUFFER => ''
   };
   ...
   while (more_input($fh)) {
      $_ = read_bounded_line($fh);
      # do stuff with $_
   }

Note the use of another utility subroutine to check for end of file:

# Sub-Routine: more_input FILEHANDLE
# Returns true if the file referenced by FILEHANDLE has more bytes to be read.
sub more_input
{
   my $fh = shift;
   return $fh->{LENGTH} || length($fh->{BUFFER});
}

Putting It All Together

Now that we know what to look for, and we can reliably read the form data, it's time to put it all together in a general-purpose CGI form parser that handles GET methods as well as both kinds of POST. To start with, we need to support simple GET and POST forms with URL-encoding, and do so with the following parsing routine:

# Sub-Routine: split_urlencoding URL
# Splits the given URL string into name=value pairs on & boundaries,
# and sets the %FORM hash with each value.
sub split_urlencoding
{
   my $url = shift;
   my ($name, $value, $pair);
   my @pairs = split(/&/, $url);
   foreach $pair (@pairs) {
     ($name, $value) = split(/=/, $pair);
     $name =~ tr/+/ /;
     $name =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
     $value =~ tr/+/ /;
     $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
     $value =~ s/<!--.*?-->//gs;   # strip embedded HTML comments
     $FORM{$name} = $value;
   }
}
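The decoding steps applied to each name and value can be tried in isolation with a sketch like this one. The helper url_decode is hypothetical and simply repeats the two substitutions used above.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one URL-encoded name or value.
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;                                    # + encodes a space
    $text =~ s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/eg;  # %XX encodes one byte
    return $text;
}

print url_decode('Fred+Flintstone%21'), "\n";   # prints "Fred Flintstone!"
```

The order matters: translating + to a space first means that a literal plus sign, sent as %2B, survives the decoding.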

Finally, here's the Perl subroutine that parses the form. We assume that all of the previous subroutines appear within scope.

# Sub-Routine: parseform MAXFORM FILEDIR
# Parses the CGI form on either the environment variable QUERY_STRING for GET
# methods or the STDIN stream for POST methods.  A form larger than MAXFORM
# bytes is limited to those bytes.
# The parsed data is stored in the %FORM hash; access a field as
# $FORM{'field_name'}.  For a file upload, the form value is the full path of
# the saved file, and the file data is written to directory FILEDIR, which is
# created by parseform if not already present.  FILEDIR is prepended to the
# file name as is, so it must end with a path separator.  Uploads with an empty
# file name are skipped, and the form name omitted as if no file was sent.
# Returns 1 if successful, 0 on error.
sub parseform
{
   my ($maxform, $filedir) = @_;
   my ($inform, $infile, $boundary, $lastline, $name, $value);
   my $fh = {
      LENGTH => ($ENV{'CONTENT_LENGTH'} > $maxform)?
                 $maxform : $ENV{'CONTENT_LENGTH'},
      BUFFER => ''
   };
   if ($ENV{'REQUEST_METHOD'} eq 'GET') {
      split_urlencoding($ENV{'QUERY_STRING'});
      return 1;
   } elsif ($ENV{'REQUEST_METHOD'} eq 'POST') {
      binmode STDIN;
      if ($ENV{'CONTENT_TYPE'} =~ /application\/x-www-form-urlencoded/i) {
         read(STDIN, $value, $fh->{LENGTH});   # honour the MAXFORM limit
         split_urlencoding($value);
      } elsif ($ENV{'CONTENT_TYPE'} =~ /multipart\/form-data;\s*boundary=(\S+)/i) {
         $boundary = quotemeta($1);   # boundary is used in regexes below
         $inform = 0;
         while (more_input($fh)) {
            $_ = read_bounded_line($fh);
            if (!$inform) {
               $inform = /^--$boundary/;
            } else {
               if (/content-disposition:\s*form-data;\s*name=\"([^\"]*)\"/i) {
                  # found start of header: extract form data name
                  $name = $1;
                  $value = '';
                  $infile = 0;
                  # is this header a file name?
                  if (/filename=\"([^\"]*)\"/i) {
                     # only process file name if one was given
                     if (length($1) > 0) {
                        $value = $1;
                        $value =~ s/://g;       # remove colons
                        $value =~ s/\\/\//g;    # turn backslashes to slashes
                        $value =~ s/^(.*\/)*//; # delete all leading path elements
                        $value =~ s/\s/_/g;     # replace whitespace with underscores
                        $value = $filedir . $value;   # add upload path
                     }
                     $infile = 1;
                  }
               } elsif (/^\r\n$/) {
                  # an empty line marks end of header, so process content that follows
                  if ($infile) {
                     # does our upload directory exist?
                     mkdir($filedir, 0700) unless (-d $filedir);
                     # only process file content if required and we can write it
                     if (length($value) > 0 && open(SAVEFILE, '>', $value)) {
                        binmode SAVEFILE;
                        $lastline = '';
                        COPYLINE: while (more_input($fh)) {
                           $_ = read_bounded_line($fh);
                           if (/^--$boundary/) {
                              $lastline =~ s/\r\n$//;
                              last COPYLINE;
                           } else {
                              print SAVEFILE $lastline;
                           }
                           $lastline = $_;
                        }
                        print SAVEFILE $lastline;
                        close SAVEFILE;
                        # form name gets the file path
                        $FORM{$name} = $value;
                     } else {
                        # skip over unused file content to next boundary
                        SKIPLINE: while (more_input($fh)) {
                           $_ = read_bounded_line($fh);
                           last SKIPLINE if (/^--$boundary/);
                        }
                     }
                  } else {
                     # we have normal content: pack it until we see a boundary
                     $lastline = '';
                     COPYVALUE: while (more_input($fh)) {
                        $_ = read_bounded_line($fh);
                        if (/^--$boundary/) {
                           $lastline =~ s/\r\n$//;
                           last COPYVALUE;
                        } else {
                           $value .= $lastline;
                        }
                        $lastline = $_;
                     }
                     $value .= $lastline;
                     # form name gets the content
                     $FORM{$name} = $value;
                  }
                  # check for -- suffix to mark end of form
                  $inform = (/^--$boundary--/)? 0 : 1;
               }
            }
         }
      }
      return 1;
   } else {
      # strange form type: return error
      return 0;
   }
}

Note that subroutine parseform immediately saves files to the directory specified, which could be either a temporary directory or a permanent upload area for your script. The form value for each uploaded file is returned with its resulting save path. Since some browsers (or malicious users) can provide arbitrary paths as a file name, parseform makes an effort to sanitise file names, removing path elements, drive specifiers, and whitespace.
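The file name clean-up can be exercised on its own with a sketch such as the following. The function sanitise_filename is a hypothetical wrapper around the same four substitutions that parseform applies, and the sample names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a client-supplied file name to a bare,
# whitespace-free name, using the same steps as parseform.
sub sanitise_filename {
    my ($name) = @_;
    $name =~ s/://g;        # remove colons (drive specifiers)
    $name =~ s{\\}{/}g;     # turn backslashes into slashes
    $name =~ s{^.*/}{}s;    # delete all leading path elements
    $name =~ s/\s/_/g;      # replace whitespace with underscores
    return $name;
}

foreach my $raw ('C:\Documents\my file.txt', '../../etc/passwd', 'plain.txt') {
    print "$raw -> ", sanitise_filename($raw), "\n";
}
```

A name that ends in a slash reduces to an empty string, which a caller would probably want to treat as if no file had been given.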

You may elaborate the error handling to suit your needs, particularly with regard to forms exceeding the specified maximum length. Currently, parseform simply truncates forms longer than the limit.
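One possible elaboration, offered only as a sketch, is to reject an oversized form before reading any of it. The function oversize_response and the wording of the message are assumptions; the Status: header is the usual CGI way of asking the web server to return a particular HTTP status.

```perl
use strict;
use warnings;

# Hypothetical helper: return a complete CGI error response if the declared
# content length exceeds the limit, or an empty string if the form is fine.
sub oversize_response {
    my ($content_length, $maxform) = @_;
    # treat a missing or malformed length as zero
    $content_length = 0
        unless defined $content_length && $content_length =~ /^\d+$/;
    return '' if $content_length <= $maxform;
    return "Status: 413 Request Entity Too Large\r\n"
         . "Content-Type: text/plain\r\n\r\n"
         . "Form data exceeds the limit of $maxform bytes.\n";
}

my $maxform = 1_000_000;   # assumed limit for the example
my $error = oversize_response($ENV{'CONTENT_LENGTH'}, $maxform);
if ($error ne '') {
    print $error;
    exit 0;
}
print "form size is acceptable\n";
```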

Although form handling may not be trivial, I hope you agree that it's easy enough to script yourself in a way that can be understood and customised to your specific needs.

Bernard Gunther

SETUP & HOLD