Dave Cater

Perl - processing HTML

My first idea for a home project was to help with a little problem all of my own making. The side navigation bar used by these HTML pages was copied into each page manually. This means changes to the site layout result in changes to the hyperlinks, and so require changes to the page contents. Of course, this will soon become a maintenance nightmare!

Now, in these days of content management, there are no doubt plenty of jolly good commercial tools that get around this problem by generating HTML pages with the content stored separately from the presentation. But I had a feeling I could produce something simple for my own purposes. Perhaps a Perl script could generate the rows automatically and insert them into the pages before publication?

As a starting point, I was delighted to find Perl version 5 supplied as part of my Red Hat Linux 5.2 distribution. I decided to use this, taking advantage of the fact that the Windows directories used for my HTML pages can be mounted from within Linux. I developed a Perl script, html_mklinks.pl, which processes HTML input files supplied as command line arguments. Each input file is copied to an output directory, with its hyperlinks updated to refer to related files. The script has also been tested in a DOS session under Windows 98; see the project Perl - for portable scripts. The output directory defaults to "/tmp", so you need to use the -o option to specify a directory when running the script under DOS.
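The outer shape of such a script might look something like this. This is only a sketch of the structure described above, not the actual html_mklinks.pl; only the -o option is taken from the description, and the rest is assumed:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use File::Copy;
use Getopt::Std;

# Parse options: -o overrides the default output directory of /tmp.
my %opts;
getopts('o:', \%opts);
my $outdir = $opts{o} || '/tmp';

# Copy each input file named on the command line into the output
# directory; the real script would rewrite the copy on the way,
# regenerating its table of hyperlinks.
for my $infile (@ARGV) {
    my $outfile = "$outdir/" . basename($infile);
    copy($infile, $outfile) or die "cannot copy $infile: $!";
    # ...post-process $outfile to update the hyperlinks here...
}
```

Run under DOS, the same sketch would need the -o option, since /tmp does not exist there.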

How are files related? I envisaged a hierarchical structure, with a series of primary files, each of which may have secondary files related to it covering more detailed topics on the same subject. These in turn may have their own related files at the next level, and so on. I wanted to be able to navigate to the related higher level files, whilst at the same time still displaying the links to the lower level files I chose not to access on the way. Time alone will tell whether this navigation scheme is a good one, but with a few small changes to the Perl script the scheme could easily be altered.

I developed a file naming scheme, prefixing each file name with a sequence of digits separated by underscore characters; the URL of this page is an example. The number of underscore characters defines the "level" of each file, and related files have the same sequence of digits except for the last component. For each input file specified, the Perl script works through every file whose name matches [0-9]*.html, decides whether it is related, and if so places it in the table of hyperlinks in the corresponding output file. The text used for each hyperlink is taken from the HTML title tag of the related file, if there is one; otherwise the file name itself is used. To know where to insert the table of hyperlinks, I devised a simple scheme of HTML comments placed before and after the table. These comments include markers which the Perl script recognises.
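The matching logic can be sketched with a few helpers. The names and details below are my illustration of the scheme just described, not necessarily how html_mklinks.pl implements it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the leading digit sequence, e.g. "1_2" from "1_2.html".
sub digits {
    my ($name) = @_;
    my ($seq) = $name =~ /^(\d+(?:_\d+)*)/;
    return defined $seq ? $seq : '';
}

# The "level" of a file is the number of underscores in its digit
# prefix, e.g. "1_2_3.html" sits at level 2.
sub level {
    my ($name) = @_;
    my $count = () = digits($name) =~ /_/g;
    return $count;
}

# Two files are related when their digit sequences agree except for
# the final component, e.g. "1_2.html" and "1_3.html".
sub related {
    my ($x, $y) = @_;
    my ($px, $py) = (digits($x), digits($y));
    s/_?\d+$// for ($px, $py);   # drop the last component of each
    return $px eq $py;
}

# Hyperlink text comes from the HTML title tag if present,
# otherwise from the file name itself.
sub link_text {
    my ($name, $html) = @_;
    return $html =~ m{<title>\s*(.*?)\s*</title>}is ? $1 : $name;
}
```

So related("1_2.html", "1_3.html") is true, while related("1_2.html", "2_3.html") is not, and link_text falls back to the file name when a page has no title.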

For more details of the implementation of the script, download html_mklinks.pl here and read the internal comments.

More recently I have been using linklint - a Perl-based command-line tool - to check that the links within the web pages are correct. I found that the best command to use is:

linklint -doc lintdoc /@ -net

This checks all the HTML files in the local directory and its sub-directories (/@ simply means check the whole site). The -net option checks for the existence of non-local URLs, and the -doc option puts the results in a directory with the specified name. The most useful results file to review is urlindex.html.