Generating a sitemap for a static site
How to move beyond a collection of HTML files
I’ve previously written about the benefits of static sites and how to layer dynamic functionality on top of them. In this article I want to introduce a simple technique for adding a sitemap to your static site.
A sitemap file can be used by search engines, or any other software, to find out about the pages available on your site. It’s essentially a machine-readable index which lives on your site alongside your content. The XML-based sitemap protocol is very straightforward and an excellent starting point for those looking to evolve their knowledge beyond plain HTML.
The sitemap contains a list of URLs with some additional, optional information. You can experiment with the data you wish to include, but in this example, I’ll be including the following:
loc: this is required, and it’s simply the URL itself.
lastmod: this is the last modification date of the file; i.e. when the page was last updated. It can include the time if you want, but I’ll just focus on the date here.
changefreq: the ‘change frequency’, i.e. how often the content typically changes. This can vary wildly between your home page and a one-off article.
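To make the format concrete, here is a minimal sketch in PHP that builds a one-entry sitemap as a string. The URL and date are placeholders I made up for illustration; the urlset namespace is the one defined by the sitemaps.org protocol.

```php
<?php
// Placeholder values, for illustration only.
$loc        = 'https://example.org/articles/sitemap';
$lastmod    = '2024-01-15'; // date only, in YYYY-MM-DD form
$changefreq = 'yearly';     // always, hourly, daily, weekly, monthly, yearly or never

// Escape the URL so characters such as & are valid inside XML.
$escapedLoc = htmlspecialchars($loc, ENT_XML1 | ENT_QUOTES, 'UTF-8');

$sitemap = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
    . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n"
    . "  <url>\n"
    . "    <loc>{$escapedLoc}</loc>\n"
    . "    <lastmod>{$lastmod}</lastmod>\n"
    . "    <changefreq>{$changefreq}</changefreq>\n"
    . "  </url>\n"
    . "</urlset>\n";

echo $sitemap;
```

Each page on the site gets its own url element inside the single urlset.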
In theory, you can update your sitemap file manually, in a text editor, just as you might edit your HTML files. This is perfectly feasible for a small, rarely updated site, but automating the task gives you one less thing to worry about keeping up-to-date.
The aim
Once we’re done, we should have a simple script that we can run to generate a sitemap file for our collection of static pages. The script can be run on-demand, or via an automated process, depending on your personal setup.
The setup
Every site — even simple collections of HTML files — is unique, with different requirements, and different standards. In this article, I will use my own site as a reference, and highlight aspects that might vary in your own case.
Localhost
I edit my content on my own Mac laptop before uploading it to a server. It’s easy to use any available text editor to do so, and I have a local web server running to instantly test everything on; your own OS probably includes one by default.
GitHub Pages
I use a git repository to keep a full revision history of all content on my site. It’s uploaded to GitHub, which provides free hosting for static content.
XML-compatible HTML
I do my best to write well-structured HTML, formatted to be XML-compliant. In other words, I close all tags, quote attributes, etc. It’s served as text/html, so it’s not technically XHTML; instead, it’s essentially what is known as ‘Polyglot HTML’.
The important factor here is that I can parse it easily using XML libraries such as the DOM-handling built into PHP.
Canonical URLs
I try to include a canonical URL in every published page. This is not only valuable in its own right, to ensure better search engine results, but it’s also useful when automating any task which might need to determine a page’s URL.
The method
I’ve written a basic PHP script to inspect all the pages in my site, determine their URLs and the last time each was updated, and then produce the final sitemap file.
The script is available in this GitHub repository, and consists of just two files:
generate.php, when run, will create the sitemap file, sitemap.xml, based on all .html files in the current directory and below.
find_files.php is a generic library function that I make use of in many projects. It uses an iterative approach to locate files underneath a given directory.
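If you’d rather not depend on the library, an iterative file search is easy to sketch. The version below is an illustrative stand-in, not the code in find_files.php: the name find_files_sketch() and its details (skipping symlinked directories, sorting the result) are my own assumptions.

```php
<?php
/**
 * Illustrative stand-in for find_files(), not the library code itself.
 * Returns the paths of all files under $dir with the extension $ext,
 * using an explicit stack of directories instead of recursion.
 */
function find_files_sketch(string $dir, string $ext): array
{
    $found = [];
    $stack = [$dir]; // directories still to visit

    while ($stack) {
        $current = array_pop($stack);
        $entries = @scandir($current);
        if ($entries === false) {
            continue; // unreadable directory: skip it
        }
        foreach ($entries as $entry) {
            if ($entry === '.' || $entry === '..') {
                continue;
            }
            $path = $current . '/' . $entry;
            if (is_dir($path) && !is_link($path)) {
                $stack[] = $path; // visit this directory later
            } elseif (is_file($path) && pathinfo($path, PATHINFO_EXTENSION) === $ext) {
                $found[] = $path;
            }
        }
    }

    sort($found); // stable, predictable ordering
    return $found;
}
```

Calling find_files_sketch(".", "html") would return paths such as ./index.html, each prefixed with the directory you passed in.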
Details
The approach is as follows:
Line 16: Using the find_files() function, get all the html files inside the current directory:
$files = find_files(".", "html");
The resulting array contains the path of each file, relative to the current directory.
Lines 23–34: Loop through this array of files, loading each one as an XML document:
$doc = new DOMDocument();
$res = @$doc->load($file);
Lines 31–32…39: Get the value of the canonical URL using XPath:
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("/html/head/link[@rel='canonical']/@href");
...
$loc = trim($nodes->item(0)->value);
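Here is a self-contained version of that step which you can run without any files on disk. The sample page is made up, and I’ve used libxml_use_internal_errors() rather than the @ operator to silence parse warnings; that substitution is mine, not what generate.php does.

```php
<?php
// Sample page, inlined so the example needs no files on disk.
$markup = <<<'XML'
<html>
  <head>
    <title>Example</title>
    <link rel="canonical" href="https://example.org/articles/example"/>
  </head>
  <body><p>Hello</p></body>
</html>
XML;

libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$doc = new DOMDocument();
$loaded = $doc->loadXML($markup);

$loc = null;
if ($loaded) {
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("/html/head/link[@rel='canonical']/@href");
    // Only read the attribute if the query matched something.
    if ($nodes !== false && $nodes->length > 0) {
        $loc = trim($nodes->item(0)->value);
    }
}

echo $loc ?? 'no canonical URL found', "\n";
```

One caveat: if your pages declare the XHTML namespace on the html element, this unprefixed path won’t match anything. In that case you’d need to register a prefix with DOMXPath::registerNamespace() and use it in the query.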
Note that you can use other means of determining the URL if you don’t use canonical URLs (although I recommend this approach). For example, you can trim the document root prefix from your file’s full path, prepend your domain, and remove the file extension.
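That path-based alternative might look something like the sketch below. The function name url_from_path() and both of its configuration arguments are hypothetical; substitute your own document root and domain.

```php
<?php
/**
 * Hypothetical helper: derive a page's URL from its file path.
 * $root is the document root on disk and $base is the site's address;
 * both are assumptions about your own setup.
 */
function url_from_path(string $file, string $root, string $base): string
{
    // Trim the document root prefix from the file's path.
    $root = rtrim($root, '/') . '/';
    if (strpos($file, $root) === 0) {
        $file = substr($file, strlen($root));
    }

    // Remove the file extension.
    $path = preg_replace('/\.html$/', '', $file);

    // Prepend the domain.
    return rtrim($base, '/') . '/' . $path;
}

echo url_from_path('/var/www/mysite/articles/sitemap.html', '/var/www/mysite', 'https://example.org'), "\n";
```

This deliberately does nothing clever with index.html files; if your server maps those to a directory URL, you’d want to special-case them.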
Line 46: Get the modification time using filemtime():
$mtime = filemtime($file);
Note that an alternative, if you store your files in a revision control system such as git, is to fetch the time of the last commit for that file. This can be advantageous if, for example, you have local changes which haven’t yet been published.
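A rough sketch of the git-based alternative follows. It assumes git is on your PATH, that the script runs inside the repository, and a Unix-like shell (for the 2>/dev/null redirect). The function name last_modified() is my own, and it falls back to filemtime() when git has nothing to say about the file.

```php
<?php
/**
 * Sketch: time of the last commit touching $file, as a Unix timestamp.
 * Falls back to filemtime() if git is unavailable or the file is untracked.
 */
function last_modified(string $file): int
{
    // %ct is the committer date of the commit, as a Unix timestamp.
    $cmd = 'git log -1 --format=%ct -- ' . escapeshellarg($file) . ' 2>/dev/null';
    $out = shell_exec($cmd);
    $out = is_string($out) ? trim($out) : '';

    if ($out !== '' && ctype_digit($out)) {
        return (int) $out;
    }
    return (int) filemtime($file);
}
```

The result can be formatted for lastmod with date('Y-m-d', last_modified($file)).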
Lines 50–60: There then follows some code to determine, approximately, the change frequency. This is very ‘dumb’: it simply looks at how long ago the file was last modified and assumes that is ‘typical’. For this example, and many real-world sites, this is perfectly acceptable; changefreq is merely a ‘hint’ to search engines, and may be ignored anyway.
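A heuristic of this kind can be sketched in a few lines. The thresholds and the name guess_changefreq() below are assumptions of mine for illustration; the cut-offs in generate.php may well differ.

```php
<?php
/**
 * Sketch: guess a changefreq value from a file's age.
 * The thresholds are illustrative assumptions, not the script's own.
 * $now can be passed in to make the function easy to test.
 */
function guess_changefreq(int $mtime, ?int $now = null): string
{
    $now = $now ?? time();
    $age = max(0, $now - $mtime); // seconds since last modification
    $day = 86400;

    if ($age < $day) {
        return 'daily';
    }
    if ($age < 7 * $day) {
        return 'weekly';
    }
    if ($age < 30 * $day) {
        return 'monthly';
    }
    return 'yearly';
}
```

In the loop you would call it as guess_changefreq(filemtime($file)).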
Again, if you store your files in a revision control system, you can get a lot more sophisticated. You could, for example, look at a number of previous commits for each file and more accurately determine its change frequency.
Lines 62–70: Finally, the XML markup is manually accumulated in a string and, ultimately, echoed to standard output. There are functions in the DOM library to do this more robustly, but for a simple case such as this, they’re not really necessary.
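If you do want the more robust route, here’s a sketch of what it could look like with DOMDocument. The entries are made-up sample data; in the real script they would come from the loop above. The main benefit is that text nodes are escaped for you, so a URL containing & still produces valid XML.

```php
<?php
// Made-up sample data; the second URL contains & to show escaping.
$entries = [
    ['loc' => 'https://example.org/', 'lastmod' => '2024-01-15', 'changefreq' => 'weekly'],
    ['loc' => 'https://example.org/search?q=a&b=c', 'lastmod' => '2023-06-01', 'changefreq' => 'yearly'],
];

$ns  = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true; // indent the output for readability

$urlset = $doc->createElementNS($ns, 'urlset');
$doc->appendChild($urlset);

foreach ($entries as $entry) {
    $url = $doc->createElementNS($ns, 'url');
    foreach ($entry as $name => $value) {
        $child = $doc->createElementNS($ns, $name);
        // createTextNode() escapes special characters such as & automatically.
        $child->appendChild($doc->createTextNode($value));
        $url->appendChild($child);
    }
    $urlset->appendChild($url);
}

$xml = $doc->saveXML();
echo $xml;
```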
The result
Running the script generates a sitemap file containing details of each valid page in the site:
/var/www/mysite $ php ~/bin/generate.php . > sitemap.xml
Adapt this command to your local environment depending on the location of your site’s document root and the generate.php script you downloaded.
Note that the final file should be named exactly sitemap.xml and should reside at the top level of your document root so that if your site is example.org, its URL will be https://example.org/sitemap.xml.
That’s it! Simply run the script, as above, whenever you update your content, and search engines will be able to find and index it with the minimum of fuss.
Final words
A static site doesn’t just have to be a set of HTML files; various site metadata and other features can be provided for static sites as well as those with complex backend architectures. Some of these can be maintained by hand, but you can also automate such processes to save time and eliminate risks.