Generating a sitemap for a static site

How to move beyond a collection of HTML files

Photo by Nik Shuliahin on Unsplash
  • lastmod: the date the file was last modified, i.e. when the page was last updated. It can include a time if you want, but I’ll focus on just the date here.
  • changefreq: the ‘change frequency’, i.e. how often the content typically changes. This can vary wildly between your home page and a one-off article.
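To make these two fields concrete, here is a minimal sketch in PHP. The helper names format_lastmod() and is_valid_changefreq() are hypothetical and used only for illustration. The list of allowed values is the one defined by the sitemap protocol, and the date is formatted in UTC on the assumption that a date-only value is all you need.

```php
<?php
// Hypothetical helper: format a Unix timestamp as a date-only lastmod value.
function format_lastmod(int $timestamp): string
{
    // gmdate() formats in UTC, so the result does not depend on the server's timezone
    return gmdate('Y-m-d', $timestamp);
}

// Hypothetical helper: check a value against the changefreq values
// defined by the sitemap protocol.
function is_valid_changefreq(string $value): bool
{
    $allowed = ['always', 'hourly', 'daily', 'weekly', 'monthly', 'yearly', 'never'];
    return in_array($value, $allowed, true);
}

echo format_lastmod(0), "\n";               // 1970-01-01
var_dump(is_valid_changefreq('monthly'));   // bool(true)
var_dump(is_valid_changefreq('sometimes')); // bool(false)
```

A home page might reasonably use 'daily' or 'weekly', while a one-off article could use 'yearly' or 'never'.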

The aim

Once we’re done, we should have a simple script that we can run to generate a sitemap file for our collection of static pages. The script can be run on-demand, or via an automated process, depending on your personal setup.

The setup

Every site — even simple collections of HTML files — is unique, with different requirements, and different standards. In this article, I will use my own site as a reference, and highlight aspects that might vary in your own case.

Localhost

I edit my content on my own Mac laptop before uploading it to a server. Any text editor will do, and I run a local web server so I can test changes instantly; your own OS probably includes one by default.

GitHub Pages

I use a git repository to keep a full revision history of all content on my site. It’s pushed to GitHub, which provides free hosting for static content.

XML-compatible HTML

I do my best to write well-structured HTML, formatted to be XML-compliant. In other words, I close all tags, quote attributes, etc. It’s served as text/html, so it’s not technically XHTML; instead, it’s essentially what is known as ‘Polyglot HTML’.
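One practical benefit of XML-compliant HTML is that a page can be read with an ordinary XML parser. As a rough sketch, the hypothetical helper is_well_formed_xml() below uses PHP’s DOMDocument to check whether a piece of markup parses as XML; the two sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: report whether a string parses as well-formed XML.
function is_well_formed_xml(string $markup): bool
{
    // Collect parse errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// All tags closed, attributes quoted: parses as XML
$polyglot = '<html><head><title>Hi</title><meta charset="utf-8" /></head>'
          . '<body><p>Text</p></body></html>';

// Valid as HTML, but the unclosed <meta> and <p> are not well-formed XML
$tagSoup = '<html><head><title>Hi</title><meta charset="utf-8"></head>'
         . '<body><p>Text</body></html>';

var_dump(is_well_formed_xml($polyglot)); // bool(true)
var_dump(is_well_formed_xml($tagSoup));  // bool(false)
```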

Canonical URLs

I try to include a canonical URL in every published page. This is not only valuable in its own right, to ensure better search engine results, but it’s also useful when automating any task which might need to determine a page’s URL.

The method

I’ve written a basic PHP script to inspect all the pages in my site, determine their URLs and the last time each was updated, and then produce the final sitemap file.

  • find_files.php provides find_files(), a generic library function that I use in many projects. It takes an iterative approach to locating files underneath a given directory.
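The source of find_files.php isn’t shown here, so the following is only a guess at what an iterative file finder might look like, not the actual implementation. It assumes the function takes a directory and a file extension, and it uses an explicit stack of pending directories rather than recursion.

```php
<?php
// Hypothetical sketch of find_files(); the real implementation is not shown
// in this article. Returns the paths of all files under $dir whose extension
// equals $ext.
function find_files(string $dir, string $ext): array
{
    $found = [];
    $pending = [$dir]; // explicit stack of directories, instead of recursion

    while ($pending) {
        $current = array_pop($pending);
        $entries = scandir($current);
        if ($entries === false) {
            continue; // unreadable directory: skip it
        }
        foreach ($entries as $entry) {
            if ($entry === '.' || $entry === '..') {
                continue;
            }
            $path = $current . '/' . $entry;
            if (is_dir($path) && !is_link($path)) {
                $pending[] = $path; // visit later; symlinks are skipped to avoid loops
            } elseif (is_file($path) && pathinfo($path, PATHINFO_EXTENSION) === $ext) {
                $found[] = $path;
            }
        }
    }

    sort($found); // predictable order, regardless of traversal order
    return $found;
}
```

Calling find_files(".", "html") with this sketch would return paths such as ./index.html, relative to the current directory.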

Details

The approach is as follows:

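$files = find_files(".", "html");
foreach ($files as $file) {
    $doc = new DOMDocument();
    // @ suppresses parse warnings; load() returns false if the file is not well-formed XML
    $res = @$doc->load($file);
    if ($res === false) {
        continue;
    }
    // Look up the canonical URL declared in the page's <head>
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("/html/head/link[@rel='canonical']/@href");
    ...
    $loc = trim($nodes->item(0)->value);
    // Last modification time of the file, as a Unix timestamp
    $mtime = filemtime($file);
}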
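The final step, turning each page’s URL and modification time into sitemap XML, isn’t shown above. One possible way to do it is sketched below; build_sitemap() is a hypothetical name, the example.com URLs are placeholders, and changefreq is left out for brevity. It builds the output with DOMDocument so that special characters in URLs are escaped automatically.

```php
<?php
// Hypothetical helper: build sitemap XML from a list of [loc, mtime] pairs.
function build_sitemap(array $pages): string
{
    $ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';

    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $urlset = $doc->createElementNS($ns, 'urlset');
    $doc->appendChild($urlset);

    foreach ($pages as [$loc, $mtime]) {
        $url = $doc->createElementNS($ns, 'url');

        // createTextNode() escapes characters such as & when serialized
        $locEl = $doc->createElementNS($ns, 'loc');
        $locEl->appendChild($doc->createTextNode($loc));
        $url->appendChild($locEl);

        // Date-only lastmod, formatted in UTC
        $url->appendChild($doc->createElementNS($ns, 'lastmod', gmdate('Y-m-d', $mtime)));

        $urlset->appendChild($url);
    }

    return $doc->saveXML();
}

// Placeholder data; in the real script these values would come from
// each page's canonical URL and filemtime()
echo build_sitemap([
    ['https://example.com/', 0],
    ['https://example.com/search?q=a&lang=en', 86400],
]);
```

Building the document with DOM methods rather than string concatenation means an ampersand in a URL can’t accidentally produce invalid XML.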

The result

Running the script generates a sitemap file containing details of each valid page in the site:

/var/www/mysite $ php ~/bin/generate.php . > sitemap.xml

Final words

A static site doesn’t have to be just a set of HTML files: sitemaps and other site metadata can be provided for static sites just as they are for sites with complex backend architectures. Some of this can be maintained by hand, but automating it saves time and reduces the risk of mistakes.

Technologist & writer, Bobby is currently working on several projects including a management dashboard for static websites and an education portal. bobbyjack.me
