Archiving a dynamic site as static content

(And making pretty URLs for static content)

I recently archived my old Drupal site by creating a static copy of it. For every URL that existed on the old site, there is now a static page in the archive directory with an .html extension.

So the page that once existed at /rails now exists at /archive/rails.html.

Getting a static HTML archive of your site is fairly easy. If your blogging software uses a static page cache, you can crawl the whole site with wget so that every page gets cached, then use the contents of that cache as your archive. Mephisto does this caching by default. For Drupal, the Boost module implements a static page cache. Most good blogging platforms have some way of generating a static cache of the site. Alternatively, you can do it yourself using wget --mirror.

Once you have generated a static archive, you need to make sure these files are served for all the URLs associated with your old site. The easiest way to do this is to check whether the archive contains a file with the requested path plus an .html extension and, if so, serve it.
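Before switching over, it can help to confirm that every old URL really has a file in the archive. Here is a minimal sketch of such a check in Python. The function name find_unarchived, the list of old paths and the "archive" directory name are assumptions for illustration, not part of the original setup.

```python
from pathlib import Path


def find_unarchived(old_paths, doc_root, archive_dir="archive"):
    """Return the old URL paths that have no matching .html file in the archive.

    Hypothetical helper: assumes the layout described above, where
    /rails is stored as <doc_root>/archive/rails.html.
    """
    archive = Path(doc_root) / archive_dir
    missing = []
    for url_path in old_paths:
        relative = url_path.strip("/")
        # The archived copy is the requested path with .html appended.
        if not (archive / (relative + ".html")).is_file():
            missing.append(url_path)
    return missing
```

You would call it with the document root and a list of paths taken from, say, your old sitemap or access logs; anything it returns still needs to be archived.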

Below is a set of Apache rewrite rules for doing this.

This works by checking each request for an archived file:

  1. If a document exists that matches the requested path with .html appended, serve that. This catches all unarchived content that is statically cached.
  2. If a document exists in the archive directory that matches the requested path with .html appended, serve that. This catches all the archived HTML pages.
  3. If an item exists in the archive directory that matches the requested path, serve that. This catches archived images, CSS, PDFs, etc.
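
# Enable mod_rewrite (only needed once, if it is not already on).
RewriteEngine On

# 1. Unarchived content that has a statically cached .html copy.
RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_URI}.html -f
RewriteRule ^(.*)$ /$1.html [L]

# 2. Archived HTML pages.
RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_URI}.html !-f
RewriteCond %{DOCUMENT_ROOT}/archive/%{REQUEST_URI}.html -f
RewriteRule ^(.*)$ /archive/$1.html [L]

# 3. Other archived files (images, CSS, PDFs, etc.).
RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_URI} !-f
RewriteCond %{DOCUMENT_ROOT}/archive/%{REQUEST_URI} -f
RewriteRule ^(.*)$ /archive/$1 [L]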
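
If you want to sanity check the lookup order without touching a live server, the three rules can be modelled roughly in Python. This is only a sketch of the decision logic, not how Apache evaluates rules internally: it makes a single pass and ignores the way per-directory rewrites are re-run. The function name resolve and the "archive" default are assumptions for illustration.

```python
from pathlib import Path


def resolve(request_path, doc_root, archive_dir="archive"):
    """Return the path that would be served for request_path.

    Hypothetical single-pass model of the three rewrite rules above.
    """
    root = Path(doc_root)
    relative = request_path.lstrip("/")

    # Rule 1: a cached .html copy exists directly under the document root.
    if (root / (relative + ".html")).is_file():
        return "/" + relative + ".html"

    # Rule 2: an archived .html page exists.
    if (root / archive_dir / (relative + ".html")).is_file():
        return "/" + archive_dir + "/" + relative + ".html"

    # Rule 3: the request is not a real file, but the archive has it as-is.
    if not (root / relative).is_file() and (root / archive_dir / relative).is_file():
        return "/" + archive_dir + "/" + relative

    # No rule matched: the request is left unchanged.
    return request_path
```

Running it against a copy of your document root with a handful of old URLs shows which rule each one would fall under, and which requests pass through untouched.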