Digital Preservation How-to

I have been doing some digital preservation work on my legacy projects. Restoring old projects that were built many years, sometimes decades, ago is no easy task; supported software versions move on quickly and it can be a lot of work to get them running again.

A lot of my projects were built using the LAMP stack – Linux, Apache, MySQL and PHP – but require old unsupported versions to work. It would be a laborious task to upgrade these projects to work on newer technology – and keep them maintained – especially when it is just for preservation.

Once the project is working locally, I generate a static HTML version. This does mean dynamic functionality, such as contact forms, will no longer work, but the majority of the content is preserved. A static version of a website that is no longer maintained has two benefits:

  1. Any security issues or server-side bugs are no longer a concern.
  2. The project can be deployed to hosting very easily.

I have used Docker to resurrect the old technology the websites were built on, generating the static HTML of each page and deploying them to GitHub Pages. Here is a summary of how I achieved this.

Using Docker

With Docker, you can create self-contained images of operating systems and define which programs are installed on them. This is a perfect way to create environments with a specific set of requirements while keeping them isolated from your main machine, so they don't interact or collide with other versions of the same programs. You can also define the images in code, which makes them reproducible.

I had a few simple projects that I wanted to resurrect. For the most basic one, I didn't need a database, so I created a simple Dockerfile which creates an image using the deprecated PHP version the project needed.

The Dockerfile extends an official PHP image, in this case 5.6 with Apache included. I then enable mod_rewrite for the .htaccess functionality the website needs. The image copies all the files from the current root directory into the default Apache web root, then exposes the default web port, 80.

# Base image: the deprecated PHP 5.6 with Apache, resurrected for preservation
FROM php:5.6-apache as project_name

# Enable mod_rewrite so the site's .htaccess rules keep working
RUN a2enmod rewrite

# Copy the project files into the default Apache web root
COPY ./ /var/www/html/

# Apache listens on the default web port
EXPOSE 80
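
If you want to sanity-check the image on its own, without Compose, you can build and run it directly. This is just an aside; the docker-compose configuration below does the same job:

docker build -t project_name .
docker run --rm -p 8000:80 project_name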

This Dockerfile is used with a basic docker-compose.yaml configuration. When you start the container using docker-compose, it builds the image from the Dockerfile defined above and mounts the root directory, so changes you make to the local files are instantly visible.

By default, the configuration maps port 8000 on your local machine to port 80, the default Apache port exposed above. You can alter the local port by creating an .env file and setting FORWARD_APP_PORT; there is an example after the configuration below. This is useful if you have a few projects running at the same time.

version: "3.7"

services:
    website:
        build:
            context: .
            dockerfile: Dockerfile
        image: project_name:latest
        container_name: project_name
        ports:
            - '${FORWARD_APP_PORT:-8000}:80'
        volumes:
            - ./:/var/www/html
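
For example, to serve this project on port 8080 instead, the .env file next to docker-compose.yaml only needs a single line (8080 being an arbitrary choice):

FORWARD_APP_PORT=8080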

Now we can start the Docker container by running the following command:

docker-compose -f ./docker-compose.yaml up -d

When running docker ps -a you should see the container has started. You should be able to visit your website in a browser at http://127.0.0.1:8000 (or whatever port you configured). Hopefully your website looks good and everything works!
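
If you prefer checking from the terminal, requesting just the response headers should come back with a 200 status (assuming the default port):

curl -I http://127.0.0.1:8000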

Generating the static website

The next stage is building a static HTML version of your website. Static-site generators such as Eleventy are incredibly popular because they take your organised templates and content and generate HTML that can be served easily and cheaply (often for free). We need to do a similar thing, but starting from an existing website.

I found Snap, a Node script written by Remy Sharp, which is the ideal tool for this. It crawls your website's internal links and generates the HTML files inside folders that match the URLs, meaning it keeps the pretty URLs that Apache's mod_rewrite allowed.

The command is simple: it takes the URL you want to crawl and, optionally, the directory you want to write the files to, which I define as www.

snap http://127.0.0.1:8000 -o www

Running this command should give you all your website's content inside the www directory, saved as full HTML pages, along with assets such as CSS, JavaScript and images.
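
As an illustration, a small site with an /about page and a stylesheet might come out looking something like this (a hypothetical layout):

www/
    index.html
    about/
        index.html
    css/
        style.css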

Deploying to GitHub Pages

The final stage of the process is to make the project available on the Web. There are a few places where you can easily serve static websites. I have used Netlify in the past, but for these projects, I am going to use GitHub Pages.

The projects are tracked using git version control and I host the repositories on GitHub, so using GitHub Pages makes sense. I found a Node script called push-dir which pushes a directory to a remote branch of your repository.

Note: you should add the www directory to your project's .gitignore, as it is a build directory and shouldn't be tracked.
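
A minimal .gitignore for this setup might look like the following; node_modules/ is included on the assumption that the Node tooling described later lives in the same repository:

www/
node_modules/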

push-dir --dir=www --branch=gh-pages

Running this command will upload the www directory to the gh-pages branch and make it available in your browser at http://<user>.github.io/repository_name/.

Because the files are served from a sub-directory, it is likely your website won't work correctly: absolute paths such as /css/style.css will point at the root of the domain rather than the repository's sub-directory…

Using a custom domain

I wanted to host these projects using a subdomain of my main website. This also solves the issue of the files being in a sub-directory. You can do this with the same GitHub Pages solution.

Firstly, you need to configure your domain's DNS by creating a CNAME record for the subdomain which points to <user>.github.io. Secondly, you need to create a CNAME file which contains the domain you're using.
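
As a sketch, assuming the hypothetical subdomain project.example.com, the DNS record at your provider would look something like this:

project.example.com.    CNAME    <user>.github.io.

And the CNAME file in the project root contains just the domain on a single line:

project.example.com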

Now, when building the HTML, the CNAME file must be copied into the www directory so it is uploaded and configures GitHub Pages correctly. We can extend the build command to copy the file after the site has been crawled:

snap http://127.0.0.1:8000 -o www && cp CNAME ./www/

Hopefully you now have your legacy PHP project statically generated and hosted on GitHub Pages using a custom domain.

Instead of remembering all the different commands, I wrap them up as NPM scripts in the package.json. Now we can start the project by running npm run start. Building and deploying are also aptly named.

Running npm run deploy will make sure the project is built first, via the predeploy script, and the build script in turn makes sure the Docker container is running, via the prebuild script.

{
    "name": "project_name",
    "title": "Project Name",
    "description": "Project Description.",
    "private": true,
    "scripts": {
        "start": "docker-compose -f ./docker-compose.yaml up -d",
        "stop": "docker stop project_name",
        "remove": "docker rm project_name",
        "prebuild": "npm run start",
        "build": "snap http://127.0.0.1:8000 -o www && cp CNAME ./www/",
        "predeploy": "npm run build",
        "deploy": "push-dir --dir=www --branch=gh-pages"
    },
    "devDependencies": {
        "@remy/snap": "^1.1.1",
        "push-dir": "^0.4.1"
    }
}
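
With these scripts in place, each step is easy to run on its own, and thanks to the pre scripts a single npm run deploy carries out the whole chain:

npm run start
npm run build
npm run deploy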