Digital Preservation: Code Fixes

In my blog post about how to do digital preservation I covered setting up a legacy PHP project using Docker, building a static version of the website and then deploying the HTML to GitHub Pages. However, the projects are more than a decade old and needed a little tinkering to work correctly.

Digital Archaeology

The websites were built for the Apache web server and used .htaccess files to create nice URLs. This technique is employed so the website has friendly URLs but it also hides the technology used; there are no .php suffixes in sight. This also helped maintain the same URL structure for the static-generated website, which is now serving .html files – but you would never know!

However, for whatever reason, a lot of these projects were missing the important .htaccess file, so I had to do a little digital archaeology. By investigating the code, I reverse engineered the mod_rewrite rules needed to build the correct query parameters. Luckily the projects were relatively small and I was able to map the $_GET parameters expected in PHP to the URLs the website was using.
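To make the mapping concrete, here is a minimal sketch of the idea in Python rather than in mod_rewrite syntax. The URL patterns and the parameter names (page, id) are hypothetical and are not taken from the real projects; each pair simply says "a URL shaped like this should reach PHP with these $_GET values".

```python
import re

# Hypothetical rules: (friendly URL pattern, query parameters PHP expects).
# "\1" refers to the first captured group, much like $1 in a RewriteRule.
REWRITE_RULES = [
    (re.compile(r"^news/(\d+)/?$"), {"page": "news", "id": r"\1"}),
    (re.compile(r"^([a-z-]+)/?$"), {"page": r"\1"}),
]


def resolve(path):
    """Return the query parameters for a friendly URL, or None if no rule matches."""
    path = path.lstrip("/")
    for pattern, params in REWRITE_RULES:
        match = pattern.match(path)
        if match:
            # Substitute captured groups into each parameter template.
            return {key: match.expand(value) for key, value in params.items()}
    return None
```

Writing the rules down as data like this made it easier for me to reason about them: every $_GET key found in the PHP code should appear in at least one rule, and every URL the site links to should match one.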

Database Issues

The legacy code called mysql_query, which belongs to the old mysql extension that was removed in PHP 7. I had to switch to mysqli_query, which also meant passing the connection as the first parameter. Once these changes were made, the database-driven projects started showing a lot more content.
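For a codebase with many calls, the change can be roughed out with a search and replace. The sketch below is a hypothetical helper, not the process actually used on these projects. It assumes every call has the single-argument form mysql_query($sql) and that the connection lives in a variable named $link; calls that already pass a connection as the second argument would need fixing by hand.

```python
import re

# Matches "mysql_query(" but not "mysqli_query(" or "my_mysql_query(".
MYSQL_QUERY_CALL = re.compile(r"\bmysql_query\s*\(")


def migrate_mysql_query(php_source, link_var="$link"):
    """Rewrite single-argument mysql_query() calls to mysqli_query().

    link_var is an assumed name for the variable holding the mysqli connection.
    """
    return MYSQL_QUERY_CALL.sub(
        lambda match: f"mysqli_query({link_var}, ", php_source
    )
```

The result still needs reviewing, and the other mysql_* calls (connecting, fetching rows, escaping) need their mysqli equivalents too.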

For two of the projects, I was missing a database export that was required for certain sections. The website still worked but was missing core functionality. Again, I investigated the codebase, looking for SQL SELECT statements, and then rebuilt the database tables to match. Once I had a basic structure, I populated a few rows and the website sprang to life!
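The hunt for SELECT statements can be partly automated. This is a rough sketch, assuming simple queries of the form SELECT columns FROM table written inline in the PHP source; joins, aliases and subqueries are not handled, and the table and column names in the usage are made up.

```python
import re
from collections import defaultdict

# Captures the column list and the first table name of a simple SELECT.
SELECT_RE = re.compile(
    r"SELECT\s+(.+?)\s+FROM\s+`?(\w+)`?", re.IGNORECASE | re.DOTALL
)


def infer_tables(php_source):
    """Return {table: sorted column names} for the simple SELECTs in the source."""
    tables = defaultdict(set)
    for columns, table in SELECT_RE.findall(php_source):
        found = tables[table]  # records the table even for "SELECT *"
        for column in columns.split(","):
            name = column.strip().strip("`")
            if name and name != "*":
                found.add(name)
    return {table: sorted(cols) for table, cols in tables.items()}
```

The output is only a starting point. Column types still have to be guessed from how the PHP code uses each value, and a SELECT * tells you the table exists but nothing about its columns.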

Paths and Other Issues

With the .htaccess and MySQL database issues solved, a few other issues cropped up, mainly URL paths missing a leading slash. Without it, a link is resolved relative to the current page, so when the build script ran it ended up in a loop, generating folders deeper and deeper!
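The loop is easier to see with URL resolution spelled out. The sketch below uses Python's standard urljoin to show what a crawler does with a link that lacks the leading slash, plus a small hypothetical helper, needs_leading_slash, for flagging such links. The localhost URLs are placeholders.

```python
from urllib.parse import urljoin, urlparse


def needs_leading_slash(href):
    """True for links such as 'about/' that resolve against the current folder."""
    parsed = urlparse(href)
    if parsed.scheme or parsed.netloc:
        return False  # absolute (http://...) or protocol-relative (//host/...) URL
    if not parsed.path:
        return False  # fragment-only or query-only link such as '#top'
    return not parsed.path.startswith("/")


# A page at /about/ that links to "about/" points one folder deeper each time.
page = "http://localhost/about/"
deeper = urljoin(page, "about/")    # http://localhost/about/about/
fixed = urljoin(page, "/about/")    # http://localhost/about/
```

Every page generated from the broken link contains the same broken link, which is why the build kept creating nested folders until the paths were fixed.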

One website complained about “no default timezone set”, which was easily fixed by adding a call to date_default_timezone_set.

A bizarre issue was that GitHub Pages didn't seem to support files whose names contain [ or ] characters. This was how one project named its images so, although they worked locally and in the build folder, they showed the GitHub Pages 404 page when deployed. I had to rename these files, and the references to them in the codebase, and they then worked correctly.
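With more than a handful of files, the renaming could be scripted. Here is a minimal sketch, assuming the references appear in the source as the literal file name (not URL-encoded as %5B and %5D), that the text files are UTF-8, and that no renamed file collides with an existing one. The replacement scheme in safe_name and the list of file suffixes are my own choices for illustration.

```python
from pathlib import Path


def safe_name(name):
    """Turn 'photo[1].jpg' into 'photo-1.jpg'."""
    return name.replace("[", "-").replace("]", "")


def rename_bracketed_files(root, text_suffixes=(".php", ".html", ".css")):
    """Rename files containing [ or ] and update literal references to them."""
    root = Path(root)
    renames = {}
    # Materialise the listing first so renaming does not disturb the walk.
    for path in sorted(root.rglob("*")):
        if path.is_file() and ("[" in path.name or "]" in path.name):
            target = path.with_name(safe_name(path.name))
            path.rename(target)
            renames[path.name] = target.name
    # Rewrite references in the text files of the project.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in text_suffixes:
            text = path.read_text(encoding="utf-8")
            updated = text
            for old, new in renames.items():
                updated = updated.replace(old, new)
            if updated != text:
                path.write_text(updated, encoding="utf-8")
    return renames
```

Running it on a copy of the project first is the safer route, since the renames happen in place.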