Recovery on a salvaged PostgreSQL cluster with missing files

last updated: 2024-07-29

[Cover art: a distorted bright background with a blooming PostgreSQL logo in front]

Hi! It's been a while. This post has been in my drafts since November of last year. I finally decided to finish it up, since the information presented here was quite hard to track down.

As some of you may be aware, my main server suffered an SSD failure on October 14th, 2023, with no backups available. This SSD held the data for many of the services I host, including my Akkoma instance, the mint.lgbt Matrix and XMPP servers, and even this blog! Thankfully, with the help of the donors now listed on the Coral Castle homepage, I sent it to a local Chilean data recovery service called Nekiori, located in the Los Leones commercial neighborhood of the Providencia district. They were successful, managing to recover around 300 GB of data, and their communication and pricing were excellent.

However, no data recovery job is perfect, and in many cases you'll still end up with a few files missing. This is what happened with the salvaged PostgreSQL cluster, and I'll explain how I was able to get a huge chunk of it back within about 2 hours.

Starting up the salvaged cluster

Before beginning, MAKE A COPY OF THE ORIGINAL DATA DIRECTORY. We'll be making lots of destructive changes, so, if anything goes wrong, you'll want a checkpoint you can restart from. I know duplicating clusters is time-consuming and can take up lots of space, but it's better to be safe than sorry (although I expect whoever has to go through this to have learned this valuable lesson already). Additionally, this guide only covers getting past PostgreSQL's errors, not regenerating any pages that were permanently lost.
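For instance, assuming the salvaged data lives at postgresql/15/data/ (the path used throughout the rest of this post), a plain copy will do:

$ cp -a postgresql/15/data postgresql/15/data.bak

The -a flag preserves ownership and permissions, which matters here: PostgreSQL refuses to start on a data directory with loose permissions.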

Let's start by installing PostgreSQL on our work environment, making sure to match the version the recovered cluster was originally running. On Alpine Linux, for instance, if the cluster was on 15.x, we can install the matching package by running:

# apk add postgresql15
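(On Debian-based systems, the rough equivalent would be "apt install postgresql-15", assuming PostgreSQL 15 is available in your release's repositories.)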

Once installed, we'll skip the daemons provided by the package and instead run the server straight from the CLI, in order to have better control over each parameter. I'd recommend using a multiplexer like Byobu, so you can view logs easily and gracefully shut down the database when needed. Otherwise, if you're working on a full-blown desktop, you can use multiple terminals. Let's break down the following command:

$ postgres -k /tmp -p 5433 -D postgresql/15/data/
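-k /tmp places the server's Unix socket in /tmp, so we don't need root access to the default socket directory.
-p 5433 makes it listen on a non-default port, avoiding clashes with any PostgreSQL instance already running on 5432.
-D postgresql/15/data/ points to the salvaged data directory; adjust this path to wherever your working copy lives.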

Once we've gone over each parameter, let's run it!

...

All good? No? Don't worry. It's likely to fail on its first launch, complaining that it can't find a directory inside the data folder. If that's the case, create the missing directory (its name should be shown in the log) with:

$ mkdir [data]/[dir]

Then, try to execute the server again. Rinse, repeat, and eventually the server should begin running without a crash.
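If you'd rather not eyeball the log on every attempt, something like this can pull the offending directory name out of the failure message. The exact wording of the error varies between versions, so treat this as a sketch:

$ postgres -k /tmp -p 5433 -D postgresql/15/data/ 2>&1 | grep -oE 'directory "[^"]+"'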

Locate a database

Now, since this cluster is in an unstable state, it shouldn't be used for production. Instead, we'll locate the most relevant databases we want to keep, so we can later dump them. Enter the "psql" shell with the following command:

$ psql -h /tmp -p 5433 postgres postgres

Again, here's the breakdown:
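-h /tmp: since the value starts with a slash, psql treats it as the directory containing the Unix socket, matching the -k flag we passed to the server.
-p 5433: the same non-default port as before.
The first "postgres" is the database to connect to, and the second "postgres" is the user to connect as.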

If all went well, you should now be staring at a screen like this:

psql (15.6)
Type "help" for help.

postgres=#

Now, to list our available databases, we need to run the following command:

postgres=# \l

After pressing Enter, we should see a table of every database present in the cluster. Write down the names of the databases you want to preserve, as shown in the "Name" column. If you want to explore the contents of a database, you can execute a series of commands like these:

postgres=# \c shop -- connects to the database "shop"
shop=# \dt -- lists all tables in the database
shop=# SELECT * FROM orders; -- displays all rows in the table "orders"
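Keep in mind that queries touching a lost page can error out. A quick row count over the same (hypothetical) "orders" table is a cheap way to gauge how much of it survived, though it can fail for the same reason:

shop=# SELECT count(*) FROM orders;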

Once we've taken note of all the databases we want to preserve, we can exit psql by running:

postgres=# \q

Dump away!

To ease this process, I've created a script that reads the output of the "pg_dump" command and creates blank pages in the (likely) case that some can't be found. We can download it from my GitHub Gist using the following command:

$ curl -LO https://gist.githubusercontent.com/otoayana/7623d05b0b60c71160c37771398bfcaf/raw/ada88e9ad936e317ba19787f8886e24e2c96b123/pg_taprec

Once downloaded, we'll make it executable using "chmod", and then we'll need to open it with a text editor. For the sake of simplicity, I'll use "nano" as an example.

$ chmod +x pg_taprec
$ nano pg_taprec

Inside this file, we'll need to modify the following variables:
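The exact set is best checked against the script itself, but at minimum you'll want to set the "DB" variable to the name of the database to dump (we'll be changing it again for each subsequent database), and make sure any connection settings match the socket directory (/tmp) and port (5433) the server has been running with.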

Once we've edited the file, let's execute it, making sure to specify the final output path as its argument:

$ ./pg_taprec [file]

This process may take a while, as it needs to retry dumping whenever it encounters a missing page. In the meantime, go prepare whatever hot drink you like, and maybe read another article on this blog.

Once it's done, rinse and repeat: change the "DB" variable to the next database we want to dump, and change the first argument in the command to a different path; otherwise we'd overwrite the previous dump.

Shut down, then import

Once we've finished dumping every chosen database, let's close the terminal running the server (or terminate the server using Ctrl+C). We can now delete the duplicated cluster, and keep the original somewhere safe, just in case we need it at a later time.
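(Ctrl+C sends the server a SIGINT, which triggers PostgreSQL's "fast" shutdown mode, so stopping it this way is clean. Running "pg_ctl -D postgresql/15/data/ stop" from another terminal does the same.)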

Let's copy the database dumps onto our new server, and install a fresh PostgreSQL server on it if one isn't present yet, as described at the beginning of this post.

Once the new cluster is running, let's create a new database corresponding to a dump, with an owner that can interact with it:

# su - postgres
$ createuser [owner]
$ createdb [database] -O [owner]

Finally, let's import the database dump into it. We can do so using the following command, making sure to replace "[database]" with the database's name and "[dump]" with the path where our dump is stored on the server:

$ psql [database] < [dump]
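Since the dump may reference rows or objects that were lost, psql can print errors as it goes; by default it keeps going regardless. If you'd rather have the import abort at the first failure, pass -v ON_ERROR_STOP=1:

$ psql -v ON_ERROR_STOP=1 [database] < [dump]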

Rinse and repeat for each dumped database.
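If there are many dumps, a small loop can save some typing. A minimal sketch, assuming each dump file is named after its database (adjust the path and owner to taste):

$ for dump in /path/to/dumps/*.sql; do db=$(basename "$dump" .sql); createdb "$db" -O [owner]; psql "$db" < "$dump"; done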

Conclusion

Congrats! You've managed to salvage what you could. However, it's always better to be prepared for an emergency such as this one: keep regular, automated backups of your databases (ideally off-site, with a service like Backblaze), and test your restores every once in a while, so you're never caught without a copy again.

Related

PostgreSQL's official documentation
Backblaze

