Borkware.com is hosted at Acorn
Hosting, a group located in Seattle that provides
reasonably priced hosting catering to the AOLserver and OpenACS developer crowd (of which I'm
one; Borkware.com is based on OpenACS). Since Acorn Hosting is aimed
at the developer, part of the Acorn Hosting terms of service is that
backups are the responsibility of the customer. The backup process I
was using was pretty manual: log in, run a script to
collect the data, then scp the result down to a local machine to
archive to a tape or CD.
In December of 2003, one of Acorn's upstream providers decided to suddenly flake out, close up shop, and pull the physical computers that contained this site (and others) out of the data centers, leaving us needing to move to a new server and restore from backups, which unfortunately were a couple of months old. Luckily the Acorn Hosting crew moved heaven and earth and got us back up and running with recent data, and I am thankful to them for expending the effort to do that.
After narrowly averting disaster, it was time to figure out a reliable
and automated backup strategy. A wise friend pointed me to rsync, a remote file
synchronization tool, as well as a page on easy
automated snapshot-style backups with Linux and rsync. And it did
look pretty easy.
What follows here is how I have things set up between the website
computer (running a flavor of Linux) and my home systems (Mac OS X),
but the techniques should be applicable to any unix system. This is
just a simple "synchronize directories across two machines" kind of
setup, not the more complicated snapshot backups referenced in the
above link. What I end up with is a picture of the disk contents (as of
4 in the morning) saved to my home computer, which I can then archive
to tape or a CD.
rsync keeps two directory trees synchronized by sending
the differences between files across the network. This means that the
initial synchronization will pull all of the data down but subsequent
synchronizations will only send across the changes that have been
made. Very little information is transferred if the majority of the
data is unchanged, as it is with my websites. This is perfect both for
doing backups and for staying under my bandwidth quota, which I would
exceed if I moved all of the files across the network
every time.
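If you're curious how little actually moves once the first sync has happened, rsync can show you without transferring anything. This is just an illustrative invocation with placeholder names, not part of my setup:

# -a archive mode, -v list the files that would be sent, -n dry run (change nothing),
# --stats print a summary of how much data would actually cross the wire.
rsync -avn --stats -e ssh user@remotehost:/path/to/data/ localdir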
Most files, like the website pageroots, can be rsynced
"in-place", meaning rsync will just pull down the files
from where they sit. There is other data that is not as easily
accessed, such as database contents (which need to be exported first),
and some privileged configuration files that have to be read as root.
This data will need to be put into a different form before being
downloaded.
The pageroots, the CVS repository, and everything in my $HOME directory
can be pulled straight down with an rsync command on the local
system. The other two kinds of data, the database contents and the
root-owned files, need some work first.
I use the Postgresql relational
database as the back-end of my websites, along with using it for other
random database needs. (It's an awesome relational database; if you
want a cheap, reliable, and well-designed alternative to Oracle, check
it out.) The pg_dumpall command will do a full,
consistent export of the database as an SQL script that can be fed to
psql on another system to restore the database as it was
at the time of the export. This isn't quite as nice as Oracle's
point-in-time recovery (since I can lose up to 24 hours of database
activity), but for the minimal amount of work involved, and for
general disaster recovery, it's not bad.
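For the record, restoring one of these dumps onto a fresh Postgresql installation looks roughly like this; the superuser name and the dump file name here are placeholders:

# pg_dumpall writes plain SQL, so psql can simply replay it. Run it as a database
# superuser against an existing database (template1 always exists); the dump itself
# contains the CREATE DATABASE and \connect statements.
psql -U postgres -f pg-dump.sql template1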
This is my daily-backup script, which I cleverly call
pg-daily-dump.sh:
#!/bin/sh
# Make sure pg_dumpall can find the Postgresql client libraries.
LD_LIBRARY_PATH=/usr/local/pgsql/lib:/usr/local/lib
export LD_LIBRARY_PATH
# Name the dump after today's date, e.g. pg-31-Dec-2003.dmp.
DATENAME=`date +"%d-%b-%Y"`
BASENAME="/home/bork/db-backup/pg-${DATENAME}.dmp"
# Export every database, then compress the result.
pg_dumpall -U webuser > ${BASENAME}
gzip ${BASENAME}
This can probably be turned into a nice little one-liner, but I find
this easy enough to understand. It puts the dump files with the name
pattern "pg-day-month-year.dmp.gz" (specifically like
pg-31-Dec-2003.dmp.gz) into a directory in my
$HOME. The webuser name is whatever
username owns the databases you're interested in, which is usually the
same as the username that the webservers use to access the database. By
putting the database dumps into my $HOME they'll be
pulled down when I rsync my home directory.
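For what it's worth, the one-liner version would look something like this. It's just a sketch; it still assumes LD_LIBRARY_PATH is set so pg_dumpall can find its libraries:

# Dump and compress in one pipeline, using the same date-based file name.
pg_dumpall -U webuser | gzip > /home/bork/db-backup/pg-`date +"%d-%b-%Y"`.dmp.gz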
Some folks like the "YYYY-MM-DD" format, since it sorts nicely with
ls. Use date +"%Y-%m-%d" in the
DATENAME part above to get this date format.
I run this script at 1:45 every morning via my crontab
(not as root):
# run at 1:45 A.M.
45 1 * * * /home/bork/pg-daily-dump.sh
I use qmail as my incoming
mail handler and to provide some forwarding services for some of the
domains I run. qmail is pretty paranoid about ownership
and permissions, and because of this I can't read many of the files as
a regular user to back them up. So I have root do it. Root
can make a tarfile and drop it into my home directory. Like the
Postgresql dump, this will be pulled down to the local machine when my
$HOME is rsynced.
This is in root's crontab:
# run at 3:05 A.M.
5 3 * * * /bin/tar zcf /home/bork/root-backup/qmail.tar.gz /var/qmail

This just tars up /var/qmail, containing the binaries,
configurations, and mailboxes, and sticks the archive into my
$HOME.
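Restoring from that tarball is just the reverse; a sketch, assuming GNU tar (which strips the leading / when it makes the archive, so the member paths are relative):

# As root on the replacement machine, unpack the archive back into place under /var/qmail.
cd /
tar zxf /home/bork/root-backup/qmail.tar.gz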
I also use DJB's daemontools for keeping
services up and running. The run scripts are owned by
root and unreadable by the rest of the world, so something similar is
done to back them up:
# run at 3:30 A.M.
30 3 * * * /bin/tar zcf /home/bork/root-backup/service.tar.gz /service /service/*/run
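A quick sanity check after the first run, just to confirm the run scripts actually made it into the archive:

# List the archive contents and pick out the run scripts.
tar ztf /home/bork/root-backup/service.tar.gz | grep '/run$'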
This tarball also lands in my $HOME. Now it's time to pull the files down.
On my iLamp here at home, I have a crontab entry that runs at 4 A.M. as an ordinary user:

# run at 4 A.M.
0 4 * * * /Users/bork/rsync-borkware.sh

And the rsync-borkware.sh script is just:

#!/bin/sh
cd /Users/bork/rsync-backup
rsync -a -e ssh bork@borkware.com:/home/bork/ home
rsync -a -e ssh bork@borkware.com:/usr/local/cvsroot/ cvs
rsync -a -e ssh bork@borkware.com:/var/lib/aolserver/ web

The
-a parameter performs a recursive synchronization
(directories and subdirectories) and preserves symbolic links,
permissions, modification times, and (where permissions allow) user
and group ownership. The
-e parameter specifies the mechanism for getting to the
remote system. Here I say to use ssh, the secure shell,
to provide an encrypted way of moving the data.
bork@borkware.com: is the machine, and the user name on
that machine to use for logging in. The full paths there are the
directories on the remote machine I want to synchronize. The final
argument is the directory (inside of my
$HOME/rsync-backup) on the local machine to put the
synchronized files.
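One rsync detail worth calling out, since it's easy to miss: the trailing slash on the remote paths matters.

# With the trailing slash, the contents of /home/bork land directly inside ./home:
rsync -a -e ssh bork@borkware.com:/home/bork/ home
# Without it, rsync would create ./home/bork and put everything one level deeper:
rsync -a -e ssh bork@borkware.com:/home/bork home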
If you just set things up like this and let it rip, it won't work.
The rsync via ssh will prompt for a
password, which will break when run via cron.
ssh can be configured to use public/private key
authentication for passwordless login. In short, you'll do something
like this:
% ssh-keygen -t rsa
% scp ~/.ssh/id_rsa.pub borkware.com:

Then, on borkware.com, move the public key into place as
~/.ssh/authorized_keys (or append it if that file already exists):

% mv ~/id_rsa.pub ~/.ssh/authorized_keys
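Before wiring this into cron, it's worth a quick test from the local machine to make sure the key is actually being used (this assumes the key was generated with an empty passphrase, since cron can't type one for you):

# Should print the remote machine's date without prompting for a password.
ssh bork@borkware.com date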
From there it was just a matter of setting up the scripts and
crontab entries, and then another day to test and iron
out problems.