Borkware.com is hosted at Acorn
Hosting, a group located in Seattle that provides
reasonably priced hosting catering to the AOLserver and OpenACS developer crowd (of which I'm
one; Borkware.com is based on OpenACS). Since Acorn Hosting is aimed
at the developer, part of the Acorn Hosting terms of service is that
backups are the responsibility of the customer. The backup process I
was using was pretty manual: log in, run a script to
collect the data, then scp the result down to a local machine to
archive to a tape or CD.
In December of 2003, one of Acorn's upstream providers decided to suddenly flake out, close up shop, and pull the physical computers that contained this site (and others) out of the data centers, leaving us needing to move to a new server and restore from backups, which unfortunately were a couple of months old. Luckily the Acorn Hosting crew moved heaven and earth and got us back up and running with recent data, and I am thankful to them for expending the effort to do that.
After narrowly averting disaster, it was time to figure out a reliable
and automated backup strategy. A wise friend pointed me to rsync, a remote file
synchronization tool, as well as a page on easy
automated snapshot-style backups with Linux and rsync. And it did
look pretty easy.
What follows here is how I have things set up between the website
computer (running a flavor of Linux) and my home systems (Mac OS X),
but the techniques should be applicable to any unix system. This is
just a simple "synchronize directories across two machines" kind of
setup, not the more complicated snapshot backups referenced in the
above link. What I end up with is a picture of the disk contents (as of
4 in the morning) saved to my home computer, which I can then archive
to tape or a CD.
rsync keeps two directory trees synchronized by sending
the differences between files across the network. This means that the
initial synchronization will pull all of the data down but subsequent
synchronizations will only send across the changes that have been
made. Very little information is transferred if the majority of the
data is unchanged, as it is with my websites. This is perfect both for
doing backups and for staying under my bandwidth quota, which I would
exceed if I moved all of the files across the network
every time.
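If you're curious how little actually moves once the first sync has happened, rsync can show you without transferring anything. This is just an illustrative invocation with placeholder names, not part of my setup:

# -a archive mode, -v list the files that would be sent, -n dry run (change nothing),
# --stats print a summary of how much data would actually cross the wire.
rsync -avn --stats -e ssh user@remotehost:/path/to/data/ localdir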
Most files, like the website pageroots, can be rsynced
"in-place", meaning rsync will just pull down the files
from where they sit. There is other data that is not as easily
accessed, such as database contents (which need to be exported first),
and some privileged configuration files that have to be read as root.
This data will need to be put into a different form before being
downloaded.
The pageroots, the CVS repository, and everything in my $HOME directory
can be pulled straight down with an rsync command on the local
system. The other two kinds of data, the database contents and the
root-owned files, need some work first.
I use the Postgresql relational
database as the back-end of my websites, along with using it for other
random database needs. (It's an awesome relational database; if you
want a cheap, reliable, and well-designed alternative to Oracle, check
it out.) The pg_dumpall command will do a full,
consistent export of the database as an SQL script that can be fed to
psql on another system to restore the database as it was
at the time of the export. This isn't quite as nice as Oracle's
point-in-time recovery (since I can lose up to 24 hours of database
activity), but for the minimal amount of work involved, and for
general disaster recovery, it's not bad.
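For the record, restoring one of these dumps onto a fresh Postgresql installation looks roughly like this; the superuser name and the dump file name here are placeholders:

# pg_dumpall writes plain SQL, so psql can simply replay it. Run it as a database
# superuser against an existing database (template1 always exists); the dump itself
# contains the CREATE DATABASE and \connect statements.
psql -U postgres -f pg-dump.sql template1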
This is my daily-backup script, which I cleverly call
pg-daily-dump.sh:
#!/bin/sh
# Make sure pg_dumpall can find the Postgresql client libraries.
LD_LIBRARY_PATH=/usr/local/pgsql/lib:/usr/local/lib
export LD_LIBRARY_PATH
# Name the dump after today's date, e.g. pg-31-Dec-2003.dmp.
DATENAME=`date +"%d-%b-%Y"`
BASENAME="/home/bork/db-backup/pg-${DATENAME}.dmp"
# Export every database, then compress the result.
pg_dumpall -U webuser > ${BASENAME}
gzip ${BASENAME}
This can probably be turned into a nice little one-liner, but I find
this easy enough to understand. It puts the dump files with the name
pattern "pg-day-month-year.dmp.gz" (specifically like
pg-31-Dec-2003.dmp.gz) into a directory in my
$HOME. The webuser name is whatever
username owns the databases you're interested in, which is usually the
same as the username that the webservers use to access the database. By
putting the database dumps into my $HOME they'll be
pulled down when I rsync my home directory.
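For what it's worth, the one-liner version would look something like this. It's just a sketch; it still assumes LD_LIBRARY_PATH is set so pg_dumpall can find its libraries:

# Dump and compress in one pipeline, using the same date-based file name.
pg_dumpall -U webuser | gzip > /home/bork/db-backup/pg-`date +"%d-%b-%Y"`.dmp.gz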
Some folks like the "YYYY-MM-DD" format, since it sorts nicely with
ls. Use date +"%Y-%m-%d" in the
DATENAME part above to get this date format.
I run this script at 1:45 every morning via my crontab
(not as root):
# run at 1:45 A.M.
45 1 * * * /home/bork/pg-daily-dump.sh
I use qmail as my incoming
mail handler and to provide some forwarding services for some of the
domains I run. qmail is pretty paranoid about ownership
and permissions, and because of this I can't read many of the files as
a regular user to back them up. So I have root do it. Root
can make a tarfile and drop it into my home directory. Like the
Postgresql dump, this will be pulled down to the local machine when my
$HOME is rsynced.
This is in root's crontab:
# run at 3:05 A.M.
5 3 * * * /bin/tar zcf /home/bork/root-backup/qmail.tar.gz /var/qmail

This just tars up /var/qmail, containing the binaries,
configurations, and mailboxes, and sticks the archive into my
$HOME.
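Restoring from that tarball is just the reverse; a sketch, assuming GNU tar (which strips the leading / when it makes the archive, so the member paths are relative):

# As root on the replacement machine, unpack the archive back into place under /var/qmail.
cd /
tar zxf /home/bork/root-backup/qmail.tar.gz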
I also use DJB's daemontools for keeping
services up and running. The run scripts are owned by
root and unreadable by the rest of the world, so something similar is
done to back them up:
# run at 3:30 A.M.
30 3 * * * /bin/tar zcf /home/bork/root-backup/service.tar.gz /service /service/*/run
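A quick sanity check after the first run, just to confirm the run scripts actually made it into the archive:

# List the archive contents and pick out the run scripts.
tar ztf /home/bork/root-backup/service.tar.gz | grep '/run$'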
This tarball also lands in my $HOME. Now it's time to pull the files down.
On my iLamp here at home, I have a crontab entry that runs at 4 A.M. as an ordinary user:

# run at 4 A.M.
0 4 * * * /Users/bork/rsync-borkware.sh

And the rsync-borkware.sh script is just:

#!/bin/sh
cd /Users/bork/rsync-backup
rsync -a -e ssh bork@borkware.com:/home/bork/ home
rsync -a -e ssh bork@borkware.com:/usr/local/cvsroot/ cvs
rsync -a -e ssh bork@borkware.com:/var/lib/aolserver/ web

The
-a parameter performs a recursive synchronization
(directories and subdirectories) and preserves symbolic links,
permissions, modification times, and (where permissions allow) user
and group ownership. The
-e parameter specifies the mechanism for getting to the
remote system. Here I say to use ssh, the secure shell,
to provide an encrypted way of moving the data.
bork@borkware.com: is the machine, and the user name on
that machine to use for logging in. The full paths there are the
directories on the remote machine I want to synchronize. The final
argument is the directory (inside of my
$HOME/rsync-backup) on the local machine to put the
synchronized files.
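One rsync detail worth calling out, since it's easy to miss: the trailing slash on the remote paths matters.

# With the trailing slash, the contents of /home/bork land directly inside ./home:
rsync -a -e ssh bork@borkware.com:/home/bork/ home
# Without it, rsync would create ./home/bork and put everything one level deeper:
rsync -a -e ssh bork@borkware.com:/home/bork home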
If you just set things up like this and let it rip, it won't work.
The rsync via ssh will prompt for a
password, which will break when run via cron.
ssh can be configured to use public/private key
authentication for passwordless login. In short, you'll do something
like this:
% ssh-keygen -t rsa
% scp ~/.ssh/id_rsa.pub borkware.com:

Then, on borkware.com, move the public key into place as
~/.ssh/authorized_keys (or append it if that file already exists):

% mv ~/id_rsa.pub ~/.ssh/authorized_keys
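Before wiring this into cron, it's worth a quick test from the local machine to make sure the key is actually being used (this assumes the key was generated with an empty passphrase, since cron can't type one for you):

# Should print the remote machine's date without prompting for a password.
ssh bork@borkware.com date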
From there it was just a matter of setting up the scripts and
crontab entries, and then another day to test and iron
out problems.