Duplicity Backup to S3 with Raspberry PI
After my search for a durable online backup, the conclusion was I need to encrypt my files and backup them to S3. So how do I do that? There are probably a couple of ways to do it but a search for “encrypted backup s3”, in Google, quickly shows results for duplicity.
By coincidence, my Raspberry Pi B+ has just arrived and as it operates with a power of around 1.5 Watts (with an wifi dongle) it is ideal to leave running doing the backups. At least while I don’t need it as a media center, of course. So lets configure a backup system with the Pi (for short), duplicity and S3 as the storage.
Base image
First we need to choose a base system for the Pi. If we also want a media server to occupy the cold nights then Raspbmc seems pretty good. Besides being a debian and having XBCM pre-configured it has the added bonus of allowing to pre-configure the network settings, which makes the first boot work even through wifi.
We start by downloading and run the installer which then, in turn, downloads the latest Raspbmc network image and writes it to the micro USB card. After that, we just plug the Pi to the TV, plug it to the power and watch it boot. The first boot takes a while but it is possible to monitor it’s progress on the TV. Just be sure that the DVI is connected when the Pi starts because connecting the DVI cable afterwards didn’t always work.
After a few restarts we get a configured XBMC. Tada! Anyway, today we’re trying to configure a backup system so let’s move on.
First run with Duplicity
The Raspbmc system provides an SSH server and we can connect with the user “pi” and the password “raspberry”, by default. The user “pi” is also a sudoer so installing duplicity with support for S3 backups is rather easy.
1 | $ sudo apt-get install duplicity python-boto |
After the installation is complete we can start a backup right away. Lets create some folders and do a first run.
1 | $ mkdir src |
We can see that, by default, duplicity asks for a password and encrypts the files stored in the target location. In order to have an automated backup system it needs to be non-interactive. That can achieved easily with the use of an environment variable.
1 | $ PASSPHRASE=testing duplicity src file://dst |
Also, since duplicity uses GnuPG, the encryption is done with the CAST5 cipher which is the default algorithm. That can be changed to AES256 by passing extra options to GPG.
1 | $ PASSPHRASE=testing duplicity --gpg-options --cipher-algo=AES256 src file://dst |
We already installed support for the Amazon storage – with python-boto
– so
the file destination can be changed to an S3 URL and everything works as
before. A typical S3 URL has the form
s3://s3-<region>.amazonaws.com/<bucket>/<path>
but there are other forms you
can check in the S3 Documentation. Using the S3 backend requires us to
setup the authentication and pass the Amazon’s AWS access key id and secret
key. This can be done with the AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
environment variables.
Scripting duplicity
In order to control the amount of options a script comes in handy. Also, having a script allows us to run the backup easily and control more aspects of the backup. Lets start by putting the basics in a Bash script and accumulate logging in a file.
1 |
|
The options were expanded, for example, to specify a local cache directory with
the --archive-dir
option. The cache directory is where duplicity keeps the
non-encrypted control files (manifest and sigtar). Keeping the cache next to the
script is a great way to keep track of progress and correct possible problems.
Scheduled Backup
Now that we have a script, a scheduled backup would be great. Cron is the
obvious choice but apparently it is not enabled by default in Raspbmc. The
easiest way to enable Cron is through the XBMC UI under Programs -> Raspbmc
Settings -> System Configuration -> Service Management -> Cronjob
Scheduler. After Cron is enabled we can create a file under /etc/cron.d
with
the commands we want to schedule. Note that any file placed there will be
processed after 1 minute which means that changes and new commands will be in
effect 2 minutes later.
1 | # min hour day month weekday user command |
This will run the duplicity script at 9:15 PM of every Monday, that is, it will do an incremental backup every week (after the first full backup, off course). Any output from cron scripts are mailed to the local user and you can use that. Nevertheless, the previous script has no output because duplicity’s output is redirect to a log file. We we must take care of sending email explicitly but, then again, we have more control of the email that is sent.
Configuring Email
The idea is to have an email, at the end of the backup, indicating if everything happened has expected or not. That can be accomplished with the subject line which means we can use the body of the email for the full output of the backup, for example.
The first thing we need to do is install the missing software. There are a few options but one of the simplest and most common is sSMTP.
1 | $ sudo apt-get install ssmtp mailutils |
We need to configure sSMTP to send email through some server. To use GMail, for
example, we need to change the configuration file in /etc/ssmtp/ssmtp.conf
.
1 | # |
We also need to change /etc/ssmtp/revaliases
to give a valid address to the
local users root
(running cron and other services) and pi
running the backup
with duplicity (and XBMC). The revaliases file is pretty simple.
1 | root:your.account@gmail.com:smtp.gmail.com:587 |
There are several sources online that can help fine-tuning the configuration but
this is enough to be able to send emails. To test the configuration we can use
the mail
command.
1 | $ echo Hello | mail -s "World of Mail" "your.account@gmail.com" |
If you received the email then we can change the backup script and add a mail notification at the end of it. If the backup had no problems we include “OK” in the subject. Otherwise we add “ERR” to make it clear that something did not work as expected.
1 |
|
You can test it from the command line. To make things a little more interesting we can increase the verbosity of the backup.
1 | $ OPTIONS="-v info" ./duplicity.sh |
While it is running you can see that the script creates a log file based on the PID. At the end of it the mail is sent, that log file is added to the main log and then removed.
Rotating the Log
Although the amount of logging is small, by default, in order to be able to
leave the Rasberry Pi completely alone we still need to ensure the log does not
grow forever. Rotating the log is a matter of creating a new file under
/etc/logrotate.d
like the following.
1 | /home/pi/backup/duplicity.log { |
We can check logrotate man page and adjust the directives to our liking but anything that ensures a constant overall size of the log files is OK.
Real Life
All of the above provides a reasonably reliable and secure backup, in theory. In reality not everything works as expected with duplicity.
In my particular case the source files are in a NAS that is mounted locally through CIFS. The NAS goes to sleep at midnight and, when the backup is still running, duplicity fails. That was expected and should not be a problem because duplicity has support resume a backup. Nevertheless it so happens that version 0.6.18 (included in Rasbian Wheezy) has a bug and the restart is not so great. For example, after a full backup of 200GB, which took a couple of days and experienced a few restarts, the next incremental backup uploaded almost 100GB. Not a good signal, since no file was changed in the source location.
So, next stop, upgrade. There is a version 0.6.24 in the Raspbian repositories
under the Jessie distribution. Instead of adding the distribution to the
sources.list
I’ve just downloaded the package and installed it directly. All
the dependencies are satisfied so no problem there. Using that new version I’ve
run a few tests just to find out that, although the restarts seem to work, this
version has another bug. When a system error is encountered duplicity does not
exit completely because a GPG process is left hanging and this forced me to kill
the process by hand.
This problem does not seemed known and there is not more recent package to try so hard luck. I found duplicity’s page at Launchpad and checked that the last source version (0.6.25) had no bug-fix for that error. That lead to the creation of and account and opening bug #1395341. While I was exploring Launchpad and the source code for duplicity, I decided to give it a try in fixing the bug. After learning the basics of Bazaar and installing the required dependencies, in a virtual machine, the bug turned out to be simple to spot and fix. This resulted in a patch, a branch, and a merge proposal which, by the way, was eventually accepted and is added to both 0.6 and 0.7 branches. Woo hoo!
Looking at the packages we can see that the next Ubuntu release will include
version 0.6.23 of duplicity which means it will take a while before 0.6.26
or 0.7 are included in a stable (or event testing) Debian release. The
alternative is to download the 0.6.25 tarball, apply the patch, install the
required build tools, build duplicity from source and use the custom version by
setting PATH
and PYTHONPATH
. It’s actually simpler than it looked at first
despite the fact you have to bring development tools into the system.
1 | $ sudo apt-get install gcc patch librsync-dev python-dev |
The deprecation warning was added recently to the 0.6 branch of duplicity. You
can always try the new 0.7.x versions. Nevertheless, if you define PATH
and
PYTHONPATH
as above at the start of the backup script the new patched
duplicity will be used.
Wrapping up
Despite all the problems in the last step, it’s obvious that with duplicity we can setup a simple, secure, and efficient backup system to the cloud. All the tools are readily available – including the development tools – and are easy to setup in the Raspberry Pi.
When you know what to do, setting up the backup from scratch would take less than 1 hour. But then, all the testing and modifications of the script take days or weeks. We can expand the script to allow restoring files and purging old backups from the remote storage. We can also change the encryption to use public key encryption and the GPG agent. The sky (or your shell script skills) is the limit. Projects like Duply, I guess, started much in the same way and now provides a very advanced script with a lot of extra features. It does not include custom mail notifications, though, an relies on cron emails.
It all depends on your needs. If you are trying to keep things simple or totally under your control you may want to start your script. Otherwise starting with a powerful foundation and grow from there may be appropriate. In all cases, though, you will need to read a bit of documentation because most configuration options are important. In the end, we want a backup that is simple, fast, secure and, most of all, that we can restore when we actually need it which by definition is the worst moment possible.