Automating Backups for Fun and Profit

At the end of last month, I mentioned that I had bought a new hard drive for the purposes of backing up my data. I just now installed it in my second computer. Was getting the backup system in place important? Yes, absolutely! I don’t think that a hard drive might fail. I know it WILL fail.

While I could just simply copy files from one drive to the other manually, it depends on me to do so regularly. I’m only human, and I can forget or get sick, resulting in potentially lost data. Computers are meant to do repetitive tasks really well, so why not automate the backup process? My presence won’t be necessary, so backups can take place even if I go on vacation for a few weeks or months. The backups can be scheduled to run when I won’t be using the computer. Copying lots of data while trying to write code, check email, and listen to music at the same time can be annoyingly slow, but if it happens while I am sleeping, it won’t affect my productivity at all. Also, while the computer can faithfully run the same steps each time successfully, I might mess up if I run the steps manually and in the wrong order. So with an automated system, my backups can be regularly recurring, convenient, and reliable. Much better than what I could do on my own week after week. I decided to make those concerns into my three goals for this system.

I’ll go over my plan, but first I’ll provide some background information.

Some Background Information
LauraGB is my main Debian GNU/Linux machine. MariaGB is currently my Windows machine. The reason for the female names is because I am more a computer enthusiast than a car enthusiast. People name their cars after women, so I thought it was appropriate to name my computers similarly. For the record, my car’s name is Caroline, but she doesn’t get nearly the same care as Laura or Maria. My initials make up the suffix, GB. I guess I am not very creative, as my blog, my old QBasic game review site, and my future shareware company will all have GB. Names can change of course, but now I see I am on a tangent, so let’s get back to the backup plan.

LauraGB has two hard drives. She can run on the 40GB drive, as it has the entire filesystem on it, but the 120GB drive acts as a huge repository for data that I would like to keep. Files like my reviews for Game Tunnel, the game downloads, my music collection, funny videos I’ve found online, etc. It’s a huge drive.

For the most part, I’ve simply had backup copies of data from the 40GB drive to the 120GB drive. I also had data from an old laptop on there. Then I started collecting files there, but they don’t have a second copy anywhere. If I lose that 120GB drive, I could recover some of the files, but there is no recovery for a LOT of data. Losing that drive spells doom.

At this point, MariaGB can become much more useful than its current role as my Games OS. With the new 160GB drive, I can now have at least two copies of any data I own.

The Backup System
I spent the past few weeks or months looking up information on automating backups. I wanted something simple and non-proprietary, and so I decided to go with standard tools like tar and cron. I found information on sites like About.com, IBM, and others. I’m also interested in automating builds for projects, and so I got a lot of ideas from the Creatures 3 project.

On Unix-based systems, there is a program called cron which allows you to automate certain programs to run at specific times. Each user on the system can create his/her own crontab file, and cron will check it and run the appropriate commands when specified. For example, here is a portion of my crontab file on LauraGB:

# Backup /home/gberardi every Monday, 5:30 AM
30 5 * * 1 /home/gberardi/Scripts/BackupLaura.sh

The first line is a comment, signified by the # at the beginning, which means that cron will ignore it. The second line is what cron reads. The first part tells cron when to run, and the second part tells cron what to run. The first part, according to the man page for crontab, is divided as follows:

field allowed values
—– ————–
minute 0-59
hour 0-23
day of month 1-31
month 1-12
day of week 0-7

So as you can see, I have told cron that my job will run at 5:30 AM on Mondays (30 5 * * 1). The asterisks basically tell cron that those entries do not matter.

The second part tells cron to run a Bash script I called BackupLaura.sh, which is where most of the work gets done.

Essentially, it gets the day of the month (1-31) and figures out which week of the month it is (1-5). There are five because it is possible to have five Mondays in a single month. Once it figures out which week it is, it then goes to my 120GB drive and removes the contents of the appropriate backup directory. I called them week1, week2, etc. It then copies all of the files from my home directory to the weekX directory using the rsync utility. I use rsync because using a standard copy utility would change the file data, resulting in files that look like they were all accessed at once. Rsync keeps the same permissions and date access times as they were before the backup.

So tomorrow at 5:30 AM, this script will run. As it will find that the date is 11 (April 11th), it knows that it is in the second week. So the directory week2 will be emptied, and then all files will be copied from my home directory to week2.

That’s all well and good, but you have probably noted that every week, the weekly backup erases the same week’s backup from the month before. When this script runs on May 9th, week2 will be erased, losing all of April’s 2nd week backup data! I’m ok with that.

Here’s why: every 1st Monday of the month, the script will make the week1 backup, but it will ALSO make a monthly backup. It takes week1 and runs another utility on it called tar. Multiple files can be combined into one giant file by using tar. The resulting file can also be compressed. Most people on Windows will create .zip files using similar utilities, but tar fits the Unix programming philosophy: a robust tool that does one thing really well.

Usually tar is used with utilities like gzip or bzip2, but sometimes compression is avoided for stability reasons in corporate environments. Compressing might save you a lot of space, but on the off chance that something goes wrong, you can lose data. And if that data is important enough, the risk isn’t worth the space savings.

In my case, I decided to use bzip2 since it compresses much better than gzip or LZW (what .zip files are compressed with). If I corrupt something, it isn’t the end of the world, since bzip2 has the ability to recover from such problems. So the script will take the directory week1 and compress it into a file with the date embedded in the name. The appropriate line in the script is:

tar -cvvjf $BACKUP_DIR_ROOT/monthly/LauraGBHomeBackup-$(date +%F).tar.bz2 $BACKUPDIR

The $BACKUPSOMETHING are variables in the Bash script that I had defined earlier, but they basically tell the script where to go in my filesystem. The file created is “LauraGBHomeBackup- “+ the date in YYYY-MM-DD format + .tar.bz2. The script runs date +%F and inserts the result in the filename. The file is placed in the directory called monthly. I can manually clean that directory out as necessary, and since the dates will all be unique, none of the monthly backups will get erased and overwritten like the weekly backups.

Conclusion
And so now the need to do backups has been automated. Every week, a copy of my home directory is synced to a second hard drive. Every month, a copy of the data will be compressed into a single file named by date to make it easy for me to search for older data. What’s more, cron will send me an email to let me know that the job ran and also tell me what the output was. If I forgot to mount the second hard drive for some reason and the script couldn’t copy files over, I’ll know about it Monday morning when I wake up.

Once MariaGB is up and running, I can configure the two systems to work together regarding the monthly backups. After the .tar.bz2 file is created, I can then copy it from LauraGB’s monthly directory over to MariaGB in a directory to be determined later. Of course, normally when I copy a file from one machine to another, I would need to manually enter the username and password. I can get around this by letting the two systems consider each other “trusted”, meaning that the connection between the two can be automated, which is consistent with one of the goals of this system. I’m very proud about what I have created, and I am also excited about what I can do in the future.

Currently LauraGB’s home directory has images, code, homework files, game downloads, and other files taking up 4GB of space, which clearly won’t fit on a CDROM uncompressed. To improve my backup system, I will need to purchases a DVD burner. I can place a blank CD in the drive and have the computer automatically burn the monthly backup when it occurs, giving me another copy of my data, one that I can then bring anywhere. Ideally, remote backups would complete my system, but I think I will use those only for specific data, such as my programming projects and business data when I get some. Losing my music files and pictures in a house fire aren’t going to be as big of a conern as the fire itself, but losing sales info, game code, and the customer database, essentially THE BUSINESS, would be something that I probably won’t easily recover from.

For now, I am mitigating the disaster of a failing hard drive, which is my main concern. If you do not have your own backup system in place, either something homemade like my setup or through some commercial product, you’re walking on thin ice. Remember, hard drive failure isn’t just a potential event. It’s a certainty. It WILL fail. Prepare accordingly.