Sunday, May 8, 2011

Bacula Disk Storage Maintenance

A little over a year ago, I set up Bacula to do backups for the various laptops, netbooks, desktops, and servers in our home network. A few months ago, the backup jobs started failing because the 500GB external drive that holds the disk-based backup media files filled up. The issue of not having recent backups was bothering me, but other things were bothering me more, so I hadn't done anything about it yet. Then something crashed, a file got corrupted, and there was no recent backup to restore, so getting the backups running again started bothering me enough to figure out how to fix it.

The problem is that Bacula is designed around a tape backup system where the physical media does not actually get destroyed, but is simply overwritten when it contains no unexpired backup files. The latest version of Bacula (version 5.0.1 or later) has a new directive that solves part of the problem by truncating the media file associated with a volume when its status changes to "Purged." However, upgrading Bacula seemed like it would take more effort, and it wouldn't solve the other part of the problem: I had set things up so that the media file name (label) includes the job name and the date/time, so I could tell what was in each file simply by listing the files on the disk. If those volumes are just truncated and "recycled," they end up holding data that doesn't match the media file name.

So before going on with an explanation of how I fixed it, here's a review of the relevant Bacula terms and concepts.

media-file - Instead of a physical tape, disk-based backups use a file on a disk (go figure), so instead of referring to these as media, I'll call them media-files. (Try not to get stuck on the singular vs. plural thing here.)

backup-job - in this context, this is the record in the Bacula catalog that represents a backup job instance that has already run, has an associated retention-period, and is associated with one or more "volumes."

backup-file - in this context, this is the record in the Bacula catalog database that represents a backup of a file, has an associated retention-period, and is associated with a "backup-job."

volume - in this context, this is not the actual media-file, but the record in the Bacula catalog database that keeps track of a media-file, links to the backup-jobs (and indirectly to the backup-files) for which the media-file contains data, and stores the volume's status, retention period, last-written date, etc.

pruning - The process of removing backup-job and backup-file records associated with a volume, once they're past their expiration date.

volume-delete - removing the volume record from the Bacula database. If the volume were associated with a physical tape, the tape could be reused, but if the volume is associated with a media-file on a disk, the media-file is NOT deleted.

media-file-delete - using an operating system command (rm) to remove the physical media-file from the disk. This should only be done AFTER the volume-delete has been completed.

volume-status - This isn't a complete list, but the relevant ones here are Full, Used, Append, and Purged. Volumes in one of the first three statuses are candidates for pruning. Volumes in the last status, Purged, are candidates for volume-delete.


So, what was the problem that caused the backup storage disk to fill up? Bacula doesn't automatically delete the media-files when the volume status changes to "Purged." Also, if volumes are not set to auto-recycle there is never an event that would auto-prune a volume and change its status to "Purged" anyway. Both of these things have to be done by a scheduled process (e.g. cron) outside of the Bacula daemons (or they could probably also be implemented as Python scripts that run within the Bacula director daemon, but I've set that ambition aside for another day.)

So then, what's the solution? There is no single solution of course, but mine was to write a shell script (BASH on Linux in this case) and schedule it as a cron job (weekly seems to be often enough). First it invokes the prune command on each volume in the Bacula catalog, which marks the volume-status as "Purged" if all of the backup-jobs (and backup-files) associated with the volume have expired. If not everything has expired, the prune command leaves the volume-status as it is. Then, for each volume that has a "Purged" status, the script does a volume-delete to get rid of the record in the Bacula catalog database, followed by a media-file-delete to free the space on the disk.
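
The ordering matters: the prune pass has to finish before the delete pass looks for "Purged" volumes. A minimal sketch of a wrapper that cron could call is shown here. The function name run_maintenance, the MAINT_DIR variable, the install path, and the wrapper file name in the crontab comment are all hypothetical; only the two script names come from this post.

```shell
#!/bin/sh
# Hypothetical wrapper (not one of the scripts in this post).
# Assumed location of the two maintenance scripts; adjust as needed.
maint_dir=${MAINT_DIR:-/usr/local/sbin}

# run_maintenance PRUNE_CMD DELETE_CMD
# Runs the prune step first and runs the delete step only if the
# prune step exited successfully.
run_maintenance() {
  "$1" || {
    echo "prune step failed; skipping delete step" >&2
    return 1
  }
  "$2"
}

# Example crontab line (weekly, Sunday at 03:00), assuming this file is
# saved as /usr/local/sbin/bacula-disk-maint.sh:
#   0 3 * * 0  /usr/local/sbin/bacula-disk-maint.sh
#
# Uncomment to do the real work when the wrapper is executed:
# run_maintenance "$maint_dir/prune-all-volumes.sh" \
#                 "$maint_dir/delete-purged-volumes.sh"
```

Stopping after a failed prune pass is a design choice in this sketch; running the delete pass anyway would only act on volumes that were already "Purged."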

Pseudocode for the script is:
  1. Invoke Bacula's bconsole list volumes command to list all volumes.
  2. Pipe the output through grep to filter down to the volume info lines that indicate a status of Used, Full, or Append.
  3. Extract the volume name from each line and invoke Bacula's bconsole prune volume={volume-name} command to cause the status of each eligible volume to change to "Purged."
  4. Invoke Bacula's bconsole list volumes command again to list all volumes
  5. Pipe the output through grep to filter down to the volume info lines that indicate a status of Purged.
  6. Extract the volume name from each line and...
    1. Invoke Bacula's bconsole delete volume={volume-name} command.
    2. Invoke the operating system's delete (rm) command to remove the media-file associated with the volume.
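
Steps 2 and 5 both boil down to "pick the volume names that have a given status out of the table that list volumes prints." As an alternative to grep, here is a small sketch that matches on the status column itself, so a volume whose name happens to contain the word "Purged" is not picked up by accident. The function name volumes_with_status is made up for this example, and it assumes the table layout used by the scripts in this post (MediaId, VolumeName, VolStatus as the first three columns, separated by "|").

```shell
#!/bin/sh
# volumes_with_status STATUS
# Reads a bconsole "list volumes" table on stdin and prints the name of
# every volume whose VolStatus column is exactly STATUS.
# Assumed column order: | MediaId | VolumeName | VolStatus | ...
volumes_with_status() {
  awk -F'|' -v want="$1" '
    NF >= 4 {
      name = $3
      status = $4
      # trim leading and trailing whitespace from both columns
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", status)
      if (status == want && name != "") print name
    }'
}
```

Usage would look something like `volumes_with_status Purged < saved-list-volumes-output.txt`, where the input file holds the captured bconsole output.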


Script 1: prune-all-volumes.sh

#!/bin/sh

temp_dir=/var/lib/bacula/maint_tmp

temp_file_all_volumes=`mktemp -p "$temp_dir"`
temp_file_volume_lines=`mktemp -p "$temp_dir"`
# clean up the temp files when the script exits
trap 'rm -f "$temp_file_all_volumes" "$temp_file_volume_lines"' EXIT

# use bacula's bconsole to list volumes, then run the prune command
# on each Full or Used volume. Bacula itself decides what has expired;
# the retention and last written date are only extracted for the
# progress message and for the optional (commented out) date filter.
/usr/sbin/bconsole > "$temp_file_all_volumes" <<END_OF_DATA
list volumes
quit
END_OF_DATA

# Note: only Full and Used volumes are selected; add "|\| Append" to the
# pattern to also prune volumes that are still open for appending.
grep -E "\| Full|\| Used" "$temp_file_all_volumes" > "$temp_file_volume_lines"
while read -r volume_info_line
do
  # Note: The sed part just trims leading and trailing whitespace
  # echo "line = '$volume_info_line'"
  volume_name=$(echo "$volume_info_line" | cut -d"|" -f3 | sed 's/^[ \t]*//;s/[ \t]*$//')
  retention=$(echo "$volume_info_line" | cut -d"|" -f8 | sed 's/^[ \t]*//;s/[ \t]*$//')
  last_written=$(echo "$volume_info_line" | cut -d"|" -f13 | sed 's/^[ \t]*//;s/[ \t]*$//')
  last_written_year=$(echo "$last_written" | cut -d"-" -f1)
  last_written_month=$(echo "$last_written" | cut -d"-" -f2)
  # Uncomment this "if" and the matching "fi" to prune only the
  # volumes last written in a given month:
  # if [ "$last_written_year" = "2010" ] && [ "$last_written_month" = "04" ]; then
  echo "volume_name = $volume_name, retention = $retention, last_written = $last_written, year=$last_written_year"
  # the here-document lines must stay unindented
  /usr/sbin/bconsole <<END_OF_DATA
list volume=$volume_name
prune volume=$volume_name
yes
list volume=$volume_name
quit
END_OF_DATA
  # fi
done < "$temp_file_volume_lines"
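
Script 1 extracts the retention and last-written values but only prints them. If you wanted a preview of which volumes look old enough before handing them to prune, a comparison along these lines might do. This is only a sketch: the function name volume_expired is made up, it assumes GNU date (for the -d option), and it assumes the retention column is a number of seconds, possibly printed with thousands separators. Bacula's own prune command still makes the real decision.

```shell
#!/bin/sh
# volume_expired LAST_WRITTEN RETENTION_SECONDS [NOW_EPOCH]
# Exit status 0 if LAST_WRITTEN plus the retention period is in the past,
# 1 if it is not, and 2 if LAST_WRITTEN cannot be parsed as a date.
# Assumes GNU date; NOW_EPOCH defaults to the current time.
volume_expired() {
  last_written=$1
  # strip thousands separators such as 31,536,000 (assumed display format)
  retention=$(printf '%s' "$2" | tr -d ',')
  now=${3:-$(date +%s)}
  written_epoch=$(date -d "$last_written" +%s) || return 2
  [ $((written_epoch + retention)) -le "$now" ]
}
```

Inside the loop in Script 1, this could be called as `volume_expired "$last_written" "$retention"` in place of the commented-out year/month test.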

Script 2: delete-purged-volumes.sh

#!/bin/sh

temp_dir=/var/lib/bacula/maint_tmp
log_dir=/var/lib/bacula/maint_logs
bacula_storage_device_dir=/mnt/wdmybook/bacula_data/storage_device_dir

datevar=`date +%Y-%m-%d-%H%M%S`
delete_script_log_file=$log_dir/delete-purged-volumes-$datevar.log
temp_file_all_volumes=`mktemp -p "$temp_dir"`
temp_file_purged_volume_names=`mktemp -p "$temp_dir"`
# clean up the temp files when the script exits
trap 'rm -f "$temp_file_all_volumes" "$temp_file_purged_volume_names"' EXIT

# use bacula's bconsole to list purged volumes
/usr/sbin/bconsole > "$temp_file_all_volumes" <<END_OF_DATA
list volumes
quit
END_OF_DATA
# match on the status column ("| Purged") rather than anywhere in the line;
# the volume name is the 4th whitespace-separated field of each table row
grep -E "\| Purged" "$temp_file_all_volumes" | awk '{print $4}' > "$temp_file_purged_volume_names"

echo "Delete all 'Purged' Bacula Volume records"
echo "and the corresponding disk files."
echo "Remember to run prune-all-volumes.sh first,"
echo "or this may not do very much."
echo "For results/messages, review log file: $delete_script_log_file"

while read -r volume_name
do
  # skip blank names so rm is never pointed at the storage directory itself
  [ -n "$volume_name" ] || continue
  echo '' >> "$delete_script_log_file"
  echo "Using bconsole to delete purged volume record $volume_name" >> "$delete_script_log_file"
  # log the volume record, delete it, then list it again to confirm it is gone
  /usr/sbin/bconsole >> "$delete_script_log_file" <<END_OF_DATA
list volume=$volume_name
quit
END_OF_DATA
  /usr/sbin/bconsole >> "$delete_script_log_file" <<END_OF_DATA
delete volume=$volume_name
yes
quit
END_OF_DATA
  /usr/sbin/bconsole >> "$delete_script_log_file" <<END_OF_DATA
list volume=$volume_name
quit
END_OF_DATA
  echo "removing physical file $bacula_storage_device_dir/$volume_name" >> "$delete_script_log_file"
  ls "$bacula_storage_device_dir/$volume_name" >> "$delete_script_log_file" 2>&1
  # send rm's error messages to the log too, so failures are not lost
  rm "$bacula_storage_device_dir/$volume_name" >> "$delete_script_log_file" 2>&1
  # this ls is expected to fail; its "No such file" message is the confirmation
  ls "$bacula_storage_device_dir/$volume_name" >> "$delete_script_log_file" 2>&1
done < "$temp_file_purged_volume_names"
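
Since the volume names in Script 2 come from parsing table output, it may be worth putting a guard in front of the rm. The sketch below refuses empty names and names containing a slash, and only removes regular files. The function name remove_media_file is hypothetical, and the checks reflect one guess at what "suspicious" means here, not anything Bacula requires.

```shell
#!/bin/sh
# remove_media_file STORAGE_DIR VOLUME_NAME
# Removes STORAGE_DIR/VOLUME_NAME only if the name is a plain file name
# (non-empty, no slashes, not "." or "..") and the target is a regular
# file. Returns non-zero without deleting anything otherwise.
remove_media_file() {
  storage_dir=$1
  volume_name=$2
  case $volume_name in
    ''|*/*|.|..)
      echo "refusing suspicious volume name: '$volume_name'" >&2
      return 1
      ;;
  esac
  media_file=$storage_dir/$volume_name
  if [ ! -f "$media_file" ]; then
    echo "no such media file: $media_file" >&2
    return 1
  fi
  rm -- "$media_file"
}
```

In Script 2 this would replace the bare rm line, for example `remove_media_file "$bacula_storage_device_dir" "$volume_name" >> "$delete_script_log_file" 2>&1`.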

1 comment:

iS said...

Great Post! Having the exact same issue - trying to set up a file-based backup solution with bacula where old backup volumes/files are automatically deleted.

It would be great if you could share your shell script and maybe your bacula configuration files as well.

Thank you,

iS