Tuesday, December 29, 2009

Hardware FakeRAID + LVM == A Big Mess

Every time I install a new distribution of Linux on newer hardware, I wonder just what new "features" will start brewing together in a little digital tempest until the universe of malicious luck senses that ideal, worst possible moment and twists the software around the hardware in a big convoluted knot with glue poured on it.

This time it was CentOS 5.x, a VIA brand hardware raid controller, and the Linux Volume Manager. Not knowing much about how Linux interacts with hardware RAID, I just set everything up at install time using the installer's suggested defaults. What could possibly go wrong? Doh! Never ask that question, even silently in your thoughts. You're just asking to get smacked in the head with the answer.

Then it came, while I was out of town and depending on this machine to grant me access to the rest of the computers in the house via ssh tunnels. The RAID1 mirrored volumes got out of sync, and the kernel exhibited its most endearing neurosis and reported its panic, leaving me locked out until I could get home and alleviate the poor thing's anxiety attack. It often isn't a very good idea to tell a kernel that everything's okay until you have actually fixed something, so that at least you can believe you're telling it the truth. So, on to the harder part of fixing something: figuring out what needs to be fixed.

In the days that followed, I managed to disable one of the drives in the mirrored pair and get things running again without the benefit of RAID. However, in the process, the Linux Volume Manager came to believe that since there were two physical volumes with the same UUID, it should mount and use one and hide the other. LVM hid the /dev/sdb2 partition in a secret house of mirrors in order to thwart every one of my attempts to obtain access to it, no matter how clever. Looking back, I doubt it anticipated that I'd open the case and boot up with only one drive's SATA cable plugged in, but I somehow managed to do that before LVM decided to delete all my data out of spite. The plan was to use pvchange --uuid on one of the drives so that LVM could tell them apart, but that doesn't work while the drive is in use. Rebooting in rescue mode from a CD/DVD (the CentOS 5.1 install disc) left the drive unmounted and allowed the pvchange --uuid command to do its thing.
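For anyone stuck in the same house of mirrors, the rescue-mode fix boiled down to something like this (a sketch using the device names from this machine; the `lvm` command prefix is how the LVM tools are typically invoked in the CentOS 5 rescue environment, and your partition will almost certainly differ):

```shell
# Boot the install disc with "linux rescue" and SKIP mounting the root
# filesystem, so that neither copy of the duplicated PV is in use.

# See which physical volumes LVM can find (expect a duplicate-UUID warning):
lvm pvscan

# Generate a fresh random UUID for the hidden duplicate PV so LVM can
# finally tell the two drives apart. /dev/sdb2 is this machine's partition.
lvm pvchange --uuid /dev/sdb2

# Confirm the two PVs now show distinct UUIDs:
lvm pvdisplay
```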

After many unfruitful searches on Google with every permutation of command names, error message fragments, and configuration file names I could muster, I tripped over a description of something called fake-RAID. All the symptoms of what happened to my Linux machine seemed to be telling me that that was the disease it had contracted. LVM reported a duplicate volume. The BIOS RAID1 mirrored pair still showed up to the OS as two individual devices. The rear fan made a sneezing noise. Okay, that last one probably wasn't related to the fake-RAID thing. The same article that described fake-RAID recommended against using hardware RAID on a disk controller that works that way, because the Linux drivers are an afterthought from the vendor at best, and most likely guesswork from the Linux kernel developers. Reboot, enter the BIOS RAID utility, delete the RAID1 array, verify that POST shows 2 standalone drives in normal non-RAID SATA mode, done.
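If you suspect your own controller has the same disease, one way to check is a sketch like this (dmraid is the tool CentOS 5-era distributions use for BIOS-managed fake-RAID metadata; it may not be installed by default):

```shell
# If dmraid can read vendor RAID metadata straight off the raw drives,
# the "hardware" RAID is really software in disguise (fake-RAID).
dmraid -r        # list raw block devices carrying BIOS RAID signatures
dmraid -s        # summarize any RAID sets dmraid recognizes

# A properly activated fake-RAID set would show up as a single device
# under /dev/mapper rather than as two bare SATA drives:
ls /dev/mapper
```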

I have to say it was hard to suppress the feeling of accomplishment that came from making the hardware RAID controller do nothing instead of doing half of something, but my euphoria was short-lived. I still had a river crossing ahead of me on the slippery stones that are Linux configuration files, and I didn't want to cross without at least the illusion of a safety rope. The rope couldn't be a worthy illusion if it were tied to a log floating in the river where the hazards lay, so I plugged in an external USB hard drive and used the dump command to make my "rope," which took about 8 hours at 5MB/sec to transfer the 144GB root filesystem that I would soon attempt, more than once, to destroy.
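The backup itself was nothing fancy; roughly this (a sketch that assumes the USB drive shows up as /dev/sdc1 with a filesystem already on it; the logical volume name is this machine's, and the dump/restore tools only make sense for ext2/ext3 filesystems):

```shell
# Mount the external USB drive somewhere out of the way.
mkdir -p /mnt/usbbackup
mount /dev/sdc1 /mnt/usbbackup

# Level-0 (full) dump of the root filesystem to a file on the USB drive.
# -u records the dump in /etc/dumpdates for later incrementals.
dump -0 -u -f /mnt/usbbackup/rootfs.dump /dev/VolGroup00/LogVol00

# From rescue mode, restoring would look roughly like:
#   mount /dev/sdc1 /mnt/usbbackup
#   cd /mnt/sysimage && restore -rf /mnt/usbbackup/rootfs.dump
```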

With the backup made, it was time to shift data, reconfigure, shift data again, and reconfigure again. It would take time and hassle if I slipped from one of these stones but my safety rope gave me the confidence to step and leap towards the far bank.
  1. fdisk -l -- to show the partitions and types at the start. There were 2 drives, with 2 partitions each. /dev/sda1 and /dev/sdb1 were the previously mirrored /boot partitions, so for now, they'll be left alone. /dev/sda2 and /dev/sdb2 were the previously mirrored PV partitions mapped to /VolGroup00/LogVol00 (root) and LogVol01 (swap). Only /dev/sda2 was now active as an LVM physical volume (PV). /dev/sdb2 was still a "Linux LVM" partition type, but was no longer mounted or included in the LVM Volume Group VolGroup00.
  2. fdisk /dev/sdb -- to change the type id for the larger partition (i.e. sdb2) to 'fd' ("Linux raid autodetect") (fdisk inputs: t, 2, fd, w -- to change the type id, on partition 2, to 'fd', and then write).
  3. mdadm --zero-superblock /dev/sdb2 -- Since this wasn't part of a Linux software RAID array before, this step wasn't really necessary, but it's handy to know that the superblock is how Linux software RAID recognizes that a physical partition is part of a defined array. Zeroing the superblock is essentially the permanent removal of the partition from any defined array, so it avoids a warning message that would appear if the partition is specified as one of the devices when creating a new array.
  4. mdadm --create /dev/md0 --verbose --level=1 --raid-devices=2 missing /dev/sdb2 -- This creates a new array with a device name of /dev/md0 that is a mirrored pair and specifies that one of the devices will be added later ('missing') and the other device should be /dev/sdb2 (which just had its superblock zeroed so it could be added to this array without warnings.)
  5. pvcreate /dev/md0 -- This tells LVM that it can use /dev/md0 as a physical volume (PV) which can be added to a volume group. Note: LVM manages the space provided in a PV as a set of "extents", so there is no need to create a file system like ext3 on /dev/md0 first.
  6. vgdisplay -- Show information about the LVM volume groups on the system. This allows verification of the VolGroup00 name to which the new 1/2 mirror array will be added.
  7. vgextend -v VolGroup00 /dev/md0 -- This adds the PV /dev/md0 (an LVM physical volume which is referenced by its physical device name) to the volume group named VolGroup00. Now any data that is on a logical volume in the VolGroup00 volume group can be moved to the extents that are available on the /dev/md0 PV.
  8. pvdisplay - shows that there are now 2 physical volumes (PV's) /dev/sda2 and /dev/md0 that are both allocated to the VolGroup00 volume group.
  9. pvmove -v /dev/sda2 /dev/md0 -- This gives LVM instructions to migrate the logical volumes (e.g. LogVol00 and LogVol01) from the PV where all their extents are currently stored (/dev/sda2) over to the new PV (/dev/md0). Note: This may run for a while, but the -v switch makes it report progress until it's done.
  10. pvdisplay - Shows the details of the two LVM PV's in order to verify that /dev/md0 has all of its PE's (extents) allocated now (Free PE = 0), and /dev/sda2 has all of its PE's free (Allocated PE = 0)
  11. vgreduce VolGroup00 /dev/sda2 -- This removes the PV (/dev/sda2) from the LVM volume group VolGroup00. Until this is done, the /dev/sda2 OS device cannot be reclaimed from LVM control.
  12. pvremove /dev/sda2 -- Since the /dev/sda2 PV is now no longer allocated to any volume group, this actually removes the corresponding /dev/sda2 partition from LVM control and frees it for use outside LVM
  13. fdisk /dev/sda -- to change the type id for the now freed partition (sda2) to 'fd' ("Linux raid autodetect") (fdisk inputs: t, 2, fd, w -- to change the type id, on partition 2, to 'fd', and then write).
  14. mdadm --add /dev/md0 /dev/sda2 -- This completes the 2 volume RAID1 mirrored pair named /dev/md0 by adding back the partition (/dev/sda2) from which data was just migrated onto the array via pvmove. Now mdadm will automatically sync the data that was in the array (physically stored on /dev/sdb2) back over to the device that was added to the array (/dev/sda2).
  15. watch cat /proc/mdstat -- Monitors the sync up progress. This will probably take just about as long as pvmove took to migrate the data to the /dev/md0 PV.
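For reference, the whole migration condenses into the following sketch (device and volume-group names are this machine's; nearly every command here is destructive in some way, so a verified backup comes first):

```shell
fdisk -l                                    # survey the partitions
# In "fdisk /dev/sdb": t, 2, fd, w  (type -> Linux raid autodetect)
mdadm --zero-superblock /dev/sdb2           # forget any old array metadata
mdadm --create /dev/md0 --verbose --level=1 --raid-devices=2 \
      missing /dev/sdb2                     # degraded mirror: sdb2 only
pvcreate /dev/md0                           # make the array an LVM PV
vgextend -v VolGroup00 /dev/md0             # add the PV to the volume group
pvmove -v /dev/sda2 /dev/md0                # migrate every extent off sda2
vgreduce VolGroup00 /dev/sda2               # drop sda2 from the volume group
pvremove /dev/sda2                          # release sda2 from LVM entirely
# In "fdisk /dev/sda": t, 2, fd, w
mdadm --add /dev/md0 /dev/sda2              # complete the mirror; resync starts
watch cat /proc/mdstat                      # monitor the resync
```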
A few VERY Important Final Steps
The question that kept bugging me as I did all this was, "Will the boot process work correctly after these changes next time I reboot?" The answer to that required a few trips back to rescue mode booted from the CentOS install DVD. I suspect I would not have had to make those trips if I had known the following things.
  1. It appeared that the configuration file for mdadm (/etc/mdadm.conf in CentOS 5.x) needed to be correctly configured to reference the member devices and UUID of the /dev/md0 array. I was still a little fuzzy on how this config file could affect things at boot time. After all, the file is stored in the filesystem that is managed by LVM in a Logical Volume that in turn stores its data as extents in a physical volume (PV) that IS the software RAID array. I couldn't find anything in the /boot/grub setup that defined anything about the RAID array or LVM, so how could /etc/mdadm.conf affect the initialization of the software RAID array at boot time, BEFORE the array or LVM is even activated? It doesn't. The boot process happens in stages, and the stage where the boot and root filesystems are re-mounted read/write is where /etc/mdadm.conf matters. So it must be correct. These commands will update it to reflect the actual, active Software RAID arrays.
    mv /etc/mdadm.conf /etc/mdadm.conf.old
    echo "DEVICE partitions" > /etc/mdadm.conf
    echo "MAILADDR root" >> /etc/mdadm.conf
    mdadm --examine --scan >> /etc/mdadm.conf
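The resulting /etc/mdadm.conf ends up looking something like this (the ARRAY line is the one `mdadm --examine --scan` generates; the UUID below is a made-up placeholder, not a real value):

```
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=a1b2c3d4:00000000:00000000:00000000
```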

  2. The init ram drive (initrd) boot image MUST be updated to include support for the software RAID (mdadm/dm-*) modules, so one of the things that I needed to do was build another initrd-{version}.img file in the /boot directory (saving the old one just in case). The reason nothing in the grub config files offers a clue to why booting fails is that the relevant stuff is actually embedded within the init ram drive file /boot/initrd-{version}.img. Building a new initrd can be done using mkinitrd, which generates a new init script that starts up Software RAID (md) and LVM based on the current, active system configuration. The generated init script embedded inside the initrd-{version}.img file will then have a section like this that bootstraps Software RAID and LVM just enough to continue the boot process using resources on a root filesystem that is accessible only with RAID and LVM active:
    echo Scanning and configuring dmraid supported devices

    raidautorun /dev/md0
    echo Scanning logical volumes
    lvm vgscan --ignorelockingfailure
    echo Activating logical volumes
    lvm vgchange -ay --ignorelockingfailure VolGroup00
    resume /dev/VolGroup00/LogVol01
    echo Creating root device.
    mkrootdev -t ext3 -o defaults,ro /dev/VolGroup00/LogVol00
    echo Mounting root filesystem.
    mount /sysroot
    So, here's what I should have done BEFORE trying to reboot, while all the Software RAID and LVM pieces were still running, so that mkinitrd could detect the RAID devices (/dev/md*), detect the LVM configuration (/dev/mapper/*), include any required boot-time kernel modules, and generate a workable init script...
    mv /boot/initrd-$(uname -r).img /boot/initrd-$(uname -r).img.bak
    mkinitrd -v /boot/initrd-$(uname -r).img $(uname -r)
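Before rebooting, it's worth peeking inside the regenerated image to confirm the RAID and LVM bits actually made it in. A sketch (on CentOS 5 the initrd is a gzipped cpio archive, so it can be unpacked and inspected by hand):

```shell
# Unpack the new initrd into a scratch directory.
mkdir /tmp/initrd-check
cd /tmp/initrd-check
zcat /boot/initrd-$(uname -r).img | cpio -i --make-directories

# The embedded init script should contain the raidautorun and lvm lines.
grep -E 'raidautorun|lvm' init

# The dm-* (and raid1) kernel modules should be present as well.
ls lib/
```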

This blog post is partially a way to capture some of the solution and share it with someone else who is perplexed by the same issue, and the remainder of it is me taking an opportunity to write something on this blog that isn't totally useless. I hope you have enjoyed it as much as I have.

Addendum on Fixing initrd from a Rescue Environment
As you may have gathered, I tried to reboot before I had initrd-{version}.img updated to reflect the correct boot-time system configuration. I'm sure someone knows a faster, easier way to get initrd fixed from a rescue CD/DVD boot environment, but as is often the case, trudging through it helped me understand better how things work. The process for manually repairing this part of the boot environment consisted of mounting the boot partition, unpacking the initrd file, adding missing files, editing the init script, and repacking the initrd file for the next reboot. If you're in a similar spot, for this or some other reason, the following steps and references may help.

Manually repairing the init script and/or contents in the initrd (init ram drive) image.
REFERENCE: http://www.ibm.com/developerworks/linux/library/l-initrd.html
REFERENCE: http://wiki.openvz.org/Modifying_initrd_image
  1. boot in rescue mode from the install CD/DVD but skip mounting the root filesystem
  2. mount the boot partition
    mkdir /boot
    mount /dev/sda1 /boot
  3. Make a copy of the existing initrd-{version}.img file with a .gz extension and uncompress it
    cd /boot
    cp initrd-2.6.18-53.1.6.el5.img initrd_for_unpack.img.gz
    gunzip initrd_for_unpack.img.gz
  4. Create a temp directory and unpack the initrd.img contents into it
    mkdir initrd_unpacked
    cd initrd_unpacked
    cpio -i --make-directories < ../initrd_for_unpack.img
  5. Find and fix what's wrong with the init script or initrd contents (e.g. add missing modules, correct device name references in the init script, etc.). For example:
    mkdir /mnt/sysimage
    mount {actual-path-to-root-fs} /mnt/sysimage
    cp /mnt/sysimage/lib/modules/{vers}/kernel/drivers/md/dm-zero.ko lib/
    cp /mnt/sysimage/lib/modules/{vers}/kernel/drivers/md/dm-mod.ko lib/
    cp /mnt/sysimage/lib/modules/{vers}/kernel/drivers/md/dm-mirror.ko lib/
    cp /mnt/sysimage/lib/modules/{vers}/kernel/drivers/md/dm-snapshot.ko lib/
    vi init
  6. Re-pack a new initrd.img file and compress it
    find ./ | cpio -o -H newc > ../initrd_fixed.img
    cd ..
    gzip initrd_fixed.img
  7. Replace the existing initrd file with the fixed one.
    mv initrd-2.6.18-53.1.6.el5.img initrd-2.6.18-53.1.6.el5.img.bak
    mv initrd_fixed.img.gz initrd-2.6.18-53.1.6.el5.img
  8. Unmount whatever was manually mounted and reboot
    umount /boot
    umount /mnt/sysimage
