Tuesday, December 29, 2009

Hardware FakeRAID + LVM == A Big Mess

Every time I install a new distribution of Linux on newer hardware, I wonder just what new "features" will start brewing together in a little digital tempest until the universe of malicious luck senses that ideal, worst possible moment and twists the software around the hardware in a big convoluted knot with glue poured on it.

This time it was CentOS 5.x, a VIA brand hardware raid controller, and the Linux Volume Manager. Not knowing much about how Linux interacts with hardware RAID, I just set everything up at install time using the installer's suggested defaults. What could possibly go wrong? Doh! Never ask that question, even silently in your thoughts. You're just asking to get smacked in the head with the answer.

Then it came, while I was out of town and depending on this machine to grant me access to the rest of the computers in the house via ssh tunnels. The RAID1 mirrored volumes got out of sync, the kernel exhibited its most endearing neurosis and reported its panic, leaving me locked out until I could get home and alleviate the poor kernel's anxiety attack. It isn't a very good idea to tell a kernel that everything is okay until you have actually fixed something, so that at least you can believe you're telling it the truth. So, on to the harder part of fixing something: figuring out what needs to be fixed.

In the days that followed, I managed to disable one of the drives in the mirrored pair and get things running again without the benefit of RAID. However, in the process, the Linux Volume Manager decided that since there were two physical volumes with the same UUID, it should mount and use one and hide the other. LVM hid the /dev/sdb2 partition in a secret house of mirrors to thwart every one of my attempts to reach it, no matter how clever. Looking back, I wonder if it anticipated that I'd open the case and boot up with only one drive's SATA cable plugged in, but I somehow managed to do that before LVM decided to delete all my data out of spite. The plan was to use pvchange --uuid on one of the drives so that LVM could tell them apart, but that doesn't work while the drive is in use. Rebooting in rescue mode from a CD/DVD (the CentOS 5.1 install disc) left the drive unmounted and allowed the pvchange --uuid command to do its thing.
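
For reference, the rescue-mode fix amounted to something like the following. This is a sketch from memory; it assumes the duplicate PV is /dev/sdb2 and that nothing on it is mounted or active, and on some rescue images the LVM tools have to be invoked through the single lvm binary (e.g. lvm pvchange ...).
    pvscan                        # show the duplicate PV UUIDs that LVM is complaining about
    pvchange --uuid /dev/sdb2     # assign a new, random UUID to the hidden PV
    pvscan                        # confirm the two PVs now report different UUIDs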

After many unfruitful Google searches with every permutation of command names, error message fragments, and configuration file names I could muster, I tripped over a description of something called fake-RAID. All the symptoms of what happened to my Linux machine seemed to say that this was the disease it had contracted. LVM reported a duplicate volume. The BIOS RAID1 mirrored pair still showed up to the OS as two individual devices. The rear fan made a sneezing noise. Okay, that last one probably wasn't related to the fake-RAID thing. The same article that described fake-RAID recommended against using hardware RAID on a disk controller that works that way, because the Linux drivers are an afterthought from the vendor at best, and most likely guesswork from the Linux kernel developers. Reboot, enter the BIOS RAID utility, delete the RAID1 array, verify that POST shows two standalone drives in normal non-RAID SATA mode, done.

I have to say it was hard to suppress the feeling of accomplishment that came from making the hardware RAID controller do nothing instead of doing half of something, but my euphoria was short-lived. I still had a river crossing ahead of me on the slippery stones that are Linux configuration files, and I didn't want to cross without at least the illusion of a safety rope. The rope couldn't be a worthy illusion if it were tied to a log floating in the river where the hazards lay, so I plugged in an external USB hard drive and used the dump command to make my "rope." That took about 8 hours at 5MB/sec to transfer the 144GB root filesystem that I would soon attempt to destroy, multiple times.
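
For the curious, the backup boiled down to something like this. It's only a sketch; the USB device name and the dump file path are placeholders, and -0 requests a full, level-zero dump.
    mkdir -p /mnt/usb
    mount /dev/sdc1 /mnt/usb                      # external USB drive; the device name will vary
    dump -0 -f /mnt/usb/rootfs-level0.dump /      # full dump of the root filesystem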

With the backup made, it was time to shift data, reconfigure, shift data again, and reconfigure again. Slipping from one of these stones would have cost time and hassle, but my safety rope gave me the confidence to step and leap toward the far bank. (A consolidated sketch of the full command sequence follows the step-by-step list.)
  1. fdisk -l -- to show the partitions and types at the start. There were 2 drives, with 2 partitions each. /dev/sda1 and /dev/sdb1 were the previously mirrored /boot partitions, so for now, they'll be left alone. /dev/sda2 and /dev/sdb2 were the previously mirrored PV partitions mapped to /VolGroup00/LogVol00 (root) and LogVol01 (swap). Only /dev/sda2 was now active as an LVM physical volume (PV). /dev/sdb2 was still a "Linux LVM" partition type, but was no longer mounted or included in the LVM Volume Group VolGroup00.
  2. fdisk /dev/sdb -- to change the type id for the larger partition (i.e. sdb2) to 'fd' (Linux raid autodetect). (fdisk inputs: t, 2, fd, w -- change the type id, on partition 2, to 'fd', then write.)
  3. mdadm --zero-superblock /dev/sdb2 -- Since this wasn't part of a Linux software RAID array before, this step wasn't really necessary, but it's handy to know that the superblock is how Linux software RAID recognizes that a physical partition is part of a defined array. Zeroing the superblock is essentially the permanent removal of the partition from any defined array, so it avoids a warning message that would appear if the partition is specified as one of the devices when creating a new array.
  4. mdadm --create /dev/md0 --verbose --level=1 --raid-devices=2 missing /dev/sdb2 -- This creates a new array with a device name of /dev/md0 that is a mirrored pair and specifies that one of the devices will be added later ('missing') and the other device should be /dev/sdb2 (which just had its superblock zeroed so it can be added to this array without warnings).
  5. pvcreate /dev/md0 -- This tells LVM that it can use /dev/md0 as a physical volume (PV) which can be added to a volume group. Note: LVM manages the space provided in a PV as a set of "extents", so there is no need to create a file system like ext3 on /dev/md0 before handing it over to LVM.
  6. vgdisplay -- Show information about the LVM volume groups on the system. This allows verification of the VolGroup00 name to which the new 1/2 mirror array will be added.
  7. vgextend -v VolGroup00 /dev/md0 -- This adds the PV /dev/md0 (an LVM physical volume which is referenced by its physical device name) to the volume group named VolGroup00. Now any data that is on a logical volume in the VolGroup00 volume group can be moved to the extents that are available on the /dev/md0 PV.
  8. pvdisplay -- Shows that there are now two physical volumes (PVs), /dev/sda2 and /dev/md0, both allocated to the VolGroup00 volume group.
  9. pvmove -v /dev/sda2 /dev/md0 -- This instructs LVM to migrate the logical volumes (e.g. LogVol00 and LogVol01) from the PV where all their extents are currently stored (/dev/sda2) over to the new PV (/dev/md0). Note: This may run for quite a while, but the -v switch makes it report progress until it's done.
  10. pvdisplay -- Shows the details of the two LVM PVs in order to verify that /dev/md0 now has all of its PEs (extents) allocated (Free PE = 0), and that /dev/sda2 has all of its PEs free (Allocated PE = 0).
  11. vgreduce VolGroup00 /dev/sda2 -- This removes the PV (/dev/sda2) from the LVM volume group VolGroup00. Until this is done, the /dev/sda2 OS device cannot be reclaimed from LVM control.
  12. pvremove /dev/sda2 -- Since the /dev/sda2 PV is no longer allocated to any volume group, this actually removes the corresponding /dev/sda2 partition from LVM control and frees it for use outside LVM.
  13. fdisk /dev/sda -- to change the type id for the now-freed partition (sda2) to 'fd' (Linux raid autodetect). (fdisk inputs: t, 2, fd, w -- change the type id, on partition 2, to 'fd', then write.)
  14. mdadm --add /dev/md0 /dev/sda2 -- This completes the two-volume RAID1 mirrored pair named /dev/md0 by adding back the partition (/dev/sda2) from which data was just migrated onto the array via pvmove. mdadm will now automatically sync the data in the array (physically stored on /dev/sdb2) over to the newly added device (/dev/sda2).
  15. watch cat /proc/mdstat -- Monitors the sync up progress. This will probably take just about as long as pvmove took to migrate the data to the /dev/md0 PV.
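
For anyone who just wants the commands at a glance, here is the whole migration condensed into one sketch. It assumes the same layout described above (/dev/sda2 and /dev/sdb2 as the LVM partitions, VolGroup00 as the volume group); read the step-by-step notes before running any of it.
    fdisk /dev/sdb                                  # t, 2, fd, w -> type 'fd' (Linux raid autodetect) on sdb2
    mdadm --zero-superblock /dev/sdb2
    mdadm --create /dev/md0 --level=1 --raid-devices=2 missing /dev/sdb2
    pvcreate /dev/md0
    vgextend VolGroup00 /dev/md0
    pvmove -v /dev/sda2 /dev/md0                    # long-running; -v reports progress
    vgreduce VolGroup00 /dev/sda2
    pvremove /dev/sda2
    fdisk /dev/sda                                  # t, 2, fd, w -> type 'fd' on sda2
    mdadm --add /dev/md0 /dev/sda2
    watch cat /proc/mdstat                          # wait for the resync to finish
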
A few VERY Important Final Steps
The question that kept bugging me as I did all this was, "Will the boot process work correctly after these changes next time I reboot?" The answer to that required a few trips back to rescue mode booted from the CentOS install DVD. I suspect I would not have had to make those trips if I had known the following things.
  1. It appeared that the configuration file for mdadm (/etc/mdadm.conf in CentOS 5.x) needed to correctly reference the /dev/md0 array and its UUID. I was still a little fuzzy on how this config file could affect things at boot time. After all, the file is stored in a filesystem that is managed by LVM, in a logical volume that in turn stores its data as extents in a physical volume (PV) that IS the software RAID array. I couldn't find anything in the /boot/grub setup that defined anything about the RAID array or LVM, so how could /etc/mdadm.conf affect the initialization of the software RAID array at boot time, BEFORE the array or LVM is even activated? It doesn't. The boot process happens in stages, and the stage where the boot and root filesystems are re-mounted read/write is where /etc/mdadm.conf matters, so it must be correct. These commands will update it to reflect the actual, active Software RAID arrays.
    mv /etc/mdadm.conf /etc/mdadm.conf.old
    echo "DEVICE partitions" > /etc/mdadm.conf
    echo "MAILADDR root" >> /etc/mdadm.conf
    mdadm --examine --scan >> /etc/mdadm.conf

  2. The init ram drive (initrd) boot image MUST be updated to include support for the software RAID (mdadm/dm-*) modules, so one of the things I needed to do was build another initrd-{version}.img file in the /boot directory (saving the old one, just in case). The reason nothing in the grub config files offers a clue to why booting fails is that the relevant stuff is actually embedded within the init ram drive file /boot/initrd-{version}.img. Building a new initrd can be done with mkinitrd, which generates a new init script that starts up Software RAID (md) and LVM based on the current, active system configuration. The generated init script embedded inside the initrd-{version}.img file will then have a section like this that bootstraps Software RAID and LVM just enough to continue the boot process using resources on a root filesystem that is accessible only with RAID and LVM active:
    ...
    echo Scanning and configuring dmraid supported devices

    raidautorun /dev/md0
    echo Scanning logical volumes
    lvm vgscan --ignorelockingfailure
    echo Activating logical volumes
    lvm vgchange -ay --ignorelockingfailure VolGroup00
    resume /dev/VolGroup00/LogVol01
    echo Creating root device.
    mkrootdev -t ext3 -o defaults,ro /dev/VolGroup00/LogVol00
    echo Mounting root filesystem.
    mount /sysroot
    ...
    So, with all the Software RAID and LVM pieces running, mkinitrd can detect the RAID devices (/dev/md*), detect the LVM configuration (/dev/mapper/*), include any required boot-time kernel modules, and generate a workable init script. A few sanity checks are sketched after this list, but here's what I should have done BEFORE trying to reboot...
    mv /boot/initrd-$(uname -r).img /boot/initrd-$(uname -r).img.bak
    mkinitrd -v /boot/initrd-$(uname -r).img $(uname -r)
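
Before rebooting, a few read-only checks can confirm that everything lines up. This is a sketch that assumes the array is /dev/md0 and the volume group is VolGroup00, as above.
    mdadm --detail /dev/md0             # array should be clean/active (or still resyncing) with both members listed
    grep ARRAY /etc/mdadm.conf          # the md0 array should be referenced here by UUID
    zcat /boot/initrd-$(uname -r).img | cpio -it | grep -Ei 'raid|dm-|lvm'    # RAID/LVM pieces made it into the new initrd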

This blog post is partially a way to capture some of the solution and share it with someone else who is perplexed by the same issue, and the remainder of it is me taking an opportunity to write something on this blog that isn't totally useless. I hope you have enjoyed it as much as I have.

Addendum on Fixing initrd from a Rescue Environment
As you may have gathered, I tried to reboot before I had initrd-{version}.img updated to reflect the correct boot-time system configuration. I'm sure someone knows a faster, easier way to get initrd fixed from a rescue CD/DVD boot environment, but as is often the case, trudging through it helped me understand better how things work. The process for manually repairing this part of the boot environment consisted of mounting the boot partition, unpacking the initrd file, adding missing files, editing the init script, and repacking the initrd file for the next reboot. If you're in a similar spot, for this or some other reason, the following steps and references may help.

Manually repairing the init script and/or contents in the initrd (init ram drive) image.
REFERENCE: http://www.ibm.com/developerworks/linux/library/l-initrd.html
REFERENCE: http://wiki.openvz.org/Modifying_initrd_image
  1. boot in rescue mode from the install CD/DVD but skip mounting the root filesystem
  2. mount the boot partition
    mkdir /boot
    mount /dev/sda1 /boot
  3. Make a copy of the existing initrd-{version}.img file with a .gz extension and uncompress it
    cd /boot
    cp initrd-2.6.18-53.1.6.el5.img initrd_for_unpack.img.gz
    gunzip initrd_for_unpack.img.gz
  4. Create a temp directory and unpack the initrd.img contents into it
    mkdir initrd_unpacked
    cd initrd_unpacked
    cpio -i --make-directories < ../initrd_for_unpack.img
  5. Find and fix what's wrong with the init script or initrd contents (e.g. add missing modules, correct device name references in the init script, etc.). For example:
    mkdir /mnt/sysimage
    mount {actual-path-to-root-fs} /mnt/sysimage
    cp /mnt/sysimage/lib/modules/{vers}/kernel/drivers/md/dm-zero.ko lib/
    cp /mnt/sysimage/lib/modules/{vers}/kernel/drivers/md/dm-mod.ko lib/
    cp /mnt/sysimage/lib/modules/{vers}/kernel/drivers/md/dm-mirror.ko lib/
    cp /mnt/sysimage/lib/modules/{vers}/kernel/drivers/md/dm-snapshot.ko lib/
    ...etc...
    vi init
  6. Re-pack a new initrd.img file and compress it
    find ./ | cpio -o -H newc > ../initrd_fixed.img
    cd ..
    gzip initrd_fixed.img
  7. Replace the existing initrd file with the fixed one.
    mv initrd-2.6.18-53.1.6.el5.img initrd-2.6.18-53.1.6.el5.img.bak
    mv initrd_fixed.img.gz initrd-2.6.18-53.1.6.el5.img
  8. Unmount whatever was manually mounted and reboot (a quick check of the repacked image is sketched after this list)
    umount /boot
    umount /mnt/sysimage
    reboot
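
Before that final reboot, one last check is worth the few seconds it takes. It's just a sketch that confirms the repacked file is a valid gzip-compressed cpio archive and that the edited init script made it in.
    cd /boot
    zcat initrd-2.6.18-53.1.6.el5.img | cpio -it > /dev/null && echo "repacked archive unpacks cleanly"
    zcat initrd-2.6.18-53.1.6.el5.img | cpio -it | grep init      # the edited init script should be listed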


Saturday, December 19, 2009

Tech Note: Sendmail MASQUERADE_AS

Some email service providers, including the one I use (1and1.com), have recently clamped down on spam by refusing email with an invalid sender domain in the header or the from-address. While I'm glad to see less garbage in my inbox, I had to dive into the Linux / sendmail configuration rabbit hole to get simple notification emails to show up again. The issue is that only the actual registered domain name is likely to be considered valid by these clamped-down email recipients' hosts. In other words, an email from backupsystem@machineabc.myregistereddomain.com will be refused, but backupsystem@myregistereddomain.com would work just fine.

So, after digging through web search results and documentation for a while, I found that sendmail (the service on Linux/Unix/*nix that forwards email messages) can be configured to "masquerade" local email addresses so that they appear to be sent from the base registered domain. Following the documentation at http://www.sendmail.org/m4/masquerading.html, I set the following configuration options in /etc/mail/sendmail.mc:

MASQUERADE_AS(`myregistereddomain.com')
FEATURE(`masquerade_envelope')
FEATURE(`masquerade_entire_domain')
MASQUERADE_DOMAIN(`myregistereddomain.com')

Then, a few commands to activate the configuration...
m4 /etc/mail/sendmail.mc > /etc/mail/sendmail.cf
service sendmail restart

Of course this all has to be done as the root account since the modifications are being made to system configuration files. So, the next step was to send a simple test email to see if it worked, but here's where I stepped on one of those semi-hidden (i.e. inconspicuously documented) Linux software configuration land mines. Still working as the root account, I used the mail command to send a message to a GMail account so I could examine the newly masqueraded headers, from-address, etc.
mail -s "Test Sendmail Masquerading" mytestaccount@gmail.com
type in some text for the message body
type ctrl-d to finish the body, close and send

Note: As of December 2009, GMail still accepts email even if it can't resolve the sender domain, and the GMail message view has a "Show Original" option that displays the original, raw email text including headers.

The Linux aggravation meter went up a notch when I noticed that the headers in the email had NOT been masqueraded as advertised. They were all passed through in the form of root@machineabc.myregistereddomain.com. The land mine I mentioned was found in the generated /etc/mail/sendmail.cf file as the following few lines:
# class E: names that should be exposed as from this host, even if we masquerade
...
C{E}root


That means the root account is excluded from the masquerading rules. This makes sense, because most of the email originating from the root account would pertain to the specific machine, and it would usually be helpful for the machine name to be included in the sender header or from-address. However, the recently clamped-down email recipients' hosts simply refuse to accept such emails. The documentation does mention that the root account will always be "exposed", but I hadn't noticed that before. Grrrr.

If you're not a sendmail administration expert, it may not be obvious how to include the root account in the masquerading. Notification emails that originate from a non-root user on the Linux machine and are delivered to a clamped-down email host work fine; enter the same mail command as another user on the machine and it does exactly what the configuration options would lead you to expect. To keep the root account from being excluded from the masquerading, you need to GET RID OF a configuration option, not add another one. Elsewhere in the sendmail.mc file, there is a line that explicitly excludes root as an EXPOSED USER account:

EXPOSED_USER(`root')dnl

Comment it out by changing it to:

dnl # EXPOSED_USER(`root')dnl

And remember to run m4 to regenerate the sendmail.cf file and restart the sendmail service again as described above. Now emails sent from the processes that run as the root account should also get the masquerade treatment.
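
To verify the end result, a quick test from both a regular account and root does the trick. This is a sketch; "someuser" is a placeholder for any non-root account on the machine, and the GMail address is the same throwaway test account used earlier.
# Both messages should now arrive with sender addresses at myregistereddomain.com
su - someuser -c 'echo "non-root test" | mail -s "Masquerade test (user)" mytestaccount@gmail.com'
echo "root test" | mail -s "Masquerade test (root)" mytestaccount@gmail.com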

The other option...
...if you have the ability to manage subdomains, and you only have one or a few machines from which you need emails to be delivered, then you can add a subdomain to DNS for each machine. For instance, if you had machineabc set up in the domain myregistereddomain.com, you could add a DNS entry for machineabc.myregistereddomain.com. Then the clamped-down email servers would look up the sender domain for an email from root@machineabc.myregistereddomain.com, find it, and accept the email as usual.
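
As a sketch of what that looks like in a BIND-style zone file (the address 203.0.113.10 is just a documentation placeholder, and the exact syntax depends on your DNS provider):

machineabc    IN    A    203.0.113.10

A quick "dig +short machineabc.myregistereddomain.com" afterwards should return that address once the record has propagated.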

I'm posting this to save someone (maybe even myself) a little time later on. I searched for quite a while to find out why the masquerade options for sendmail were just not working. As usual, they were working; I was simply seeing the effects of the built-in exception case, and didn't even have a good starting point for a search that would yield an answer.