Monday, September 13, 2010

Fighting FakeRAID and GRUB problems

Everybody knows that RAID can be an effective solution for redundancy and general availability, but in a real professional environment it is clearly not enough - a very careful analysis of all the possible points of failure should be made (and people who believe that having a single expensive SAN with RAID5 inside 'covers everything' at enterprise level should be quickly and quietly removed from decision-making positions).

However, for very cheap setups (home or really small business) some compromises have to be made, and given the abysmal reliability but acceptable price of the more recent generations of hard disks it is now a decent solution to use some form of low-cost RAID 1 (mirroring) even for home use.

Now if you have some basic technical skills with computers you already know about the 'integrated RAID' that is now provided at basically no extra cost on most modern motherboards. If you also have some decent Linux skills you also know that the type of RAID provided by those solutions is very specific in needing some (serious) OS-side support, and as such it is referred to in the Linux world as FakeRAID and is always derided in purist circles. However, that is a very narrow and slightly shortsighted view (of the same kind as 'I don't see SMP becoming mainstream …') - for certain scenarios FakeRAID can be a very effective solution if you know what you are doing!

So how bad is FakeRAID ? Well, right now it is not as polished as some of the other OS-level software approaches (like volumes in Windows or Linux) and it is not as fast as some of the very expensive dedicated hardware solutions, however:

a) it is much better for interoperability among different Operating Systems than any of the other software solutions - think of it as hardware vendors forcing Linux and Microsoft to come to an agreement on (basic) volume formats :) On top of that it also neatly fixes the initial boot problem (since the BIOS is aware of it, unlike for instance a Microsoft proprietary solution).

b) it can be a LOT cheaper and also 'easier to recover' than a dedicated hardware solution.

So what problems should you be aware of ?

First of all there is still some lock-in, but this time from the hardware vendors - there are a number of FakeRAID formats and they do not 'play nice' with one another, so a FakeRAID array from one type of motherboard most likely will not work on a totally different motherboard. You can see a more detailed discussion here, but as a quick rule of thumb remember that ideally you should replace the motherboard in such a system with another motherboard with the same chipset for a full 'instant replacement' !

If that kind of drop-in replacement is not possible, things are still not completely lost, since Linux can still see such FakeRAID volumes even on some slightly different hardware (because the differences come from the BIOS level). So you can still read all your data perfectly well, but this time you might need some extra storage to save the data to; then you can re-arrange things in the BIOS for the new FakeRAID format and restore the saved data. That is far more time-consuming but still doable if you care about your data - which is precisely what happened to one of my systems :)

So which are the problems mentioned in the title ? Well, once the theory from above was well clarified (and I was no longer trying to get a 'perfect solution', since the old chipset could no longer be found) I decided to reinstall clean versions of the Operating Systems. Things worked very well for Windows (except Windows 2000, which seems to no longer be supported by all FakeRAID vendors), but surprisingly the boot process looked very tricky on Linux, specifically with some of my USB 'recovery tools' and with new installations. The problem was apparently related to RAID support (since in some early attempts it was booting OK without the RAID), so I lost a huge amount of time experimenting with a number of distributions - most of which started OK from the installation DVD or LiveDVD but were rather tricky when setting up the partitions, and all of which, without exception, finally got to the same end-point - a non-bootable system at GRUB level !!!

I will not list all the stupid tests I have done to clarify that strange behavior but instead I will fast-forward to the actual conclusion - the problem WAS related to RAID, but not to the actual OS-level RAID (well, most of the time at least, see the paragraph on Ubuntu below) but instead to something simpler but far more surprising - the free-memory pattern of the low 640k of memory, which changed when the RAID BIOS was activated !!!

So the bottom line is that even to this day there is a (dumb) bug in both GRUB and GRUB2, which 'assume' that the low memory is free at some fixed (???) addresses from (linear) 0x90000 to about 0x9A000 - where some initial parts of the kernel (and also memtest) are loaded/started (in real mode) !!! And of course that assumption was NOT true once the extra RAID BIOS was active - so no surprise that the system was never bootable! You can easily check for that problem with displaymem at the command prompt of GRUB legacy (which is still a LOT better than GRUB2 when something goes wrong) - if you see any reserved block in that range you will see booting problems!
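The check described above boils down to a simple range-overlap test, which can be sketched in a few lines of Python. This is only an illustration: the two memory maps are made-up values (not taken from my system), the "usable"/"reserved" labels are my own naming, and the upper bound is the approximate 0x9A000 mentioned above, so in practice you would type in whatever displaymem shows on your machine.

```python
# Low-memory range that GRUB seems to assume is free (linear addresses).
# The upper bound is only approximate ("about 0x9A000"); treated as exclusive.
ASSUMED_FREE_START = 0x90000
ASSUMED_FREE_END = 0x9A000


def conflicting_regions(memory_map):
    """Return the non-usable regions overlapping the assumed-free range.

    memory_map is an iterable of (base, length, kind) tuples, where kind is
    a label such as "usable" or "reserved" (hypothetical labels, entered by
    hand from the boot-time memory map).
    """
    conflicts = []
    for base, length, kind in memory_map:
        end = base + length  # exclusive end of this region
        overlaps = base < ASSUMED_FREE_END and end > ASSUMED_FREE_START
        if kind != "usable" and overlaps:
            conflicts.append((base, length, kind))
    return conflicts


# Hypothetical example maps: one without and one with an extra RAID BIOS
# reserving a larger chunk at the top of the low 640k.
map_without_raid_bios = [
    (0x00000, 0x9FC00, "usable"),
    (0x9FC00, 0x00400, "reserved"),
]
map_with_raid_bios = [
    (0x00000, 0x97000, "usable"),
    (0x97000, 0x09000, "reserved"),
]

if __name__ == "__main__":
    for name, mem_map in [("no RAID BIOS", map_without_raid_bios),
                          ("RAID BIOS active", map_with_raid_bios)]:
        bad = conflicting_regions(mem_map)
        status = "expect boot problems" if bad else "looks fine"
        print(f"{name}: {status}")
        for base, length, kind in bad:
            print(f"  {kind} block at {base:#x}, length {length:#x}")
```

In the second (made-up) map the reserved block starts at 0x97000, inside the assumed-free range, which is the kind of layout that would trigger the problem.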

So how on earth were those Linux DVDs still starting so well ? Well, that also explained something that I had observed starting a few years ago - pretty much all serious distributions are using ISOLINUX to boot from CD/DVD (even if GRUB looked easier to set up and handle) - but of course ISOLINUX / SYSLINUX / EXTLINUX all work OK even with strange low-640k memory maps - so probably, without ever getting to describe the full (potential) problems, the maintainers were just automatically choosing the boot method which 'was working' !!!

Once that became clear the obvious solution was to switch to EXTLINUX - which is a lot more 'different' than you would think, since as far as I know there is no large-scale friendly distribution using it as the final HDD bootloader - many distributions include it somewhere but are optimized around having GRUB, and more recently GRUB2, as the general system bootloader which has to be 'automagically' updated after installing a new kernel or anything :(

So the final solution was to install Ubuntu. One extra 'trick requirement' was that the above system was also used for certain tests, and I wanted both a 32-bit and a 64-bit system side-by-side. So I ended up 'forcing' GRUB legacy on both (but with different targets - I used the partition table for the entire disk for the 32-bit version and the actual Linux64 partition for the 64-bit version), and then 'chaining' from GRUB to a manually-maintained EXTLINUX for the actual booting - things work OK as long as I remember to manually update the EXTLINUX configuration file after major kernel updates. There was also an extra complication - somewhere along my many tests I ended up with a separate small /boot partition (located in the first 8 GB of the disk - at some point in the tests I was ready to accept even the most fantastic explanations, like for instance the BIOS not being able to read in real mode past 8 GB), where EXTLINUX is also placed (starting from the boot sector of that partition) and where I have to copy the kernel and initrd files. On that occasion I also noticed that Ubuntu kernel/initrd files do not have any 'architecture marks' - like x86 or x64 - anywhere in the name, so basically I keep them in separate folders to avoid the inherent name clashes :)
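To give an idea of what that manually-maintained configuration looks like, here is a small Python sketch that prints an EXTLINUX configuration with one entry per architecture, each pointing into its own folder on the /boot partition. Everything specific in it is an assumption made up for the example: the folder names (x86, x64), the kernel and initrd file names, and the root= values are placeholders and not the ones from my system. Only the directive names (DEFAULT, PROMPT, TIMEOUT, LABEL, KERNEL, INITRD, APPEND) are standard SYSLINUX/EXTLINUX keywords.

```python
def extlinux_entry(label, folder, kernel, initrd, root):
    """Build one LABEL stanza; paths are relative to the /boot partition."""
    return "\n".join([
        f"LABEL {label}",
        f"  KERNEL /{folder}/{kernel}",
        f"  INITRD /{folder}/{initrd}",
        f"  APPEND root={root} ro",
    ])


def extlinux_config(entries, default, timeout=50):
    """Build the full configuration text (timeout is in tenths of a second)."""
    header = "\n".join([
        f"DEFAULT {default}",
        "PROMPT 1",
        f"TIMEOUT {timeout}",
    ])
    stanzas = "\n\n".join(extlinux_entry(**entry) for entry in entries)
    return header + "\n\n" + stanzas + "\n"


# Hypothetical entries: same file names for both architectures, which is
# exactly why they have to live in separate folders.
entries = [
    {"label": "ubuntu32", "folder": "x86",
     "kernel": "vmlinuz-EXAMPLE", "initrd": "initrd.img-EXAMPLE",
     "root": "/dev/EXAMPLE-ROOT32"},
    {"label": "ubuntu64", "folder": "x64",
     "kernel": "vmlinuz-EXAMPLE", "initrd": "initrd.img-EXAMPLE",
     "root": "/dev/EXAMPLE-ROOT64"},
]

config_text = extlinux_config(entries, default="ubuntu64")

if __name__ == "__main__":
    print(config_text)
```

After a kernel update the idea is simply to copy the new kernel and initrd into the matching folder and regenerate (or hand-edit) the file with the new names.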

The other tricky part with Ubuntu was handling RAID partitions - it seems that the current 10.04 LTS, in both the LiveCD and the AlternateCD, is not well prepared to handle the changes after you create or seriously change partitions on FakeRAID. So ideally you just create everything on the first boot from one of those CDs and then reboot - after which the installer will detect the partitions just fine and go ahead without any problems! Surprisingly the Alternate x64 also has some bugs, so I ended up installing from Alternate x86 and Live x64 - but the resulting installs worked equally well once correctly set up with EXTLINUX !

So the bottom line is that sometimes you might still see unexpected things with Linux, but a little persistence (and good internet searching skills) can provide a quick workaround - and hopefully in the long run the GRUB bug from above will be fixed and I will no longer need to keep both GRUB and SYSLINUX on my USB recovery tools :)