Understanding the Linux boot process

Last modified: Thu Jul 8 11:41:10 2004

This document explains in moderate detail what happens when a Linux system starts up. As far as possible, I have tried to separate features which are specific to the various Linux distributions from those that are generic. Where this isn't possible -- because the explanation would be too convoluted -- I have used the RedHat set-up as an example. In addition, I have tended to focus on the Intel/PC platform, for the same reason.

To make the process manageable, I have broken it into four stages: the `firmware' stage, the `bootloader' stage, the `kernel' stage, and the `init' stage. These are my names, and they aren't necessarily used by other Linux users. Moreover, it isn't always easy to separate the `firmware' stage from the initial operations of the bootloader. On the PC platform, the firmware is so unintelligent that a separate (software) bootloader is required. On other platforms, notably Sparc machines, the firmware is quite sophisticated, and may be able to load a kernel directly.

Stage 1 (firmware stage)

The purpose of a bootloader is to get at least part of the operating system kernel into memory and running. After that, the kernel can take over the process. However, unless the bootloader is itself in firmware, we must first retrieve it from disk (or wherever else it is stored) before we can run it. The purpose of the firmware stage, therefore, is to get a bootloader into memory and run it.

On the Intel/PC platform, the firmware stage (which does not depend on the operating system) is governed by the BIOS. Most modern PCs (and other types of computer, of course) can boot from floppy disk, hard disk, or CD-ROM. It is common for Sparc-based systems to have built-in network bootloaders in firmware but, at present, this is unusual in the PC world. The BIOS typically provides a mechanism by which the operator can choose the devices that will be used to boot, and it will probably be prepared to try more than one if necessary. The process is slightly different for the different media types.

Bootloader on floppy disk or hard disk

This is usually the simplest situation. On a floppy disk, the first sector is reserved as the boot sector. It must contain executable program code. The BIOS loads the boot sector into memory and then runs it. This process is largely the same whatever the hardware platform.

The situation is similar for PC hard disks, except that it is conventional to divide the hard disk into partitions, and to provide a boot sector for each partition. In the world of DOS, the boot sector was, and remains, combined with the partition table; the partition table controls how much space is allocated to each partition. In addition to the partition boot sectors there is an overall boot sector/partition table called the `master boot record' (MBR). When booting from a hard disk formatted this way, the PC BIOS loads the MBR and executes it as a boot sector; the code in the MBR will then find which partition to boot from, and load and run the boot sector from that partition.
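
Because the MBR is simply the first 512-byte sector of the disk, it is easy to examine. For example (a sketch only: the device name /dev/hda is the usual first IDE disk, and reading it directly needs root privileges):
dd if=/dev/hda of=/tmp/mbr.bin bs=512 count=1
file /tmp/mbr.bin
The file utility will normally report an `x86 boot sector' and summarize the partition table entries it finds in it.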

Linux has no need to follow the convention of partitioning that is meaningful to DOS/Windows, but if the hard disk is to be used with more than one operating system then it is a good idea to do so.

So, when booting from a hard disk, the Linux bootloader can be placed in the MBR, or in a partition boot sector. In the latter case, it won't be the BIOS that loads the Linux bootloader; it will be the bootloader in the master boot record.

Whether the boot disk is a hard disk or a floppy disk, the first stage of the boot process finds a boot sector, which will contain the Linux bootloader, and runs it.

Bootloader on CD-ROM

The ability to boot from a CDROM has been commonplace on most platforms for some years. On some platforms a bootable CDROM has the same structure as a bootable hard disk: a boot sector followed by a load of data. A structure like this is unworkable for PCs, owing to limitations in the BIOS specification. Most modern PCs are, however, able to boot from a CDROM formatted according to the El Torito specification. This process is far more complex than it ought to be. Because the BIOS can't cope with a full-sized bootable Linux filesystem on a CDROM, El Torito requires that the CDROM be provided with an additional bootable filesystem. This filesystem is considered to be `outside' the normal data area of the CDROM, and won't be visible if the CDROM is mounted as a filesystem in the usual way. In fact, although the CDROM itself will normally be formatted with an ISO9660 filesystem, the El Torito bootable image can be of any filesystem type. In practice, the bootable image will be formatted as a floppy disk: a boot sector followed by a filesystem.

When booting from the CDROM, the BIOS finds the bootable filesystem image, loads the boot sector, and makes the rest of the image available through BIOS calls just as it does for a floppy disk. As far as the bootloader is concerned, therefore, the BIOS treats a bootable CDROM as an ordinary CDROM with an `embedded' bootable floppy disk. Booting from CDROM is therefore, in practice, just like booting from a floppy disk. With Linux, this embedded floppy disk is usually formatted with an ext2 filesystem. As with a floppy disk, this filesystem will either become the root filesystem for the next phase of the boot process, or will supply a new, compressed filesystem which will be loaded into memory as a `ramdisk' (see below).

A typical Linux bootable CD-ROM is structured as follows (though this isn't the only way to do it): the ISO9660 volume descriptors and the El Torito boot catalog occupy only a sector or so each, while the main ISO9660 data area and the embedded boot filesystem image each run to many thousands of sectors. Notice that there is a complete ext2 filesystem in the boot filesystem image, along with the boot sector. The boot sector will normally contain LILO code (see below). The filesystem contains the kernel and the initial ramdisk (see below), and the initial ramdisk in turn contains an ext2 filesystem which will become the root filesystem.
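
A bootable CD-ROM of this kind can be created with mkisofs, which implements the El Torito scheme. A minimal sketch (the directory cd_root/ and the image boot/boot.img are only examples; boot.img is assumed to be a 1.44 MB floppy image containing the boot sector, kernel, and initial ramdisk):
mkisofs -o bootcd.iso -b boot/boot.img -c boot/boot.catalog -r cd_root/
The -b option names the boot image, relative to the root of the CD, and -c names the boot catalog that mkisofs will create; because the image is floppy-sized, mkisofs defaults to floppy emulation.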

Bootloader retrieved from network

The problem with booting from a network is that the functionality must be supplied in firmware, because if there is no hard disk, there is no practical place to load network-boot software from. Most PCs do not contain firmware this sophisticated, although some network adaptors have this functionality. Sparc-based workstations generally do have network boot functionality -- in the OpenBoot firmware -- and it is quite comprehensive. Note that there is nothing to stop a PC getting a bootloader with network capabilities from, say, a hard disk or CDROM and then using this to complete the boot process over the network. However, this is not network booting in the sense I am describing here.

To get a bootloader via the network, the workstation must first of all decide where to get it from. This may be configurable at the firmware level or, more often, the workstation will issue a broadcast, and then select a boot server from the replies. Sun Sparc systems typically make a RARP request, broadcasting their hardware MAC address (`Ethernet address'). The reply from the server will contain the IP number assigned to the workstation, and that of the server itself. The workstation then uses the server's IP as the target for a TFTP download. Whether this download retrieves a network-aware bootloader, or a whole kernel, varies from one system to another. Some Sparc systems are able to TFTP a Linux kernel and load it directly; others require the retrieval of a network-aware bootloader which then retrieves the kernel (this is how Linux can be made to run on the Sun Javastation network appliance, which has somewhat stunted firmware).
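
On a Sparc machine, for example, the whole sequence is started from the OpenBoot prompt:
ok boot net
The firmware then issues the RARP broadcast and performs the TFTP download without any further intervention.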

Stage 2 (bootloader stage)

So we've got a bootloader into memory, from disk or network, and it can be executed. Its job will be to get the kernel into memory, again either from disk or network, and execute it. The bootloader will have to supply various vital pieces of information to the kernel, crucially the location of its root filesystem.

There are a number of bootloaders available for Linux: on the Intel/PC platform we have LILO and GRUB; on Sparc we have SILO. LILO is probably the best known, and has existed since the earliest days of Linux. SILO is essentially the Sparc port of LILO. GRUB is a much more sophisticated proposition.

LILO

LILO is a very rudimentary, single-stage bootloader. It has little or no knowledge of Linux, and does not understand the structure of any filesystem. Instead, it reads from the disk using BIOS calls, supplying numerical values for the locations on disk of the files it needs. Where does it get these values from? It has no way to figure them out at run-time, so the LILO installer has to supply them in the form of a `map' file. The LILO installer is a utility called lilo; this utility reads a configuration file and builds the map file from it. The location of the map file is then supplied to the boot sector that lilo installs.
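
To make this concrete, here is a sketch of a simple /etc/lilo.conf; the kernel version, label, and partition names are only examples:
boot=/dev/hda
map=/boot/map
prompt
timeout=50
image=/boot/vmlinuz-2.4.20
    label=linux
    root=/dev/hda2
    read-only
    initrd=/boot/initrd-2.4.20.img
Running lilo against this file writes a boot sector to /dev/hda (the MBR of the first IDE disk) and renders the kernel and ramdisk filenames down to sector lists in /boot/map; editing the file has no effect until lilo is run again.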

The bootloading process with LILO thus looks something like this.

  • The firmware loads the LILO boot sector and executes it.
  • LILO loads its map file using BIOS calls. Using the map file it finds the location of the boot message, which it displays to the console, followed by a prompt.
  • The user selects which kernel to boot -- if there's more than one -- at the prompt.
  • LILO loads the kernel using BIOS calls, based on information in the map file it loaded earlier.
  • (optional) LILO loads the initial ramdisk (see below).
  • LILO executes the kernel, indicating where it can find its root filesystem and (if necessary) initial ramdisk.
A problem with LILO is that it can be quite tricky to use it for creating a boot sector for a system different to the one running the LILO installer (lilo). The LILO configuration file (usually /etc/lilo.conf) takes the names of files and devices as its inputs, but these names are never passed through to the boot sector being created. The files and devices referenced are simply analysed for their numerical offsets. For example, if lilo.conf contains the line
root=/dev/cdrom
and /dev/cdrom is a symbolic link to the real device file (perhaps /dev/hdc), it is important to understand that all lilo will store is the major and minor device identifiers of /dev/hdc. It is easy to imagine that if the bootable filesystem you are building contains a file called /dev/cdrom, and that is a link to, say, /dev/hdd, then the root filesystem will be found on /dev/hdd. But it won't; LILO does not understand filesystems, and the names in the configuration file are simply rendered down to device IDs and file sector locations.

GRUB

GRUB is a very different bootloader from LILO. It has a two-stage or three-stage operation, and has network boot capabilities (of course, the network boot facilities don't give you a way to get GRUB itself loaded: you'll still need network boot firmware).

The additional sophistication of GRUB means that it can't easily fit into a single boot sector. It therefore uses a multiple-stage process to load successively larger amounts into memory. In so doing it becomes able to understand filesystems, so the kernel itself, and the other files GRUB uses, can be specified dynamically at boot time; there is no need for explicit numerical maps such as the ones that LILO uses.

In brief, the GRUB boot process looks like this.

  • Stage 1: the firmware loads the GRUB boot sector into memory. This is a standard (512 byte) boot sector and, thus far, the process is the same as for LILO. Encoded in the boot sector are the numerical disk block addresses of the sectors that make up the next stage. GRUB then loads those blocks using BIOS calls.
  • Stage 1.5: this stage is, strictly speaking, optional (hence the fractional name); its purpose is to load the code that understands real filesystems, and GRUB can instead be set to use numerical block offsets just like LILO. With that filesystem knowledge in place, the code for stage 2 is loaded using BIOS calls, but as a named file -- typically /boot/grub/stage2 -- rather than as a list of sectors. On my system this program is about 120 kB in size; clearly we can offer far more sophisticated functionality in a program of this size than in the 5000-or-so bytes of LILO. The fact that GRUB loads its second stage as a file, and not as a list of disk sectors, is the key to its power; LILO can't do this, so you can't do much with it at boot time.
  • Stage 2: GRUB puts up a menu of defined boot options, and exposes a command-line to the operator. The command line can be used to load arbitrary files as kernels and ramdisks (because stage 2 understands filesystems). Each boot option in the GRUB configuration file is expressed in terms of GRUB command-line operations.
  • GRUB executes the commands entered by the operator, either from the configuration file or from the command line prompt. Typical commands are kernel, which loads a kernel into memory; initrd, which loads an initial ramdisk from a file; and boot, which transfers control to the kernel that has been loaded. (A sample configuration file is shown below.)
The functionality offered by GRUB is quite similar to the OpenBoot firmware in Sun workstations, and includes the ability to retrieve kernels from a server using TFTP.
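
For comparison with LILO, a GRUB configuration file (typically /boot/grub/grub.conf or menu.lst) might look something like this; the kernel version and partition names are, again, only examples:
default=0
timeout=10
title Red Hat Linux
    root (hd0,0)
    kernel /vmlinuz-2.4.20 ro root=/dev/hda2
    initrd /initrd-2.4.20.img
Here (hd0,0) is GRUB's name for the first partition of the first disk. Because GRUB understands the filesystem, /vmlinuz-2.4.20 and /initrd-2.4.20.img are ordinary filenames looked up at boot time, and nothing has to be reinstalled when the kernel is replaced; each title entry is simply a canned sequence of the command-line operations described above.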

Multiple-boot machines

Because Linux was designed to be able to co-exist with other operating systems, the bootloader should be able to boot other operating systems on a hard disk as well as Linux. In practice this is relatively straightforward, as each of the other operating systems will have its own boot sector. All the Linux bootloader has to do is to locate the appropriate boot sector, and execute it. After that, the process will be under the control of the other system's bootloader. LILO, GRUB, and SILO all have this functionality.

Stage 3 (kernel stage)

By the time this stage begins, the bootloader will have loaded the kernel into memory, configured it with the location of its root filesystem, and loaded the initial ramdisk, if supplied. How we proceed from here depends to a large extent on whether we are using an initial ramdisk or not.

So why is an initial ramdisk such a big deal? Well, the concept arose from attempts to solve the problem of fitting a fully bootable Linux system onto a single floppy disk. The problem is that a Linux system that will boot as far as giving a shell, and offering a few basic utilities, needs about 8 MB -- far too much to fit onto a floppy. However, such a system will in practice compress down to about 2 MB using gzip compression, so if the root filesystem could be compressed, we could get a working system onto two standard floppies, or a single 2.88 MB floppy.

Another problem that had to be solved was that of booting from a floppy disk and then mounting a root filesystem from a device other than an IDE drive. SCSI drives were particularly problematic: if the kernel was compiled to include all the necessary drivers, it would not fit onto a floppy disk. However, the initial ramdisk technique allows the drivers to be supplied as loadable modules, which can be compressed.

In outline, an initial ramdisk is a root filesystem that is unpacked from a compressed file. The bootloader will load the compressed version into memory, then the kernel uncompresses it and mounts it as the root filesystem. In this way an 8 MB root filesystem can be squeezed into a compressed image small enough to fit on a 2.88 MB floppy. Initial ramdisks are also useful on bootable CDROMs, because the bootable part of the CDROM is typically implemented as an `embedded' floppy disk.
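
The compressed image itself is easy to build by hand. The following is only a sketch (the size and paths are arbitrary), but it shows the general idea: create an empty file, format it as ext2, populate it through a loopback mount, and compress it:
dd if=/dev/zero of=initrd.img bs=1k count=4096
mke2fs -F -m 0 initrd.img
mount -o loop initrd.img /mnt
# ...copy /linuxrc, device nodes, modules, etc. into /mnt...
umount /mnt
gzip -9 initrd.img
The resulting initrd.img.gz is the file that the bootloader loads into memory as the initial ramdisk.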

Stage 3a (common kernel stage)

Whether or not we are using an initial ramdisk, the kernel will begin initializing itself and the hardware devices for which support is compiled in. The process will typically include the following steps.
  • Detect the CPU and its speed, and calibrate the delay loop
  • Initialize the display hardware
  • Probe the PCI bus and build a table of attached peripherals and the resources they have been assigned
  • Initialize the virtual memory management system, including the swapper kswapd
  • Initialize all compiled-in peripheral drivers; these typically include drivers for IDE hard disks, serial ports, real-time clock, non-volatile RAM, and AGP bus. Other drivers may be compiled in, but it is increasingly common to compile as stand-alone modules those drivers that are not required during this stage of the boot process. Note that drivers must be compiled in if they are needed to support the mounting of the root filesystem. If the root filesystem is an NFS share, for example, then drivers must be compiled in for NFS, TCP/IP, and low-level networking hardware
If we aren't using an initial ramdisk, then the next step is to mount the root filesystem. The kernel can then run the first true process from the root filesystem (strictly speaking, kswapd and its associates are not processes but kernel threads). Conventionally this process is /sbin/init, although the choice can be overridden by supplying the init= parameter to the kernel at boot time. The init process runs with uid zero (i.e., as root) and will be the parent of all other processes.
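
For example, at the LILO prompt the parameter is simply appended to the boot selection (here `linux' is assumed to be the label of the kernel entry):
linux init=/bin/sh
This boots the kernel as usual but runs a bare shell in place of /sbin/init -- a common trick for repairing a system whose init configuration is broken.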

Note that kswapd and the other kernel threads have process IDs but, even though they start before init, init still has process ID 1. This is to maintain the Unix convention that init is the first process.

Stage 3b (ramdisk kernel stage)

This stage is only relevant if we are using an initial ramdisk. In this case, the kernel won't immediately invoke init, but will proceed as follows.
  • The kernel unpacks the compressed ramdisk into a normal, mountable ramdisk
  • It then mounts the uncompressed ramdisk as a root filesystem. The original ramdisk memory is freed. It should be obvious that the kernel must have drivers compiled in to support whatever filesystem is in the ramdisk, as it won't be able to load any modules until the root filesystem is visible.
  • The kernel then runs an initialization process. This process will, in general, not be the standard unix init, but a script that will mount the real root filesystem and then launch the next stage of the boot process. Conventionally this script is called /linuxrc but it can be specified to the kernel using the init parameter.
  • /linuxrc does whatever it needs to, in order to make the real root filesystem available, probably including loading some modules. It then mounts the new root filesystem over the top of the ramdisk filesystem.
  • Conventionally /linuxrc then spawns the `real' init process. It will typically do this using the exec command so that init ends up as process number 1, rather than 2.
/linuxrc need not mount a new root filesystem over the top of the ramdisk root, nor need it load init. These activities are simply conventions. For example, in order to boot a full Linux system from a CDROM, a workable proposition is to retain the initial ramdisk as the root filesystem, and have /linuxrc mount the CDROM at, say, /usr. This allows the root filesystem to be read-write; if we mounted the CDROM at /, the root filesystem would be read-only, and we would have to create a separate ramdisk and have a bunch of symbolic links from the CDROM to parts of that ramdisk.

Similarly, a `rescue' disk -- floppy or CDROM -- would probably not want to invoke init, but simply put up a root shell.

If we are using /linuxrc to prepare a root filesystem, it is a good idea to minimize the amount of initialization code in it. This is not because it won't work, but because the correct place for initialization is in the start-up script spawned by init. Doing initialization there, rather than in /linuxrc, ensures that the same initialization code is used whether or not an initial ramdisk is in use.
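
To make the conventional sequence concrete, here is a minimal sketch of a /linuxrc that uses pivot_root (available from kernel 2.4) to switch to the real root filesystem. The module name, devices, and paths are made up for the example:
#!/bin/sh
mount -t proc proc /proc
insmod /lib/modules/aic7xxx.o      # load the driver needed to reach the real root device
mount -t ext2 /dev/sda1 /new_root  # mount the real root filesystem
cd /new_root
pivot_root . initrd                # the directory `initrd' must exist in the new root
exec chroot . /sbin/init <dev/console >dev/console 2>&1
After the pivot_root, the old ramdisk root is left mounted on /initrd in the new root, from where one of the ordinary start-up scripts can unmount it later.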

Stage 4 (init stage)

By now the kernel is loaded, memory management is running, some hardware is initialized, and the root filesystem is in place. All subsequent operations are invoked -- directly or indirectly -- by init. This process takes its instructions -- again by default -- from the file /etc/inittab. inittab specifies at least three important pieces of information.
  • the `runlevel' to enter at startup
  • a command to run to perform basic system initialization (conventionally this is /etc/rc.sysinit)
  • the commands to run on entry to and exit from particular runlevels.
The order of operations is that the initialization command (rc.sysinit) is run first, then the runlevel scripts. The division of work between rc.sysinit and the runlevel scripts is entirely a convention. If you are building a custom Linux system you don't have to follow this convention. In fact, you don't even have to run init if it doesn't do what you need.
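
On a RedHat-style system the corresponding inittab entries look something like this (trimmed down; only one runlevel line is shown):
id:5:initdefault:
si::sysinit:/etc/rc.d/rc.sysinit
l5:5:wait:/etc/rc.d/rc 5
The first line selects the default runlevel, the second names the one-off initialization script, and the third says what to run on entry to runlevel 5.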

Stage 4a (rc.sysinit)

This script or executable is responsible for all the one-off initialization of the system. Linux distributions differ in how the work is divided between this script and the runlevel scripts but, in general, the following initialization steps are likely to be carried out here.
  • Configure the system clock from the hardware clock
  • Set up keymappings for the console(s)
  • Mount the /proc filesystem
  • Set up swap space (if there is any)
  • Mount and check `local' (i.e., non-network) filesystems
  • Run depmod to initialize the module dependency tree. This is important because it makes it possible for modprobe to work (see the example after this list). The kernel's module auto-loader refers to modules by name, not by filename. It also expects that when it tries to load a module by name, any modules on which it depends can also be loaded by name. In a custom boot set-up, you may prefer to load all your modules by filename, and not compile in the auto-loader at all. This speeds the boot process considerably. However, you'll lose the flexibility of dynamically loading and unloading modules for hot-plug devices.
  • Initialize and configure network interfaces. This step usually has to come after the depmod step or its equivalent, because the network drivers are likely to be loaded as modules.
  • Load drivers for USB, PCMCIA, sound, etc. Again, these steps probably load or reference modules.
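
The relationship between depmod and modprobe can be seen directly; the module name eepro100 (an Intel network driver) is just an example:
depmod -a
modprobe eepro100
depmod -a rebuilds the dependency list (modules.dep) for the running kernel; modprobe then loads eepro100 by name, first loading any modules it depends on.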

Stage 4b (runlevel scripts)

Let's assume that we will be entering runlevel 5 which, by convention, gives us a graphical login prompt under the X server. A typical inittab will have entries like this:
l5:5:wait:/etc/rc.d/rc 5
x:5:respawn:/etc/X11/prefdm -nodaemon
The first line says that on entry to runlevel 5, invoke a script called rc, passing the argument `5'. The second line says that on entry to runlevel 5, run /etc/X11/prefdm with the -nodaemon argument; the `respawn' keyword tells init to restart this program if it ever exits. This latter script is somewhat beyond the scope of this article, being in the realm of X display management. In outline, prefdm is a script inserted by the RedHat installer. It contains code that will launch the X display manager selected by the user, either at install time or using a configuration utility. The reason it works this way is so that configuration utilities don't have to mess about with inittab, which is a bad file to mess up if you want your system to keep working. The X display manager will typically invoke the X server (i.e., the graphical display) on the local machine and give you a login prompt.

But back to the `real' boot process... The script rc runs the start scripts in a directory for the runlevel given in inittab. Usually, runlevel N will correspond to a directory /etc/rc.d/rcN.d. As we've decided to enter runlevel 5, the relevant directory is /etc/rc.d/rc5.d. This directory will contain a (possibly large) number of scripts with names beginning with `S' or `K' followed by two digits, e.g., S12syslog. The digits denote the order in which the scripts are executed: the `S' scripts are executed in numerical order on entry to the runlevel (i.e., at boot), and the `K' scripts are executed, again in numerical order, on leaving the runlevel (usually at shutdown). rc passes the argument `start' to each script at startup, and `stop' at shutdown. As a result, we don't really need both `S' and `K' scripts, because we can use the argument to determine whether we are starting or stopping. Thus it is a convention on Linux systems that the `S' and `K' names are simply symbolic links to the same underlying script (usually kept in /etc/rc.d/init.d), which handles both startup and shutdown operations.

So, for example, when entering runlevel 5, somewhere near the beginning of the rc process we will execute

S12syslog start
On shutdown, somewhere towards the end of the shutdown process we will do
K12syslog stop
which is, in fact, an invocation of
S12syslog stop
Inside the script S12syslog -- and most of the other scripts in that directory -- you will find both initialization and finalization code. So what do these scripts do? Well, this depends on the runlevel, and the distribution, and any customizations you have made. A typical set of operations will include the following:
  • Apply firewall settings to IP network interfaces
  • Bring up non-IP networking (e.g., IPX, appletalk)
  • Start the system logger
  • Start the NFS portmapper, lock daemon, etc., and mount any NFS shares specified in /etc/fstab.
  • Start the power management daemon
  • Initialize the auto-mounter
  • Initialize the PCMCIA subsystem, loading drivers and daemons both for the PCMCIA hardware itself, and any cards that are currently inserted
  • Start up the inet daemon (inetd or xinetd) which will take care of accepting incoming network connections
  • Start the printer daemon
  • Start cron
  • Start the X font server
The very last step in the boot process will be to run a script S99local. This is the conventional place to put machine-specific initialization. It is considered bad manners to customize any of the initialization scripts that are supplied as part of a Linux distribution, simply because other people who may have to manage the system will have expectations about what is in them. Making arbitrary changes here will defeat these expectations. However, everybody expects to see machine-specific configuration in S99local.
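
Whether distribution-supplied or local, each of these scripts follows the same start/stop pattern. A minimal sketch -- the daemon name mydaemon and its path are made up for the example -- looks like this:
#!/bin/sh
case "$1" in
  start)
    echo -n "Starting mydaemon: "
    /usr/local/sbin/mydaemon &
    echo "done"
    ;;
  stop)
    echo -n "Stopping mydaemon: "
    killall mydaemon
    echo "done"
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac
The script itself conventionally lives in /etc/rc.d/init.d, and the SNN and KNN names in the runlevel directories are simply links to it.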

Gotchas

It should be clear that the boot process on a fully-featured Linux system is fairly complex. You can simplify it a great deal if you are building a custom Linux system, or if you just want your machine to start up faster. However, there are a few things to watch out for when constructing a custom boot process.
  • The various stages of the boot process are quite well separated, particularly the bootloader stage and the kernel stage. What does this mean? Well, imagine a situation in which we download a network bootloader, which loads a kernel from a file server. The kernel then starts up and wants to initialize its network settings (IP number, etc). Now, the machine had to have an IP number during the bootloader stage, didn't it? Otherwise, how would it have been able to do network operations to fetch the kernel? So one might expect that the kernel could simply get its IP number from the bootloader. The problem is that Linux bootloaders don't know how to supply this information to the kernel. Why should they? The bootloader designers can't anticipate everything that the kernel might need to know. Therefore the kernel must find the machine's IP number, etc., again, independently of what the bootloader may have done. In practice, it's probably going to get the same IP number, but that makes no difference. This causes problems for people who want to build fully-diskless installations (like the Javastation example elsewhere on this site). Your network-boot firmware probably uses RARP or DHCP to find the machine's network settings, but that doesn't mean that you don't need to include the same support in the kernel when you build it: the kernel will have to do it again. When you come to mounting the root filesystem as an NFS mount, you need to make sure that you have a way to tell the kernel where the NFS server is (usually via kernel command-line parameters, but on some dumb systems you have to hard-code them into the kernel before compilation); an example is shown after this list. The kernel has no way to know whether the machine that replied to the RARP or DHCP request is going to be the one to supply the root filesystem.
  • Another problem, which appears different but is in fact much the same, is that of booting from SCSI devices. So, you have a PC or workstation that can boot from a SCSI CDROM drive. The firmware loads the boot sector, which starts the bootloader, which loads the kernel. So far so good. Then the kernel takes over. It tries to mount the SCSI CDROM as a filesystem, but fails. Why? Because SCSI drivers aren't included in the kernel. It is wrong to assume that, because the system could read from the CDROM during boot, the kernel will also be able to read from it. The kernel won't use BIOS calls to read the CDROM, which is what the bootloader will probably do. The kernel will use the standard Linux VFS (virtual filesystem) infrastructure, which will communicate with the SCSI infrastructure, which will communicate with the low-level SCSI device driver, which will communicate with the hardware. To boot a kernel from a SCSI CDROM, you need to make sure that all of these components are available to the kernel as modules (perhaps supplied on an initial ramdisk), or are compiled in.
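
Returning to the network-boot example in the first point above: with NFS-root support compiled into the kernel, the necessary information is typically passed on the kernel command line. The addresses and export path here are, of course, made up:
root=/dev/nfs nfsroot=192.168.0.1:/export/client1 ip=dhcp
root=/dev/nfs tells the kernel that the root filesystem is on NFS, nfsroot= names the server and the exported directory, and ip= tells the kernel how to obtain its own network settings (in this case by DHCP) before it tries the mount.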

   