Cluster

Tmpfs and Bind Mounts

Introduction

In my previous articles in this series, I introduced the benefits of journaling and the ReiserFS and showed how to set up a rock-solid ReiserFS system. In this article, we’re going to tackle a couple of semi-offbeat topics. First, we’ll take a look at tmpfs, also known as the virtual memory (VM) filesystem. Tmpfs is probably the best RAM disk-like system available for Linux right now, and was introduced with Linux kernel 2.4. Then, we’ll take a look at another capability introduced with Linux kernel 2.4 called “bind mounts”, which allow a great deal of flexibility when it comes to mounting (and remounting) filesystems.

Introducing Tmpfs

If I had to explain tmpfs in one breath, I’d say that tmpfs is like a ramdisk, but different. Like a ramdisk, tmpfs can use your RAM, but it can also use your swap devices for storage. And while a traditional ramdisk is a block device and requires a mkfs command of some kind before you can actually use it, tmpfs is a filesystem, not a block device; you just mount it, and it’s there. All in all, this makes tmpfs the niftiest RAM-based filesystem I’ve had the opportunity to meet.

Tmpfs and VM

Let’s take a look at some of tmpfs’s more interesting properties. As I mentioned above, tmpfs can use both RAM and swap. This might seem a bit arbitrary at first, but remember that tmpfs is also known as the “virtual memory filesystem”. And, as you probably know, the Linux kernel’s virtual memory resources come from both your RAM and swap devices. The VM subsystem in the kernel allocates these resources to other parts of the system and takes care of managing these resources behind-the-scenes, often transparently moving RAM pages to swap and vice-versa.

The tmpfs filesystem requests pages from the VM subsystem to store files. tmpfs itself doesn’t know whether these pages are on swap or in RAM; it’s the VM subsystem’s job to make those kinds of decisions. All the tmpfs filesystem knows is that it is using some form of virtual memory.

Not a Block Device

Here’s another interesting property of the tmpfs filesystem. Unlike most “normal” filesystems, like ext3, ext2, XFS, JFS, ReiserFS and friends, tmpfs does not exist on top of an underlying block device. Because tmpfs sits on top of VM directly, you can create a tmpfs filesystem with a simple mount command:

# mount tmpfs /mnt/tmpfs -t tmpfs

After executing this command, you’ll have a new tmpfs filesystem mounted at /mnt/tmpfs, ready for use. Note that there’s no need to run mkfs.tmpfs; in fact, it’s impossible, as no such command exists. Immediately after the mount command, the filesystem is mounted and available for use, and is of type tmpfs. This is very different from how Linux ramdisks are used; standard Linux ramdisks are block devices, so they must be formatted with a filesystem of your choice before you can use them. In contrast, tmpfs is a filesystem. So, you can just mount it and go.

Tmpfs Advantages

Dynamic Filesystem Size

You’re probably wondering about how big that tmpfs filesystem was that we mounted at /mnt/tmpfs, above. The answer to that question is a bit unexpected, especially when compared to disk-based filesystems. /mnt/tmpfs will initially have a very small capacity, but as files are copied and created, the tmpfs filesystem driver will allocate more VM and will dynamically increase the filesystem capacity as needed. And, as files are removed from /mnt/tmpfs, the tmpfs filesystem driver will dynamically shrink the size of the filesystem and free VM resources, and by doing so return VM into circulation so that it can be used by other parts of the system as needed. Since VM is a precious resource, you don’t want anything hogging more VM than it actually needs, and the great thing about tmpfs is that this all happens automatically.

Speed

The other major benefit of tmpfs is its blazing speed. Because a typical tmpfs filesystem will reside completely in RAM, reads and writes can be almost instantaneous. Even if some swap is used, performance is still excellent and those parts of the tmpfs filesystem will be moved to RAM as more free VM resources become available. Having the VM subsystem automatically move parts of the tmpfs filesystem to swap can actually be good for performance, since by doing so, the VM subsystem can free up RAM for processes that need it. This, along with its dynamic resizing abilities, allow for much better overall OS performance and flexibility than the alternative of using a traditional RAM disk.

No Persistence

While this may not seem like a positive, tmpfs data is not preserved between reboots, because virtual memory is volatile in nature. I guess you probably figured that tmpfs was called “tmpfs” for a reason, didn’t you? However, this can actually be a good thing. It makes tmpfs an excellent filesystem for holding data that you don’t need to keep, such as temporary files (those found in /tmp) and parts of the /var filesystem tree.

Using Tmpfs

To use tmpfs, all you need is a modern (2.4+) kernel with Virtual memory file system support (former shm fs) enabled; this option lives under the File systems section of the kernel configuration options. Once you have a tmpfs-enabled kernel, you can go ahead and mount tmpfs filesystems. In fact, it’s a good idea to enable tmpfs in all your kernels if you compile them yourself – whether you plan to use tmpfs or not. This is because you need to have kernel tmpfs support in order to use POSIX shared memory. System V shared memory will work without tmpfs in your kernel, however. Note that you do not need a tmpfs filesystem to be mounted for POSIX shared memory to work; you simply need the support in your kernel. POSIX shared memory isn’t used too much right now, but this situation will likely change as time goes on.

Avoiding low VM conditions

The fact that tmpfs dynamically grows and shrinks as needed makes one wonder: what happens when your tmpfs filesystem grows to the point where it exhausts all of your virtual memory, and you have no RAM or swap left? Well, generally, this kind of situation is a bit ugly. With kernel 2.4.4, the kernel would immediately lock up. With more recent kernels, the VM subsystem has in many ways been fixed, and while exhausting VM isn’t exactly a wonderful experience, things don’t blow up completely, either. When a modern kernel gets to the point where it can’t allocate any more VM, you obviously won’t be unable to write any new data to your tmpfs filesystem. In addition, it’s likely that some other things will happen. First, the other processes on the system will be unable to allocate much more memory; generally, this means that the system will most likely become extremely sluggish and almost unresponsive. Thus, it may be tricky or unusually time-consuming for the superuser to take the necessary steps to alleviate this low-VM condition.

In addition, the kernel has a built-in last-ditch system for freeing memory when no more is available; it’ll find a process that’s hogging VM resources and kill it. Unfortunately, this “kill a process” solution generally backfires when tmpfs growth is to blame for VM exhaustion. Here’s the reason. Tmpfs itself can’t (and shouldn’t) be killed, since it is part of the kernel and not a user process, and there’s no easy way for the kernel to find out which process is filling up the tmpfs filesystem. So, the kernel mistakenly attacks the biggest VM-hog of a process it can find, which is generally your X server if you happen to be running one. So, your X server dies, and the root cause of the low-VM condition (tmpfs) isn’t addressed. Ick.

Low VM: the solution

Fortunately, tmpfs allows you to specify a maximum upper bound for the filesystem size when a filesystem is mounted or remounted. Actually, as of kernel 2.4.6 and util-linux-2.11g, these parameters can only be set on mount, not on remount, but we can expect them to be settable on remount sometime in the near future. The optimal maximum tmpfs size setting depends on the resources and usage pattern of your particular Linux box; the idea is to prevent a completely full tmpfs filesystem from exhausting all virtual memory and thus causing the ugly low-VM conditions that we talked about earlier. A good way to find a good tmpfs upper-bound is to use top to monitor your system’s swap usage during peak usage periods. Then, make sure that you specify a tmpfs upper-bound that’s slightly less than the sum of all free swap and free RAM during these peak usage times.

Creating a tmpfs filesystem with a maximum size is easy. To create a new tmpfs filesystem with a maximum filesystem size of 32 MB, type:

# mount tmpfs /dev/shm -t tmpfs -o size=32m

This time, instead of mounting our new tmpfs filesystem at /mnt/tmpfs, we created it at /dev/shm, which is a directory that happens to be the “official” mount point for a tmpfs filesystem. If you happen to be using devfs, you’ll find that this directory has already been created for you.

Also, if we want to limit the filesystem size to 512 KB or 1 GB, we can specify size=512k and size=1g, respectively. In addition to limiting size, we can also limit the number of inodes (filesystem objects) by specifying the nr_inodes=x parameter. When using nr_inodes, x can be a simple integer, and can also be followed with a k, m, or g to specify thousands, millions, or billions (!) of inodes.

Also, if you’d like to add the equivalent of the above mount tmpfs command to your /etc/fstab, it’d look like this:

tmpfs   /dev/shm        tmpfs   size=32m        0       0

Mounting On Top of Existing Mount Points

Back in the 2.2 days, any attempt to mount something to a mount point where something had already been mounted resulted in an error. However, thanks to a rewrite of the kernel mounting code, using mount points multiple times is not a problem. Here’s an example scenario: let’s say that we have an existing filesystem mounted at /tmp. However, we decide that we’d like to start using tmpfs for /tmp storage. In the old days, your only option would be to unmount /tmp and remount your new tmpfs /tmp filesystem in its place, as follows:

#  umount /tmp
#  mount tmpfs /tmp -t tmpfs -o size=64m

However, this solution may not work for you. Maybe there are a number of running processes that have open files in /tmp; if so, when trying to unmount /tmp, you’d get the following error:

umount: /tmp: device is busy

However, with Linux 2.4+, you can mount your new /tmp filesystem without getting the “device is busy” error:

# mount tmpfs /tmp -t tmpfs -o size=64m

With a single command, your new tmpfs /tmp filesystem is mounted at /tmp, on top of the already-mounted partition, which can no longer be directly accessed. However, while you can’t get to the original /tmp, any processes that still have open files on this original filesystem can continue to access them. And, if you umount your tmpfs-based /tmp, your original mounted /tmp filesystem will reappear. In fact, you can mount any number of filesystems to the same mount point, and the mount point will act like a stack; unmount the current filesystem, and the last-most-recently mounted filesystem will reappear from underneath.

Bind Mounts

Using bind mounts, we can mount all, or even part of an already-mounted filesystem to another location, and have the filesystem accessible from both mount points at the same time! For example, you can use bind mounts to mount your existing root filesystem to /home/drobbins/nifty, as follows:

#  mount --bind / /home/drobbins/nifty

Now, if you look inside /home/drobbins/nifty, you’ll see your root filesystem (/home/drobbins/nifty/etc, /home/drobbins/nifty/opt, etc.). And if you modify a file on your root filesystem, you’ll see the modifications in /home/drobbins/nifty as well. This is because they are one and the same filesystem; the kernel is simply mapping the filesystem to two different mount points for us. Note that when you mount a filesystem somewhere else, any filesystems that were mounted to mount points inside the bind-mounted filesystem will not be moved along. In other words, if you have /usr on a separate filesystem, the bind mount we performed above will leave /home/drobbins/nifty/usr empty. You’ll need an additional bind mount command to allow you to browse the contents of /usr at /home/drobbins/nifty/usr:

#  mount --bind /usr /home/drobbins/nifty/usr

Bind mounting parts of filesystems

Bind mounting makes even more neat things possible. Let’s say that you have a tmpfs filesystem mounted at /dev/shm, its traditional location, and you decide that you’d like to start using tmpfs for /tmp, which currently lives on your root filesystem. Rather than mounting a new tmpfs filesystem to /tmp (which is possible), you may decide that you’d like the new /tmp to share the currently mounted /dev/shm filesystem. However, while you could bind mount /dev/shm to /tmp and be done with it, your /dev/shm contains some directories that you don’t want to appear in /tmp. So, what do you do? How about this:

# mkdir /dev/shm/tmp
# chmod 1777 /dev/shm/tmp
# mount --bind /dev/shm/tmp /tmp

In this example, we first create a /dev/shm/tmp directory and then give it 1777 perms, the proper permissions for /tmp. Now that our directory is ready, we can mount /dev/shm/tmp, and only /dev/shm/tmp to /tmp. So, while /tmp/foo would map to /dev/shm/tmp/foo, there’s no way for you to access the /dev/shm/bar file from /tmp.

As you can see, bind mounts are extremely powerful and make it easy to make modifications to your filesystem layout without any fuss. Next article, we’ll check out devfs; for now, you may want to check out the following resources.

Veritas File System

VxFS

The Veritas filesystem and volume manager have their roots in a fault-tolerant proprietary minicomputer built by Veritas in the 1980s. They have been available for Solaris since at least 1993 and have been ported to AIX and Linux. They are integrated into HP-UX and SCO UNIX, and Veritas Volume Manager code has been used (and extensively modified) in Tru64 UNIX and even in Windows. Over the years, Veritas has made a lot of money licensing their tech, and not because it is cheap, but because it works.

VxFS has never been part of Solaris but, when UFS was the only option, it was a popular addition. VxVM and VxFS are tightly integrated. Through vxassist, one may shrink and grow filesystems and their underlying volumes with minimal trouble. VxVM provides online RAID relayout. If you have a RAID5 and want to turn it into a RAID10, no problem, no downtime. If you need more space, just convert it back to a RAID5. VxVM has a reputation for being cryptic, and to some extent it is, but it’s not so bad and the flexibility is impressive.

VxFS is a fast, extent based, journaled, clusterable filesystem. In fact, it essentially introduced these features to the world, along with direct IO. Newer versions of VxFS and VxVM have the ability to do cross-platform disk sharing. If you ever wanted to unmount a volume from your AIX box and mount it on Linux or Solaris, now you can.

VxFS and VxVM are still closed source. A version is available from Symantec that is free on small servers, with limitations, but I imagine that most users still pay. Pricing starts around $2500 and can be shocking for larger machines. VxFS and VxVM are solid choices for critical infrastructure workloads, including databases.

SAM and QFS

SAM and QFS

SAM and QFS are different things but are closely coupled. QFS is Sun’s cluster filesystem, meaning that the same filesystem may be simultaneously mounted by multiple systems. SAM is a hierarchical storage manager; it allows a set of disks to be used as a cache for a tape library. SAM and QFS are designed to work together, but each may be used separately.

QFS has some interesting features. A QFS filesystem may span multiple disks with no extra LVM needed to do striping or concatenation. When multiple disks are used, data may be striped or round-robined. Round-robin allocation means that each file is written to one or two disks in the set. This is useful since, unlike striping, participation by all disks is not needed to fetch a file – each disk may seek totally independently. QFS also allows metadata to be separated from data. In this way, a few disks may serve the random metadata workload while the rest serve a sequential data workload. Finally, as mentioned before, QFS is an asymmetric cluster filesystem.

QFS cannot manage its own RAID, besides striping. For this, you need a hardware controller, a traditional volume manager, or a raw ZFS volume.

SAM makes a much larger backing store (typically a tape library) look like a regular UNIX filesystem. This is accomplished by storing metadata and often-referenced data on disk, and migrating infrequently used data in and out of the disk cache as needed. SAM can be configured so that all data is staged out to tape, so that if the disk cache fails, the tapes may be used like a backup. Files staged off of the disk cache are stored in tar-like archives, so that potentially random access of small files can become sequential. This can make further backups much faster.

QFS may be used as a local or cluster filesystem for large-file intensive workloads like Oracle. SAM and QFS are often used for huge data sets such as those encountered in supercomputing. SAM and QFS are optional products and are not cheap, but they have recently been released into OpenSolaris.

UNIX File System – UFS

UFS

UFS in its various forms has been with us since the days of BSD on VAXen the size of refrigerators. The basic UFS concepts thus date back to the early 1980s and represent the second pass at a workable UNIX filesystem, after the very slow and simple filesystem that shipped with the truly ancient Version 7 UNIX. Almost all commercial UNIX OSs have had a UFS, and ext3 in Linux is similar to UFS in design. Solaris inherited UFS through SunOS, and SunOS in turn got it from BSD.

Until recently, UFS was the only filesystem that shipped with Solaris. Unlike HP, IBM, SGI, and DEC, Sun did not develop a next-generation filesystem during the 1990s. There are probably at least two reasons for this: most competitors developed their new filesystems using third party code which required per-system royalties, and the availability of VxFS from Veritas. Considering that a lot of the other vendors’ filesystem IP was licensed from Veritas anyway, this seems like a reasonable decision.

Solaris 10 can only boot from a UFS root filesystem. In the future, ZFS boot will be available, as it already is in OpenSolaris. But for now, every Solaris system must have at least one UFS filesystem.

UFS is old technology but it is a stable and fast filesystem. Sun has continuously tuned and improved the code over the last decade and has probably squeezed as much performance out of this type of FS as is possible. Journaling support was added in Solaris 7 at the turn of the century and has been enabled by default since Solaris 9. Before that, volume level journaling was available. In this older scheme, changes to the raw device are journaled, and the filesystem is not journaling-aware. This is a simple but inefficient scheme, and it worked with a small performance penalty. Volume level journaling is now end-of-lifed, but interestingly, the same sort of system seems to have been added to FreeBSD recently. What is old is new again.

UFS is accompanied by the Solaris Volume Manager, which provides perfectly servicible software RAID.

Where does UFS fit in in 2008? Besides booting, it provides a filesystem which is stable and predictable and better integrated into the OS than anything else. ZFS will probably replace it eventually, but for now, it is a good choice for databases, which have usually been tuned for a traditional filesystem’s access characteristics. It is also a good choice for the pathologically conservative administrator, who may not have an exciting job, but who rarely has his nap time interrupted.

Zeta File System – ZFS

ZFS

ZFS has gotten a lot of hype. It has also gotten some derision from Linux folks who are accustomed to getting that hype themselves. ZFS is not a magic bullet, but it is very cool. I like to think that if UFS and ext3 were first generation UNIX filesystems, and VxFS and XFS were second generation, then ZFS is the first third generation UNIX FS.

ZFS is not just a filesystem. It is actually a hybrid filesystem and volume manager. The integration of these two functionalities is a main source of the flexibility of ZFS. It is also, in part, the source of the famous “rampant layering violation” quote which has been repeated so many times. Remember, though, that this is just one developer’s aesthetic opinion. I have never seen a layering violation that actually stopped me from opening a file.

Being a hybrid means that ZFS manages storage differently than traditional solutions. Traditionally, you have a one to one mapping of filesystems to disk partitions, or alternately, you have a one to one mapping of filesystems to logical volumes, each of which is made up of one or more disks. In ZFS, all disks participate in one storage pool. Each ZFS filesystem has the use of all disk drives in the pool, and since filesystems are not mapped to volumes, all space is shared. Space may be reserved, so that one filesystem can’t fill up the whole pool, and reservations may be changed at will. However, if you don’t want to decide ahead of time how big each filesystem needs to be, there is no need to, and logical volumes never need to be resized. Growing or shrinking a filesystem isn’t just painless, it is irrelevant.

ZFS provides the most robust error checking of any filesystem available. All data and metadata is checksummed (SHA256 is available for the paranoid), and the checksum is validated on every read and write. If it fails and a second copy is available (metadata blocks are replicated even on single disk pools, and data is typically replicated by RAID), the second block is fetched and the corrupted block is replaced. This protects against not just bad disks, but bad controllers and fibre paths. On-disk changes are committed transactionally, so although traditional journaling is not used, on-disk state is always valid. There is no ZFS fsck program. ZFS pools may be scrubbed for errors (logical and checksum) without unmounting them.

The copy-on-write nature of ZFS provides for nearly free snapshot and clone functionality. Snapshotting a filesystem creates a point in time image of that filesystem, mounted on a dot directory in the filesystem’s root. Any number of different snapshots may be mounted, and no separate logical volume is needed, as would be for LVM style snapshots. Unless disk space becomes tight, there is no reason not to keep your snapshots forever. A clone is essentially a writable snapshot and may be mounted anywhere. Thus, multiple filesystems may be created based on the same dataset and may then diverge from the base. This is useful for creating a dozen virtual machines in a second or two from an image. Each new VM will take up no space at all until it is changed.

These are just a few interesting features of ZFS. ZFS is not a perfect replacement for traditional filesystems yet – it lacks per-user quota support and performs differently than the usual UFS profile. But for typical applications, I think it is now the best option. Its administrative features and self-healing capability (especially when its built in RAID is used) are hard to beat.

and for bigadmin on Sun