ZFS has gotten a lot of hype. It has also gotten some derision from Linux folks who are accustomed to getting that hype themselves. ZFS is not a magic bullet, but it is very cool. I like to think that if UFS and ext3 were first generation UNIX filesystems, and VxFS and XFS were second generation, then ZFS is the first third generation UNIX FS.
ZFS is not just a filesystem. It is actually a hybrid filesystem and volume manager. The integration of these two functionalities is a main source of the flexibility of ZFS. It is also, in part, the source of the famous “rampant layering violation” quote which has been repeated so many times. Remember, though, that this is just one developer’s aesthetic opinion. I have never seen a layering violation that actually stopped me from opening a file.
Being a hybrid means that ZFS manages storage differently than traditional solutions. Traditionally, you have a one to one mapping of filesystems to disk partitions, or alternately, you have a one to one mapping of filesystems to logical volumes, each of which is made up of one or more disks. In ZFS, all disks participate in one storage pool. Each ZFS filesystem has the use of all disk drives in the pool, and since filesystems are not mapped to volumes, all space is shared. Space may be reserved, so that one filesystem can’t fill up the whole pool, and reservations may be changed at will. However, if you don’t want to decide ahead of time how big each filesystem needs to be, there is no need to, and logical volumes never need to be resized. Growing or shrinking a filesystem isn’t just painless, it is irrelevant.
ZFS provides the most robust error checking of any filesystem available. All data and metadata is checksummed (SHA256 is available for the paranoid), and the checksum is validated on every read and write. If it fails and a second copy is available (metadata blocks are replicated even on single disk pools, and data is typically replicated by RAID), the second block is fetched and the corrupted block is replaced. This protects against not just bad disks, but bad controllers and fibre paths. On-disk changes are committed transactionally, so although traditional journaling is not used, on-disk state is always valid. There is no ZFS fsck program. ZFS pools may be scrubbed for errors (logical and checksum) without unmounting them.
The copy-on-write nature of ZFS provides for nearly free snapshot and clone functionality. Snapshotting a filesystem creates a point in time image of that filesystem, mounted on a dot directory in the filesystem’s root. Any number of different snapshots may be mounted, and no separate logical volume is needed, as would be for LVM style snapshots. Unless disk space becomes tight, there is no reason not to keep your snapshots forever. A clone is essentially a writable snapshot and may be mounted anywhere. Thus, multiple filesystems may be created based on the same dataset and may then diverge from the base. This is useful for creating a dozen virtual machines in a second or two from an image. Each new VM will take up no space at all until it is changed.
These are just a few interesting features of ZFS. ZFS is not a perfect replacement for traditional filesystems yet – it lacks per-user quota support and performs differently than the usual UFS profile. But for typical applications, I think it is now the best option. Its administrative features and self-healing capability (especially when its built in RAID is used) are hard to beat.
and for bigadmin on Sun