FreeBSD ZFS: Advanced format (4k) drives and you

Historically, hard drives have had a sector size of 512 bytes. This changed when drives became large enough that such a small sector size made the overhead of keeping track of all those sectors consume too much storage space, making hard drives more expensive to produce than strictly necessary. Many modern drives are tagged as “advanced format” drives; right now, this means they have a sector size of 4096 bytes (4KiB). This includes most if not all SSDs, and most 2TB+ magnetic drives.

If you create a partition on such a drive without ensuring the partition begins on a physical sector boundary, the device firmware has to do some “magic” which takes more time than not doing the magic in the first place, resulting in reduced performance. It is therefore important to make sure you align partitions correctly on these devices. I generally align partitions to the 1MiB mark for the sake of being future proof: even though my current drives have 512B and 4KiB sector sizes, I don’t want to run into problems when larger sector sizes are introduced.
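
You can double-check the alignment with gpart, which reports partition offsets in logical sectors. A quick sketch, assuming a drive named ada0 with 512-byte logical sectors; the numbers will differ on other hardware:

# Show the partition table; the first column is the start offset in logical sectors.
# A 1MiB-aligned partition on 512-byte logical sectors starts at sector 2048,
# since 2048 * 512B = 1MiB.
gpart show ada0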

Although ZFS can use entire devices without partitioning, I use GPT to partition and label my drives. My labels generally reference the drive’s physical location in the server; for example, Bay1.2 means the drive is located in bay one, slot two. This makes it much easier to figure out which drive to replace when the need arises.
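
Once the partitions shown further down are created, the labels are visible both through gpart and as device nodes under /dev/gpt, which is what I point ZFS at when building the pool. A small sketch, again assuming a drive named ada0:

# Show the partition table with labels instead of partition types
gpart show -l ada0
# The labels also show up as device nodes
ls /dev/gpt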

The Problem

ZFS is smart enough to query the underlying device to see how large its sectors are, and uses this information to determine the size of its dynamic-width stripes. This is all fine and dandy as long as the hardware isn’t lying. Sadly, hardware currently lies more often than not. My drives claim to have a logical sector size of 512 bytes (ashift=9, because 2^9=512) while the physical sectors are 4KiB (ashift=12). As such, ZFS will make stripes aligned to 512 bytes. This means stripes will almost always be misaligned, forcing the underlying device to work its magic, which in turn degrades performance.
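
You can see what a drive actually reports with diskinfo. In this sketch I assume the drive is ada0; sectorsize is the logical sector size, while stripesize is the physical sector size on drives that bother to report it (it may show 0 or 512 on drives that lie outright):

# Query the drive for its reported sizes
diskinfo -v ada0
# sectorsize -> logical sector size (512 on my drives)
# stripesize -> physical sector size (4096 on advanced format drives that report it)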

ZFS does not currently seem to have any way of manually configuring the underlying block size, which means we’ll have to apply a workaround if the drives are lying.

The Workaround

Update 2014-11-23: As of FreeBSD 10.1-RELEASE, there is a new sysctl to force the ashift value of new vdevs:
The zfs(8) filesystem has been updated to allow tuning the minimum “ashift” value when creating new top-level virtual devices (vdevs). To set the minimum ashift value, for example when creating a zpool(8) on “Advanced Format” drives, set the vfs.zfs.min_auto_ashift sysctl(8) accordingly. [r266122]

Example:

# Enforce an ashift of at least 12, meaning at least 4KiB blocks
sysctl vfs.zfs.min_auto_ashift=12
# Create your zpool, or add new vdevs, as you normally would.
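
If you want the setting to survive reboots, it can also go in /etc/sysctl.conf (only really relevant if you create pools or add vdevs after a reboot):

# /etc/sysctl.conf
vfs.zfs.min_auto_ashift=12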

Original entry continues below.

On FreeBSD, you have to create a virtual device which reports the physical sector size to ZFS. The following is exactly how I set up the pool on my prototyping server named Lou, including partitioning. My drives have a 4KiB physical sector size. Your mileage may vary!

# Create the gpt structure on the drives.
# Data drives:
gpart create -s gpt ada0
gpart create -s gpt ada1
gpart create -s gpt ada2

# Create partitions on data drives
gpart add -a 1m -t freebsd-zfs -l Bay1.1 ada0
gpart add -a 1m -t freebsd-zfs -l Bay1.2 ada1
gpart add -a 1m -t freebsd-zfs -l Bay1.3 ada2

# Create virtual devices which define 4K sector size
gnop create -S 4k gpt/Bay1.1
gnop create -S 4k gpt/Bay1.2
gnop create -S 4k gpt/Bay1.3

# Create the pool and define some general settings:
zpool create LouTank raidz /dev/gpt/Bay1.1.nop /dev/gpt/Bay1.2.nop /dev/gpt/Bay1.3.nop
zfs set atime=off LouTank
zfs set checksum=fletcher4 LouTank

# Export pool and remove virtual devices
zpool export LouTank
gnop destroy gpt/Bay1.1.nop
gnop destroy gpt/Bay1.2.nop
gnop destroy gpt/Bay1.3.nop
# Import pool. Tell zpool to look for devices in /dev/gpt, in order to keep labels.
zpool import -d /dev/gpt LouTank
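
To verify that the trick worked, you can check that the pool came back referencing the GPT labels and that the vdev ended up with ashift=12. A quick sketch, assuming the pool is registered in the default zpool.cache:

# The pool should list gpt/Bay1.1, gpt/Bay1.2 and gpt/Bay1.3, not the .nop devices
zpool status LouTank
# The cached vdev configuration should report ashift: 12
zdb -C LouTank | grep ashift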

7 thoughts on “FreeBSD ZFS: Advanced format (4k) drives and you”

    • If the advanced format drives suffer a performance penalty at a 512B block size (ashift=9), they will drag down the performance of any vdev formatted that way. If that’s a real setup, I’d recommend testing whether your particular drives suffer from this problem before rebuilding the vdev.

      However, an HDD which is not advanced format and actually has 512B sectors won’t suffer a performance penalty from 4K formatting (ashift=12), because 4KiB is a multiple of 512 bytes.

      Do note that a larger block size may cause more ‘dead’ space if you have a lot of files which are smaller than said block size.

  1. Hey, question: I’m a neophyte FreeBSD user (on PCBSD, actually); I’m an RHCE looking at BSD for ZFS and to learn something new.

    I set up an 8x3TB raidz2 (6 drives’ worth of usable space) a few months ago, and just tore the whole thing down while upgrading to 9.1-RC3. FreeBSD 9.1-RC3 was talking about “automatic 4K blocksize enabled” and I was trying to figure out exactly what that did when I found your site.

    Reading your comments above: one thing I liked (I’ll have to find my notes on where it originated) was that, as well as aligning the slice on a 1M boundary, this guy _also_ specified a slice smaller than the actual physical disk. The rationale was that not all drives specified as xGB are truly xGB; some are actually slightly smaller by a few sectors. So *IF* you manage to replace a drive with a “same size but smaller” drive, you’re busted.

    He created a 1M-aligned slice of not-quite-the-entire disk, with the expectation that a physical swap would then have a good chance of actually being successful. I’m just wondering why you didn’t take this approach: is it that your prototyping server is easily scratchable? Or that you don’t really expect to have drive failures?

    I’ve got my drives position-labeled too, which makes it easy. But I was also wondering why you deleted the gnop virtual entries after creating the pool. Why not leave them around? Do they get in the way somehow?

    Also, guess I’m evil: I *LIKE* atime. Don’t care that it slows things down, I *want* to know activity times, even if it’s just me! :-)

    Thanks!

    • Hello, and thank you for your comment.

      To be honest, I had not considered that drives sold at the same nominal capacity could differ slightly in actual size. It’s not a problem I have encountered yet, but I’ll definitely do some research on it; a rough sketch of the idea follows at the end of this reply.

      As for the .nop devices, they are automatically destroyed on reboot. That is why I destroyed them before importing the pool: it makes sure the pool is imported using the GPT labels rather than the .nop devices, and avoids the pool ending up populated with the unlabelled device nodes from /dev after the next reboot.
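
      For anyone who wants to try the undersized-slice idea, the gpart step would simply gain an explicit size. This is an untested sketch with an example size only; on a nominal 3TB drive (roughly 2794GiB) something like 2750G leaves a comfortable margin:

      # Align to 1MiB and deliberately stop short of the end of the disk,
      # so a replacement drive that is a few sectors smaller can still hold the partition
      gpart add -a 1m -s 2750G -t freebsd-zfs -l Bay1.1 ada0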
