Updated October 19, 2022.
Today’s question comes from Jeff….
Q. What drives should I buy for my ZFS server?
Answer: Here’s what I recommend, considering a balance of cost per TB, performance, and reliability. I prefer NAS- or enterprise-class drives since they are designed to run 24/7 and are also better at tolerating vibration from neighboring drives. I prefer SATA, but SAS drives are better suited to some designs (especially when using expanders).
For a ZFS home or small business TrueNAS, FreeNAS, or Proxmox storage server, I think these are the best options, and I’ve also included some enterprise-class drives.
Seagate IronWolf – 4TB, 6TB, 8TB, 10TB, 12TB, 14TB, 16TB, 18TB, 20TB drives
Seagate earned a bad reputation because of high failure rates on their 2TB drives in the past. But that was the past; these days WD is the brand pulling shenanigans. Seagate’s newer offerings are more reliable, and given the decent warranties and competitive prices they’re worth another look. Seagate has three product lines suitable for ZFS:
Seagate IronWolf NAS (up to 16TB) are NAS class drives targeted at smaller deployments.
- 5900-7200RPM
- Supported in up to 8 drive bays
- Workload: 180TB/year
- 3-year warranty
Seagate IronWolf Pro (up to 20TB)
- 7200RPM
- Supported in configurations up to 16-24 bays
- Workload: 300TB/year
- 5-year warranty
Seagate Exos X16-X20 (up to 20TB) is the data center offering designed for enterprise workloads. Currently, these have a great price per TB.
- 7200RPM
- Supports unlimited bays
- Workload: 550TB/year
- 5-year warranty
Toshiba Drives (4TB, 6TB, 8TB, 10TB, 12TB, 14TB)
Toshiba N300 (up to 14TB) is their NAS-class offering.
- 7200RPM
- Supports up to 8 bays
- Workload: 180TB/year
- 3-year warranty
Toshiba MG Series (up to 14TB) is their enterprise high-capacity offering.
- 7200RPM
- Supports unlimited bays
- Workload: 550TB/year
- 5-year warranty
Western Digital (up to 18TB)
Some of the most popular NAS-class drives on the market today are made by Western Digital. They bought out one of the most respected drive companies, HGST. The four product lines are:
Update: 2020-10-27. Warning: Western Digital silently switched some WD Reds to SMR (Shingled Magnetic Recording). Doing this on a drive designed for a NAS is awful; an unsuspecting customer could easily lose data due to the long rebuilds. SMR essentially means the tracks are packed so tightly that writes require an erase/rewrite cycle of data stored nearby, which significantly slows down writes. This is bad for RAID (imagine trying to rebuild a 14TB SMR drive!) and especially bad for copy-on-write filesystems like ZFS. When you buy a WD Red there is some confusion about whether you’re getting a CMR or SMR drive while stores work through inventory. WD Reds 8TB and above are still CMR (as far as I know) so should be okay. Just make sure the seller specifically says it’s a CMR drive, especially on the smaller capacities. The WD Red Plus and Pro lines are still good.
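If you’re not sure which model you actually received, smartctl will report the exact model number, which you can then check against WD’s published CMR/SMR model lists (the device path below is just an example; adjust it for your system):
# smartctl -i -d sat /dev/da3 | grep -E 'Device Model|Rotation Rate'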
WD Red (DO NOT USE, newer models are SMR, you want CMR). Use the WD Red Plus or Pro models.
WD Red Plus (up to 14TB) is designed for NAS systems with up to 8 bays.
- 7200RPM
- Supported in up to 8 drive bays
- Workload: 180TB/Year
- 3-year warranty
WD Red Pro (up to 18TB) is designed for larger deployments, suitable for small/medium businesses.
- 7200RPM
- Supported in up to 24 drive bays
- Workload: 300TB/year
- 5-year warranty
WD Gold (up to 18TB) is an enterprise-class drive. It’s only available in SATA and has a simpler lineup than the Ultrastar.
- 7200RPM
- Unlimited drive bays
- Workload: 550TB/year
- 5-year warranty
WD Ultrastar (up to 14TB) is their datacenter-class line designed for heavy workloads (Ultrastar was going to replace WD Gold, but WD changed its mind and now has two slightly different enterprise product lines).
- 7200RPM
- Supported in unlimited drive bays
- Workload: 550TB/year
- 5-year warranty
Buying Tips:
- When reading reviews, I discount negative reviews about DOAs or drives that fail within the first few days; you can return those rather quickly. What you want to avoid is a drive that fails a year or two in, leaving you with the hassle of a warranty claim.
- Higher-RPM drives have lower rotational latency, and higher-capacity (higher-density) drives typically deliver better sequential throughput.
- Gone are the days when you needed a 24-bay server for large amounts of storage. It’s far simpler to get a 4-bay chassis with 16TB drives, giving you 64TB raw. If you don’t need more capacity or IOPS, keep it simple.
Or buy a TrueNAS Storage Server from iXsystems
I’m cheap and tend to go with a DIY approach most of the time, but when I’m recommending ZFS systems in environments where availability is important, I like the TrueNAS servers from iXsystems, which of course come with drives in well-tested configurations. The prices on a TrueNAS are very reasonable compared to other storage systems and barely more than a DIY setup. And of course, for a small server, you can grab a 4- or 5-bay TrueNAS Mini.
Careful with “archival” drives
If you don’t get one of the drives above, be aware that some larger hard drives use SMR (Shingled Magnetic Recording), which should not be used with ZFS if you care about performance (at least until proper SMR support is developed). Be careful about any drive that says it’s for archiving purposes. You always want CMR (Conventional Magnetic Recording).
The ZIL / SLOG and L2ARC
The ZFS Intent Log (ZIL) should be on an SSD with power-loss protection (capacitors that can flush the cache in case of a power failure). I have done quite a bit of testing and like the Intel DC SSD series drives and also HGST’s S840Z. These are rated to have their data overwritten many times and will not lose data on power loss. They run on the expensive side, so for a home setup I typically try to find them used on eBay. From a ZIL perspective there’s no reason to get a large drive, but keep in mind that larger drives generally perform better. At home I use 100GB DC S3700s and they do just fine.
I generally don’t use an L2ARC (SSD read cache) and instead opt to add more memory. There are a few cases where an L2ARC makes sense, such as when you have very large working sets.
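For reference, attaching a SLOG and an L2ARC to an existing pool looks roughly like this (pool and device names are placeholders; adjust for your hardware):
Add a mirrored SLOG:
# zpool add tank log mirror ada1 ada2
Add an L2ARC cache device (cache devices can’t be mirrored, and don’t need to be since losing one is harmless):
# zpool add tank cache ada3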
For SLOG and L2ARC see my comparison of SSDs.
Capacity Planning for Failure
Most drives running 24/7 start having a high failure rate after 5 years; you might be able to squeeze 6 or 7 years out of them if you’re lucky. So a good rule of thumb is to estimate your growth and buy drives big enough that you will only start to outgrow them after 5+ years. The price of hard drives is always dropping, so you don’t really want to buy much more than you’ll need before they start failing. Also consider that with ZFS you shouldn’t run more than 70% full (with 80% being the max) for typical NAS applications, including VMs on NFS. If you’re planning to use iSCSI, you shouldn’t run more than 50% full.
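To keep an eye on how full a pool is getting, zpool and zfs can report allocation and capacity directly (the pool name “tank” is a placeholder):
# zpool list -o name,size,allocated,free,capacity,fragmentation tank
# zfs list -o space tank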
ZFS Drive Configurations
My preference at home is almost always RAID-Z2 (RAID-6) with 6 to 8 drives, which provides a storage efficiency of 0.66 to 0.75. This scales pretty well as far as capacity is concerned, and with double parity I’m not that concerned if a drive fails. For larger setups, use multiple RAID-Z2 vdevs; e.g., with 60 bays use 10 six-drive RAID-Z2 vdevs (each vdev increases IOPS). For smaller setups I run 4 to 5 drives in RAID-Z (RAID-5). In all cases it’s essential to have backups… and I’d rather have two smaller RAID-Z servers replicating to each other than one server with RAID-Z2. The nice thing about smaller setups is that the cost of upgrading 4 drives isn’t as bad as 6 or 8! For enterprise setups, especially VMs and databases, I like ZFS mirrored pairs (RAID-10) for fast rebuild times and performance, at a storage efficiency of 0.50.
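For reference, those layouts look something like this at pool creation time (pool and device names are placeholders):
A single 6-drive RAID-Z2 vdev:
# zpool create tank raidz2 da0 da1 da2 da3 da4 da5
Striped mirrored pairs (RAID-10 style):
# zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5
Larger pools are just more vdevs added to the same pool:
# zpool add tank raidz2 da6 da7 da8 da9 da10 da11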
Enabling CCTL/TLER on Desktop Drives
Time-Limited Error Recovery (TLER) or Command Completion Time Limit (CCTL).
Do not do this! But if you must run desktop drives… Desktop-class drives such as the HGST Deskstar aren’t typically run in RAID, so by default they are configured to take as long as needed (sometimes several minutes) to try to recover a bad sector. That’s what you’d want on a desktop; however, performance grinds to a halt during this time, which can cause your ZFS server to hang for several minutes waiting on recovery. If you already have ZFS redundancy, it’s a pretty low risk to tell the drive to give up after a few seconds and let ZFS rebuild the data.
The basic rule of thumb: if you’re running RAID-Z you can only tolerate one drive failure, so I’d be a little cautious about enabling TLER. If you’re running RAID-Z2 or RAID-Z3 you have double or triple parity, so there’s very little risk in enabling it.
Viewing the TLER setting:
# smartctl -l scterc -d sat /dev/da8
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled
Enabling TLER
# smartctl -l scterc,70,70 -d sat /dev/da8
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control set to:
           Read: 70 (7.0 seconds)
          Write: 70 (7.0 seconds)
Disabling TLER
(TLER should always be disabled if you have no redundancy).
# smartctl -l scterc,0,0 -d sat /dev/da8
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control set to:
           Read: Disabled
          Write: Disabled
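Note that on many drives the SCT ERC setting does not survive a power cycle, so if you rely on it you may want to reapply it at boot, for example with a small loop (adjust the device glob for your system):
# for d in /dev/da[0-7]; do smartctl -l scterc,70,70 -d sat $d; done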
Is there a performance benefit to sticking with 512n (native, not emulated, sector size) disks?
Yes, I know you can do 4K alignment with “ashift”, but I believe you will incur less overhead for small updates with 512n; I’ve never benchmarked this assumption, though.
In the past I’ve always tried to use 512n when possible; recently, because of this, the 4TB 512n ST4000NM0033 (128MB cache, SATA) has been my go-to large drive, at about $56 to $75 per TB.
Hi, Jon. I haven’t tested native vs emulated sector disks. I am curious so if you come across any benchmarks let me know.
I have a 6-disk RAID-Z2 of 2TB Seagate Barracudas in my main pool and I haven’t been very happy with them. Now, these are not enterprise grade because I didn’t know any better when I first built my pool. Last year I had two fail out of warranty. A couple of weeks ago another one failed; this time I got smart and replaced it with an HGST. Of the three I haven’t replaced, two are reporting SMART issues and the other is corrupting data (ZFS keeps reporting read errors, and on every scrub I see it being resilvered). That’s a 100% failure rate within 4 years. So I went ahead and ordered 5 more HGST 7K4000s to replace them all and hopefully be done with it.
The 4TB Seagates are a lot more reliable than the 2TB models. I feel like if Seagate wants more of my business they should be giving me a partial credit on those 6 drives since they didn’t last very long. Last week I sat down to write a letter to Seagate to see if they’d be interested in earning back my business by giving me a partial credit or a couple of free drives, but they don’t have a mailing address listed on their website.
Ben,
I always try to buy drives with a five (5) year warranty and keep a spare lying around (even if I’m burning the warranty). For that matter, I always buy computers and servers in threes just so I have some swap-the-part capability. Sorry about your experience with Seagate.
I too was less than enthused about ZFS RAID (I stopped using it years ago and went to mirrors). At first it seemed to perform like “magic”, but after the pools got to 70% full or above they really started to crawl (due to inherent COW fragmentation). The lousy thing is you can’t really get rid of it, as there is no (and probably never will be) block pointer rewrite (BPR). Once I had the pain of dealing with a Sun X4500 “thumper” with two 10-disk ZFS RAIDs which became fragmented and useless; I had to move the data off the chassis, rebuild smaller pools, and then move the data back (and that took a long, long time). That was the final straw that caused me to switch to mirrored pairs, and I keep them relatively empty. If I want to refresh performance, I copy the data to a freshly minted mirrored pair. In general I’m pretty happy with this arrangement.
As to benchmarks, I did a little more digging (no 512 vs. 4K sector benchmarks by me):
This is not my benchmark, but it’s relevant for your RAID-Z2 use case: it seems you gain some space with ashift=9 BUT lose 1/3 of your write performance, at least in this test (I always use mirrors): http://louwrentius.com/zfs-performance-and-capacity-impact-of-ashift9-on-4k-sector-drives.html
This also is not my benchmark (albeit for Windows); the big issue here is misalignment: the unaligned setup results in a performance impact of up to 50%, depending on the workload. However, workloads that do not involve write operations, such as the web server test pattern, don’t show any disadvantage at all in I/O testing.
The PCMark Vantage application test shows decreased performance for the new drive with its 4KB sector size in popular Windows scenarios. … However, we cannot control the way data writes are actually executed and organized. In the case of PCMark Vantage, the benchmark was never tweaked to minimize the number of smallest-size write requests in favor of larger chunks of data (which is something ZFS does). http://www.tomshardware.com/reviews/advanced-format-4k-sector-size-hard-drive,2759.html
FYI, in ZFS RAID scenarios (which I stopped using years ago): if 4K is the minimum block of data that can be written or read, data blocks smaller than 4K will be padded out to 4K. This article says that this hurts the most when parity blocks have to be created for small chunks of data. http://www.docs.cloudbyte.com/knowledge-base-articles/implications-of-using-4k-sector-size-zfs-pools/
a) So as a rule of thumb, the ZFS record size should be a multiple of (number of data disks in the RAID-Z × sector size).
b) For example for 4+1, the record size should be 16K (4 x 4096) and for 2+1 it should be 8K (2 x 4096).
c) CloudByte recommends 32K as the record size for the least space overhead.
BTW a nice article/tutorial about forcing the ashift (9 or 12) value http://blog.delphix.com/gwilson/2012/11/15/4k-sectors-and-zfs/
Summary: seems like maybe I should bite the bullet, switch to 4K drives, and always make sure I use ashift=12 in my new rigs, as performance seems to be similar as long as the drives are aligned.
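For reference, forcing ashift=12 at pool creation looks roughly like this on OpenZFS-based systems (pool and device names are placeholders):
# zpool create -o ashift=12 tank mirror da0 da1
On FreeBSD/FreeNAS the vfs.zfs.min_auto_ashift sysctl achieves the same thing before the pool is created:
# sysctl vfs.zfs.min_auto_ashift=12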
Well, you’re wise to get a 5-year warranty. Mine were bought shortly after the Thailand flooding so that may have played a role in the shorter lifespan.
You’ve done quite a bit of research there. For heavy I/O I agree that mirrors are the way to go, mostly for the extra IOPS. At home I use a 6-drive RAID-Z2 and my performance is fine for my needs… I have a few VMware VMs on NFS, but the ARC and SLOG are more than enough to run them on RAID-Z2, and the bulk of my data is movies and pictures.
I think those articles on record size might be a little dated. Nowadays everyone runs with compression=on (which for FreeNAS and OmniOS means LZ4), which changes the theory on record sizes and on keeping the number of data disks a power of 2 for RAID-Z/Z2/Z3. Here are some newer articles you may find interesting:
Matt makes the point that you don’t need to worry about the record size being a multiple of the number of data disks: http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ With LZ4 compression (which at worst doesn’t hurt and at best saves space and increases performance) it doesn’t really matter much, since most of the time the data will compress into a smaller record anyway.
Max Bruning has an interesting article about how parity works with RAID-Z. https://www.joyent.com/blog/zfs-raidz-striping He states that in the case of a small write, RAID-Z will only put the data on enough disks to get the required redundancy. So if you have 4+1 RAID-Z and write a block less than 4K (assuming ashift=12), RAID-Z will place the data on only 2 disks (effectively mirroring the data) instead of wasting space on all 5 drives.
So correct me if I’m interpreting this wrong, but I think the best record-size from a storage efficiency standpoint is the largest possible (1M). Smaller writes are going to use the smallest record-size they can anyway. Using a large record-size will help throughput but could come at the cost of IOPS, so there may be performance reasons to force a smaller record-size.
I talked to somebody from iXsystems… and apparently they are using 4K native sector drives, specifically HGST; they told me these provide the best performance.
Also… I had some major issues with recent WD Red drives… they would time out and ABRT block read commands randomly under high usage, so I switched to HGST NAS drives for now. But HGST also has the He series drives in 4Kn, and the non-He 4Kn drives are available as well.
Could you provide some examples of ‘archival’ drives? Would you consider WD Red as an archival drive?
The Seagate ST8000AS0002 is a good example of what you want to stay away from. SMR does not play well with CoW filesystems like ZFS.
WD Reds are not archive drives and will work great with ZFS.
Update for 2018? The He6-He12 HGST helium drives are revolutionary: they come in both SAS and SATA interfaces, with longer life, less power, less heat, and better speed than most competing models, at ~$28/TB. Backblaze is reporting market-leading reliability ratings for these. For the ZIL, the Intel Optane 900P is a new winner for budget users: it has no cache, so there’s no need to worry about cache power-loss capacitors, which means writes always go directly to media. Like your posts, bless you and yours.