  • Being poor sucks.

    No kidding. This didn’t ruin me but it was painful. And on top of that, I got smaller disks for significantly more than I paid last year.

    I bought them for years ahead. My primary pool is gonna start having drives fail from age and use, and it’ll need replacements. So I’m gonna be sticking these in as they fail over time so I use all the service life I can get from the existing (8TB) drives.

  • B350 isn’t a very fast chipset to begin with

    For sure.

    I’m willing to bet the CPU in such a motherboard isn’t exactly current-gen either.

    Reasonable bet, but it’s a Ryzen 9 5950X with 64GB of RAM. I’m pretty proud of how far I’ve managed to stretch this board. 😆 At this point I’m waiting for blown caps, but the case temp is pretty low so it may end up trucking along for a surprisingly long time.

    Are you sure you’re even running at PCIe 3.0 speeds too?

    So given the CPU, it should be PCIe 3.0, but that doesn’t remove any of the queues/scheduling suspicions for the chipset.

    I’m now replicating data out of this pool and the read load looks perfectly balanced. Bandwidth’s fine too. I think I have no choice but to benchmark the disks individually outside of ZFS once I’m done with this operation in order to figure out whether any show problems. If not, they’ll go in the spares bin.
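    For the individual disk check, here is a minimal sketch of a read-only sequential benchmark. The function name and defaults are made up for illustration, and it doesn't bypass the page cache, so a repeated run over the same range may report inflated numbers.

```python
import os
import time


def sequential_read(path, block_size=1 << 20, max_bytes=1 << 30):
    """Read up to max_bytes from path in block_size chunks.

    Returns (bytes_read, megabytes_per_second). Read-only, so it is
    safe to point at a raw block device (needs read permission).
    """
    fd = os.open(path, os.O_RDONLY)
    total = 0
    start = time.perf_counter()
    try:
        while total < max_bytes:
            chunk = os.read(fd, min(block_size, max_bytes - total))
            if not chunk:  # end of file or device
                break
            total += len(chunk)
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    rate = total / elapsed / 1e6 if elapsed > 0 else 0.0
    return total, rate
```

    Usage would be something like `sequential_read("/dev/sdX")`, where `/dev/sdX` is a placeholder for the disk under test. Running it against each disk in turn, on the same port, should show whether one of them is slower on its own.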

  • I put the low IOPS disk in a good USB 3 enclosure, hooked to an on-CPU USB controller. Now things are flipped:

                                            capacity     operations     bandwidth 
    pool                                  alloc   free   read  write   read  write
    ------------------------------------  -----  -----  -----  -----  -----  -----
    storage-volume-backup                 12.6T  3.74T      0    563      0   293M
      mirror-0                            12.6T  3.74T      0    563      0   293M
        wwn-0x5000c500e8736faf                -      -      0    406      0   146M
        wwn-0x5000c500e8737337                -      -      0    156      0   146M
    

    You might be right about the link problem.

    Looking at the B350 diagram, the whole chipset is hooked to the CPU via a PCIe 3.0 x4 link. The other pool (the source) is hooked via a USB controller on the chipset. The SATA controller is also on the chipset, so it shares the chipset-CPU link too. I’m pretty sure I’m also using all the PCIe links the chipset provides for SSDs. So that’s about 4GB/s total for the whole chipset. Now I’m probably not saturating the whole link in this particular workload, but perhaps there’s another related bottleneck.
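    The 4GB/s figure can be checked with some quick arithmetic. This is a sketch that counts only the PCIe 3.0 line rate (8 GT/s per lane) and its 128b/130b encoding. Packet overhead lowers the usable number further, and I'm assuming the `M` in zpool iostat means MiB.

```python
def pcie_bytes_per_s(gt_per_s, lanes, payload_bits, total_bits):
    """Raw link bandwidth after line encoding, before protocol overhead."""
    return gt_per_s * 1e9 * lanes * payload_bits / total_bits / 8


# PCIe 3.0 x4: 8 GT/s per lane, 128b/130b encoding
uplink = pcie_bytes_per_s(8, 4, 128, 130)
print(f"chipset uplink: {uplink / 1e9:.2f} GB/s")  # about 3.94 GB/s

# Mirror write bandwidth from the table above (293M)
mirror_writes = 293 * 2**20
print(f"share of uplink: {mirror_writes / uplink:.1%}")  # roughly 8%
```

    So the mirror writes alone are nowhere near the raw link capacity, which fits the suspicion that any bottleneck is somewhere other than plain bandwidth.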

  • Turns out the on-CPU SATA controller isn’t available when the NVMe slot is used. 🫢 Swapped SATA ports, no diff. Put the low IOPS disk in a good USB 3 enclosure, hooked to an on-CPU USB controller. Now things are flipped:

                                            capacity     operations     bandwidth 
    pool                                  alloc   free   read  write   read  write
    ------------------------------------  -----  -----  -----  -----  -----  -----
    storage-volume-backup                 12.6T  3.74T      0    563      0   293M
      mirror-0                            12.6T  3.74T      0    563      0   293M
        wwn-0x5000c500e8736faf                -      -      0    406      0   146M
        wwn-0x5000c500e8737337                -      -      0    156      0   146M
    
  • Interesting. SMART looks pristine on both drives. Brand new drives - Exos X22. Doesn’t mean there isn’t an impending problem of course. I might try shuffling the links to see if that changes the behaviour, as the other comment suggested. Both are currently hooked to an AMD B350 chipset SATA controller. There are two ports that should be hooked to the on-CPU SATA controller. I imagine the two SATA controllers don’t share bandwidth. I’ll try putting one disk on the on-CPU controller.

I’m syncoiding from my normal RAIDz2 to a backup mirror made of 2 disks. I looked at zpool iostat and I noticed that one of the disks consistently shows less than half the write IOPS of the other:

                                        capacity     operations     bandwidth 
pool                                  alloc   free   read  write   read  write
------------------------------------  -----  -----  -----  -----  -----  -----
storage-volume-backup                 5.03T  11.3T      0    867      0   330M
  mirror-0                            5.03T  11.3T      0    867      0   330M
    wwn-0x5000c500e8736faf                -      -      0    212      0   164M
    wwn-0x5000c500e8737337                -      -      0    654      0   165M

This is also evident in iostat:

     f/s f_await  aqu-sz  %util Device
    0.00    0.00    3.48  46.2% sda
    0.00    0.00    8.10  99.7% sdb

The difference is also evident in the temperatures of the disks. The busier disk is 4 degrees warmer than the other. The disks are identical on paper and bought at the same time.

Is this behaviour expected?
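
One way to read the zpool iostat figures above: both disks take the same bandwidth, so the one with fewer write operations must be receiving larger writes on average. Here's a rough sketch of that arithmetic, assuming the `M` suffix means MiB and keeping in mind that the displayed values are rounded.

```python
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def parse_size(text):
    """Turn a zpool-style size such as '164M' into bytes."""
    text = text.strip()
    if text and text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def avg_write_kib(write_ops, write_bw):
    """Average size of one write in KiB, from ops/s and bandwidth/s."""
    ops = float(write_ops)
    return parse_size(write_bw) / ops / 1024 if ops else 0.0


# Write ops and write bandwidth copied from the table above
rows = {
    "wwn-0x5000c500e8736faf": ("212", "164M"),
    "wwn-0x5000c500e8737337": ("654", "165M"),
}
for name, (ops, bw) in rows.items():
    print(f"{name}: {avg_write_kib(ops, bw):.0f} KiB per write")
```

That works out to roughly 790 KiB per write on the first disk versus roughly 260 KiB on the second. It doesn't say why the request sizes differ, only that the same data arrives in differently sized chunks.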

    • Lenovo ThinkCentre / Dell OptiPlex USFF machine like the M710q.
    • Secondary NVMe or SATA SSD for a RAID1 mirror
      • Use LVMRAID for this. It uses mdraid underneath but it’s easier to manage
    • External USB disks for storage
      • WD Elements generally work well when well ventilated
      • OWC Mercury Elite Pro Quad has a very well implemented USB path and has been problem-free in my testing
    • Debian / Ubuntu LTS
    • ZFS for the disk storage
    • Backups may require a second copy or similar of this setup so keep that in mind when thinking about the storage space and cost


I built a 5x 16TB RAIDz2, filled it with data, then I discovered the following.

Sequentially reading a single file from the file system gave me around 40MB/s. Reading multiple files in parallel brought the total throughput into the hundreds of megabytes per second, which is where I’d expect it. This is really weird. The 5 disks show 100% utilization during single file reads. Writes are supremely fast, whether single threaded or parallel. Reading directly from each disk gives >200MB/s.

Splitting the RAIDz2 into two RAIDz1s, or into one RAIDz1 and a mirror, improved reads to 100 and something MB/s. Better but still not where it should be.

I have an existing RAIDz1 made of 4x 8TB disks on the same machine. That one reads at 250-350MB/s. I made an equivalent 4x 16TB RAIDz1 from the new drives and that read at about 100MB/s. Much slower.

All of this was done with ashift=12 and default recordsize. The disks’ datasheets say their block size is 4096.

I decided to try RAIDz2 with ashift=13 even though the disks really say they’ve got 4K physical block size. Lo and behold, the single file reads went to over 150MB/s. 🤔

Following from there, I got full throughput when I increased the recordsize to 1M. This produces full throughput even with ashift=12. My existing 4x 8TB RAIDz1 pools with ashift=12 and recordsize=128K read single files fast.
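
To put numbers on the recordsize effect, here's a simplified sketch of how a single full record gets spread over a RAIDz vdev. It follows the usual RAIDz sizing arithmetic (data sectors, parity per row, padding to a multiple of parity + 1) and ignores compression and partial records. The function name is mine, not something from ZFS.

```python
import math


def raidz_layout(record_bytes, disks, parity, ashift):
    """Approximate on-disk layout of one uncompressed record on RAIDz."""
    sector = 1 << ashift
    data_disks = disks - parity
    data_sectors = math.ceil(record_bytes / sector)
    # Each row holds one sector per data disk, plus `parity` parity sectors
    rows = math.ceil(data_sectors / data_disks)
    total = data_sectors + rows * parity
    # Allocations are padded up to a multiple of (parity + 1) sectors
    total = math.ceil(total / (parity + 1)) * (parity + 1)
    return {
        "per_disk_bytes": rows * sector,  # largest chunk on any one disk
        "allocated_bytes": total * sector,
    }


for recordsize, ashift in [(128 * 1024, 12), (128 * 1024, 13), (1024 * 1024, 12)]:
    layout = raidz_layout(recordsize, disks=5, parity=2, ashift=ashift)
    print(recordsize // 1024, ashift, layout["per_disk_bytes"] // 1024, "KiB per disk")
```

Under these assumptions a 128K record on the 5-disk RAIDz2 puts at most 44 KiB on each disk with ashift=12 and 48 KiB with ashift=13, while a 1M record puts 344 KiB on each disk. That fits the improvement with recordsize=1M, but it doesn't explain the ashift=13 result or why the older 8TB pool is fast with the same settings.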

Here’s a diff of the queue dump of the old and new drives. The left side is a WD 8TB from the existing RAIDz1, the right side is one of the new HC550 16TB drives:

< max_hw_sectors_kb: 1024
---
> max_hw_sectors_kb: 512
20c20
< max_sectors_kb: 1024
---
> max_sectors_kb: 512
25c25
< nr_requests: 2
---
> nr_requests: 60
36c36
< write_cache: write through
---
> write_cache: write back
38c38
< write_zeroes_max_bytes: 0
---
> write_zeroes_max_bytes: 33550336

Could the max_*_sectors_kb being half on the new drives have something to do with it?


Can anyone make any sense of any of this?
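
For anyone wanting to reproduce the comparison, here's a small sketch that collects the block queue attributes of two disks and prints the ones that differ. It assumes the dump came from `/sys/block/<dev>/queue`, and the device names in the usage note are placeholders.

```python
from pathlib import Path


def read_queue_attrs(queue_dir):
    """Read every readable attribute file in a queue directory."""
    attrs = {}
    for entry in sorted(Path(queue_dir).iterdir()):
        if not entry.is_file():
            continue
        try:
            attrs[entry.name] = entry.read_text().strip()
        except OSError:
            continue  # skip attributes that cannot be read
    return attrs


def diff_attrs(left, right):
    """Map each differing attribute name to its (left, right) values."""
    return {
        name: (left.get(name), right.get(name))
        for name in sorted(set(left) | set(right))
        if left.get(name) != right.get(name)
    }
```

Usage would look like `diff_attrs(read_queue_attrs("/sys/block/sda/queue"), read_queue_attrs("/sys/block/sdb/queue"))`. Note that `max_sectors_kb` is tunable up to `max_hw_sectors_kb`, so the new drives are already at their hardware limit of 512.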

  • I think I’ve seen this hypothesis too and it makes sense to me.

    If I were building a new AMD system today, I’d look for a board that exposes more of the chipset-provided USB ports. Otherwise I’d budget for a high quality 4-port PCIe USB controller if I were planning to rely a lot on USB on that system.

  • This article provides some context. Now I do have the latest firmware, which should have these fixes, but they don’t seem to be foolproof. I’ve seen reports around the web that the firmware improves things but doesn’t completely eliminate the problems.

    If you’ve seen devices disconnecting and reconnecting on occasion, it could be it.