Corrupting a ZFS File on Purpose

(oshogbo.com)

83 points | by zdw 3 days ago

7 comments

  • ralferoo 2 days ago
    Hmmm, it's been a long long time since I actually had a failed drive (and also I don't use zfs), but from what I remember of my last failing drive 20 years ago, the drive was able to detect that sectors had been corrupted, and then failed the read rather than just returning silently corrupted data. If my memory is correct, replacing random bytes on disk wouldn't actually reflect the typical way data corruption manifests itself.

    I always thought that the reason zfs did its extensive CRC checks was primarily to detect data corruption while it was in RAM or over the network, with a side effect that in the rare cases that data on disk got corrupted without the drive detecting it because the CRC was still valid, it'd also be spotted.

    But anyway, it might be worth testing by replacing some of the disk images with actually truncated ones so that there are holes when reading, so that it returns an actual read error rather than junk data.

    • adrian_b 2 days ago
      The error-correcting codes used by HDDs/SSDs correct or detect the most frequent errors, but sometimes, when there are too many erroneous bits in a sector, they can mis-correct the data and then the HDD/SSD returns a corrupted sector without signaling any error.

      I have seen this a few times on HDDs that had been used for the cold storage of archival data for several years (around 5 years or even more). For each archive file, I had my own hash values that were used to detect corrupted files, which allowed me to detect all such cases. I had duplicates for all such HDDs. Sometimes both HDD copies had a few silently corrupted sectors, but they were not in the same locations, so in all cases I could recover the corrupted files from their duplicates. If I had stored the archival data without redundancy, I would have lost it.

      If you do not use hashes or other error-detecting codes for all your files, like I do, you may have had some failures in your HDDs without recognizing them, but such errors are much more likely to happen in files that have been stored for many years.
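
As a rough illustration of the per-file hash approach described above, here is a minimal Python sketch. It is not the commenter's actual tooling: the choice of SHA-256, the function names (`file_sha256`, `build_manifest`, `find_corrupted`, `demo`) and the in-memory manifest are all assumptions made for the example. In practice the manifest would be stored somewhere other than the disk it describes.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_corrupted(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or no longer match the manifest."""
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or file_sha256(target) != expected:
            bad.append(rel)
    return bad


def demo() -> list[str]:
    """Create two files, flip one bit in one of them, and report the damage."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.bin").write_bytes(b"A" * 4096)
        (root / "b.bin").write_bytes(b"B" * 4096)
        manifest = build_manifest(root)

        # Simulate a silently mis-corrected sector: one bit changes, no I/O error.
        data = bytearray((root / "b.bin").read_bytes())
        data[100] ^= 0x01
        (root / "b.bin").write_bytes(bytes(data))

        return find_corrupted(root, manifest)
```

Running `find_corrupted` against each duplicate disk separately would show which copy of a given file is still intact, which is what makes recovery from the other copy possible.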

      • ramses0 2 days ago
        • adrian_b 15 hours ago
          Yes, already for many years, I have also used par2create/par2verify for adding redundancy to archive files and repairing any corrupted files.

          However, I use both par2create and duplicate storage media, because duplicates that are preferably stored in different geographic locations are the only solution that guards against incidents so serious that they would destroy partially or totally the storage device.

          By itself, when an adequate amount of added redundancy is chosen, par2create is sufficient to recover archive files that are only affected by a few sporadic corrupted sectors, like on a HDD that has been stored in good conditions for some years. It will not help if the entire HDD becomes unusable, due to some mechanical or electrical defect, which may happen in HDDs used for cold storage, instead of being used continuously.

        • wongarsu 17 hours ago
          Or rar files with recovery records. Same concept, but in one self-contained file instead of a number of sidecar files
    • throw0101c 17 hours ago
      > I always thought that the reason zfs did its extensive CRC checks was primarily to detect data corruption while it was in RAM or over the network, with a side effect that in the rare cases that data on disk got corrupted without the drive detecting it because the CRC was still valid, it'd also be spotted.

      Nope, it's always been about on-disk bit rot.

      First off: drive firmware has been known to return the wrong LBA data. The OS asks for 123, the drive reads 234 (and verifies its drive-level CRC, which passes) and sends it up. The application gets a bundle of bits that's not correct. ZFS expects a certain checksum for that part of the tree/file, so when the data from LBA 234 gets returned, it will not match the checksum recorded for 123.

      Next, if you have RAID-1 and a drive has corrupted data, but you don't have higher-level FS checksums, how do you know which mirror has the correct data? They're different, but which is correct? With ZFS you know which block has the correct checksum, return that data to the application, and then use the correct data to repair the wrong one.
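
The two points above can be sketched in a few lines of Python. This is a toy model, not how ZFS is implemented: disks are plain dictionaries keyed by LBA, SHA-256 stands in for whatever checksum the pool uses, and `read_with_self_heal` and `demo` are hypothetical names. The one property it tries to capture is that the expected checksum is kept apart from the data block, so a wrong-but-internally-valid block can be told apart from the right one.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Stand-in for the filesystem-level block checksum."""
    return hashlib.sha256(block).hexdigest()


def read_with_self_heal(mirrors: list[dict[int, bytes]], lba: int, expected: str) -> bytes:
    """Return the first mirror copy matching `expected`, then repair the others."""
    good = None
    for disk in mirrors:
        candidate = disk.get(lba)
        if candidate is not None and checksum(candidate) == expected:
            good = candidate
            break
    if good is None:
        raise IOError(f"no mirror holds a valid copy of block {lba}")

    # Overwrite any copy that differs from the verified one.
    for disk in mirrors:
        if disk.get(lba) != good:
            disk[lba] = good
    return good


def demo() -> tuple[bytes, bool]:
    payload = b"block 123 contents"
    expected = checksum(payload)  # kept in the parent block, not beside the data

    disk_a = {123: b"block 234 contents"}  # firmware handed back the wrong LBA's data
    disk_b = {123: payload}                # the other mirror is intact

    data = read_with_self_heal([disk_a, disk_b], 123, expected)
    return data, disk_a[123] == payload
```

Without the separately stored `expected` value, the two mirrors would simply disagree and there would be no principled way to pick a winner.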

      • BuildTheRobots 13 hours ago
        I don't know how much better modern drives (and SSDs) have gotten[1], but as someone who started digital hoarding in the mid 90's, on-disk bitrot used to be a massive problem. The amount of my video, audio and pictures that suffered damage was palpable. ZFS offering to fix it was a massive selling point at the time and, based on personal experience, it delivered.

        ZFS also lets you specify number of copies on a single disk. This sounds a bit weird, but as drives suffer block failures far more often than total failures, it's actually surprisingly useful in some situations.

        [1] My suspicion is significantly, as storage sizes are now multiple orders of magnitude larger and errors per MB can't have scaled up linearly to match.

    • matja 2 days ago
      You're right that the ECC validation is very robust, but that only validates one small part - that the drive is reading what it has previously written, not that the data was correct when it came in to the drive, correctly handled by the firmware, or even written in the correct place (LBA) on the drive.

      There have been times when some features of entire models of drives have been disabled in the Linux kernel because of buggy firmware that silently writes bad data (with correct ECC), so reading it back is successful from both the drive's and the OS's block driver views.

      I was hit by this myself with the queued TRIM command firmware bug that affected all Samsung 840 EVO SSDs (Linux kernel commit 9a9324d3969678d44b330e1230ad2c8ae67acf81 if you want to look into the history): the drive didn't report any errors, but ZFS kept reporting corruption, and kept on fixing it in the background.

    • ssl-3 10 hours ago
      > Hmmm, it's been a long long time since I actually had a failed drive (and also I don't use zfs), but from what I remember of my last failing drive 20 years ago, the drive was able to detect that sectors had been corrupted, and then failed the read rather than just returning silently corrupted data.

      That's the behavior that is desired, yes. And in a neat world of frictionless pulleys and ropes that don't stretch, perhaps that is what happens.

      In reality, the root reasoning for filesystems to detect bitrot is simpler: It's irrational to expect that a device which is already failing is going to behave in a predictable way.

  • guardiangod 14 hours ago
    I ran 5 external USB + SMR hard disks in RAIDZ 5 for 10 years. The only thing I had to change was to use Highpoint's enterprise level USB controllers; commercial USB controllers from Realtek and Renesas are junk and will drop the drives after a while.

    Even then, I had multiple cases where files were corrupted, and once the whole array refused to come online due to corrupted metadata. I had to make ZFS replay the journal log with undocumented commands. Sometimes it took a few days of hair-raising recovery, but I always managed to get the array back intact.

    The files that are corrupted are always extremely large files (>50 GB) with many small read/writes (eg. iSCSI image files.)

    It's pretty impressive how resilient ZFS is, really, given I had what was likely the worst possible hardware combination.

    • BuildTheRobots 13 hours ago
      Out of curiosity, why were you using a >50 GB file on a dataset as an iSCSI target rather than a zvol, or did I misunderstand?
      • oasisaimlessly 7 hours ago
        Why use zvols? Aren't they essentially just single-file ZFS datasets (allowing e.g. independent snapshotting)?
        • ssl-3 5 hours ago
          They are as you describe.

          Except zvols present as real-live block devices that can do block-device things instead of regular-file things, and that's important for some stuff.

          But AFAICT, iSCSI targets on Linux are not one of those things. They don't care; they work the ~same whether backed by files or block devices.

          And on the performance benchmarks I find that compare performance of zvols-vs-files on ZFS, files usually win.

          > Why use zvols?

          Probably for the same reasons that people recommended separate disk partitions for /var, /usr, and such as was the case ~30 years ago when I got started with desktop *nix systems.

          That reason seemed to boil down to: "If it was good for a Sun/3 in 1986, then it must also be good for a Linux box in 1996." It was a dumb reason.

          tl;dr, folklore. :)

          • wahern 2 hours ago
            > That reason seemed to boil down to: "If it was good for a Sun/3 in 1986, then it must also be good for a Linux box in 1996." It was a dumb reason.

            ext2 disk corruption, especially on power failure or a crash, was a common threat in the 1990s. Not merely to the point of requiring fsck and a bunch of orphaned files (which was inevitable on an unclean shutdown), but just totally fubar'd, requiring a reformat. The only thing worse was then trying to reinstall Slackware from the floppy disks, at least one of which had a better than even chance of corruption from just sitting in the drawer since the last reinstall, requiring another long night nursing a download over the 2400 baud modem.

            I use OpenBSD, and while FFS2 has been far more robust than 1990s Linux ext2, smart partitioning is still warranted, not just for minimizing blast radius, but also for managing backups, etc. I haven't had the chance to use ZFS, and it might be the only filesystem I might consider skipping partitioning for on a workhorse system, but even if you trust the design and code quality of ZFS, it's running unprotected alongside a bunch of horribly buggy kernel subsystems and drivers, so....

            • ssl-3 51 minutes ago
              You raise an interesting point. Please allow me to enhance it.

              It could get worse than reinstalling Slackware, again, from floppies. I didn't get to experience corrupted floppies; I instead had a habit of recycling my Slackware disksets for other purposes after the system was up and running. So any complete re-install started by booting up MS-DOS to run Telemate to start downloading them fresh from Sunsite...again.

              But at least it was Telemate, so I could manage files to free up more floppy disks while this process slowly continued at [I guess I was fortunate] 9600 or 14.4kbps. ;)

              I don't recall much difficulty with ext2 being fragile (though I can provide horror stories about OS/2's HPFS). If I had issues with it, they didn't leave any scars.

              But I accept your correction. It may have been the case that splitting the filesystem into different partitions made sense because ext2 was fickle, and I was just very lucky in deliberately ignoring that advice after the first time I misjudged the partition sizes at install and ran out of space in some directory or other.

              Hard drives seemed so small back then. Installing a real OS meant a serious tradeoff in the ratio between user data and system data.

              ---

              Anyway, ZFS. The ZFS way is that it owns the whole disk -- for a long time, the preferred method didn't even use partitions at all. Nowadays OpenZFS does create one partition for itself by default, but it uses the whole disk just the same.

              Blast radius is limited by having different datasets (think "filesystem-light"), and read-only snapshots, and easy, consistent backups (if you have a compatible device or service to send them to -- otherwise, it's ~the same backup dance as any other filesystem with snapshots).

              It's a different way of doing things, like a subsystem in and of itself. It keeps its own caches and generally wants to be as close to the metal as it can be. Which sounds scary, but meh: Almost everything worth doing gets done with two commands, zfs and zpool, and the syntax has been consistent enough over the years that old documentation from Sun still has value.

              I've been using it for most of a decade now and I find it to be ridiculously good. My only wish is that it could be a first-rate player on Linux, but license incompatibilities be that way sometimes.

          • rincebrain 3 hours ago
            The reason to use zvols is twofold, AFAIR:

            - serving a bunch of storage as a blob is a common use case for e.g. iSCSI exporting, and so, if you want to be able to zfs snapshot/send/rollback/etc on the level of "one logical disk", it makes sense to have an optimized route to expose that rather than making you expose a filesystem that only has one file on it to do the same dance

            - avoiding unnecessary overhead/complexity from the FS layer being involved when all you really care about is exposing a single block device of storage

            Of course, in the era where you're sad that inline compression/checksum/etc are bottlenecking your 48 NVMe pool, that probably isn't where you'd reach for optimizing first...or second...

            But just exposing the block storage is sufficiently useful that at least one of the original projects to port ZFS to Linux wasn't planning to implement the FS layer; they just wanted block storage for Lustre.

            • ssl-3 1 hour ago
              I felt the same way about it as you before I started looking for benchmarks as I wrote my previous comment. :)

              After all: Why would zvols exist at all if they weren't superior in important ways?

              > it makes sense to have an optimized route to expose that rather than making you expose a filesystem that only has one file on it to do the same dance

              It's important to note that additional datasets are essentially free on ZFS; it's no big deal to have lots of them (millions of millions of them is A-OK), and datasets don't have a pre-determined size like zvols do.

              Although zvols can also be grown and shrunk, just as files [within datasets] can be.

              Both datasets and zvols make the same kind of mess out of zfs list's unfiltered output.

              But zvols introduce a new concept, while anyone who uses ZFS is already familiar with datasets that contain files.

              I think this part is a wash, and that it comes down to operator preference.

              > avoiding unnecessary overhead/complexity from the FS layer being involved when all you really care about is exposing a single block device of storage

              Maybe? Again, the benchmarks I found (hours ago now and tabs long-closed; I'll find more if anyone insists) suggested that files were faster than zvols, which suggests reduced overhead. (It's very possible that the tests I found were naively implemented, but then: It's also possible for any of us to do something naive.)

              Anyway, it's interesting to think about.

              It seems like the right answer is to test with one's own workload and find the best fit, instead of assuming that one way is better than the other.

              For its part, ZFS should handle a zvol and a file-on-a-dataset with equal stoicism and reliability.

  • xk3 10 hours ago
    I've done something similar before with Btrfs

    https://gist.github.com/chapmanjacobd/bc6e31c8bc3647e0bcb0c4...

    pretty fun!

  • anonymous_user9 3 days ago
    > The DVA was correct, the sector math was correct, the dd command was correct. The right place, the wrong mental model.

    God the intensity is tiresome. Whether or not it's AI slop, it's also bad writing. Things can be fun or interesting or worthwhile without being a harrowing battle of discovery!

    • calcifer 15 hours ago
      > Things can be fun or interesting or worthwhile without being a harrowing battle of discovery!

      The quoted sentences used "correct", "right" and "wrong". Hardly the sensationalist words you're implying.

      • rcxdude 9 hours ago
        It's not the word choice, it's the whole tone and structure of the sentence. It reads like a horror writer building up the tension before a big reveal but it just keeps drawing it out over a whole article and for something that isn't worth the build-up. It gets quite tiring to read IMO (LLM writing in general tends to have a grandiosity to it which really grates with something which is meant to be more informative, in my experience. They will explain a section of tax law like it's the second coming of Christ).
      • eigencoder 14 hours ago
        I get what they're saying, it's more about the punchy tone than the word choice
  • lanycrost 20 hours ago
    I miss ZFS. I only had a chance once to work with it in production and liked it very much. It has performance overhead compared to journaling filesystems, but it is very well designed.
  • xeyownt 13 hours ago
    Nice writeup. This is the kind of exercise and information that really helps you understand better how things work.
  • igtztorrero 16 hours ago
    I always run my servers on a zfs pool mirrored using raid1 on 2 NVMe drives, because when NVMe drives fail, they fail completely. How can a file be corrupted during normal operation?
    • rincebrain 2 hours ago
      Various ways.

      Drives sometimes are worse at their internal error detection than you might hope, and might return incorrect bytes.

      You might have faulty hardware flipping bits between when you computed a checksum/parity/etc over your data and when you wrote it out, either in memory, or over the wire.

      You might have a software bug or an interaction with a hardware erratum that causes the CPU to misbehave and mangle your bits in certain cases, maybe around switching from running code in a VM to not and back.

      You might have had, say, the Samsung HD204UI hard drive, which loses data after it tells the OS that it's written because of a bug around its write cache, so you get no error back, but you go to read the data back later and it's actually whatever was there before you tried overwriting it.

      SSDs, NVMe and otherwise, _can_ fail in ways that aren't just vanishing from the bus, it's just much less common than with mechanical drives, IME. I have sometimes seen SSDs return incorrect bytes inconsistently or consistently, or start spitting up read/write errors rather than entirely vanishing from the bus.

      Each of the above examples is a real thing I saw happen. None of them is particularly likely, plenty of people never have dumb shit like any of that come up. But it's not never.
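
The write-cache case above (a write is acknowledged but never lands) shows why a check stored next to the data is not enough. Here is a small Python sketch of that failure mode. It is a deliberately simplified model: `Sector` and `demo` are made-up names, and `zlib.crc32` stands in for both the drive's internal ECC and the filesystem's checksum, neither of which is actually CRC32 in practice.

```python
import zlib


class Sector:
    """A drive sector that stores its payload together with its own check value."""

    def __init__(self, payload: bytes):
        self.payload = payload
        self.crc = zlib.crc32(payload)

    def drive_check_passes(self) -> bool:
        """The drive only verifies that the sector is self-consistent."""
        return zlib.crc32(self.payload) == self.crc


def demo() -> tuple[bool, bool]:
    old = Sector(b"old contents")
    new_payload = b"new contents"

    # The filesystem records, in the parent block, the checksum of what it
    # intended to write.
    parent_checksum = zlib.crc32(new_payload)

    # The drive acknowledges the write but drops it, so the sector still
    # holds the previous, perfectly self-consistent data.
    on_disk = old

    drive_ok = on_disk.drive_check_passes()                 # drive sees no problem
    fs_ok = zlib.crc32(on_disk.payload) == parent_checksum  # filesystem does
    return drive_ok, fs_ok
```

In this model the drive reports success while the parent-held checksum flags the stale block, which matches the "no error back, old data on read" behavior described above.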