As you may have seen if you are on the Fedora announce list, we had an outage of our main build system NFS storage the other day. This meant that no builds could be made and data could not be downloaded from koji (rpms, build info, etc). I thought I would share what happened here so we can learn from it and try to prevent or mitigate it happening again.
First, a bit of background on the setup: we have a storage device that exports raw storage as iSCSI. This is then consumed by our main NFS server (nfs01), which uses the device with lvm2 and has an ext4 filesystem on it, around 12TB in size. This data is then exported over NFS to various other machines, including the builders, the kojipkgs squid frontend for packages, the koji hubs, and the release engineering boxes that push updates. We also have a backup NFS server (bnfs01) with its own separate storage holding a backup copy of the primary data.
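For anyone unfamiliar with how a stack like that is layered, here is a rough sketch of the idea. The device paths, target name, volume group, and export names below are made up for illustration and are not our actual configuration:

```
# Log in to the iSCSI target so the raw storage shows up as a local block device
# (portal address and target IQN are hypothetical)
iscsiadm -m discovery -t sendtargets -p storage.example.org
iscsiadm -m node -T iqn.2000-01.org.example:storage.lun1 -p storage.example.org --login

# Put lvm2 on top of the iSCSI device and create the ~12TB ext4 filesystem
pvcreate /dev/sdX
vgcreate vg_koji /dev/sdX
lvcreate -L 12T -n koji vg_koji
mkfs.ext4 /dev/vg_koji/koji

# Mount it and export it over NFS to the builders, hubs, etc.
mount /dev/vg_koji/koji /mnt/koji
echo '/mnt/koji builder*.example.org(rw,no_root_squash)' >> /etc/exports
exportfs -ra
```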
On the morning of December 8th, the connection between the iSCSI backend and nfs01 had a hiccup. It retried the in-progress writes, decided it could resume, and kept going. The filesystem had "Errors behavior: Continue" set, so it kept going as well (although no actual fs errors were logged, so that setting may not have mattered). Shortly after this, NFS locks started failing and builds were getting I/O errors. An lvm snapshot was made and an fsck run on that snapshot, which completed after around 2 hours. An fsck was then run on the actual volume itself, but that took around 8 hours and showed a great deal more corruption than the snapshot had. To get things into a good state, we then rsynced the snapshot off to our backup storage (which took around 8 hours) and merged the snapshot back as the master filesystem on the volume (which took around 30 minutes to complete). Then, after a reboot, we were back up, but a small number of builds had been made after the issue started. We purged those from the koji database and re-ran them against the current/valid/repaired filesystem. After that the builders were brought back on-line, the queued-up builds were processed, and things were back to normal.
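Roughly, the snapshot/check/merge dance looks like the following. This is only a sketch: the volume names, mount points, snapshot size, and backup destination are placeholders, and the exact options we used may have differed:

```
# Take a snapshot of the suspect volume so it can be inspected without touching the origin
lvcreate -s -L 500G -n koji_snap /dev/vg_koji/koji

# Check the snapshot first (read-only check shown here); this was the ~2 hour fsck.
# The ~8 hour fsck of the origin volume itself would be the same command against
# /dev/vg_koji/koji.
e2fsck -fn /dev/vg_koji/koji_snap

# Copy the known-good snapshot contents off to the backup server (~8 hours)
mount -o ro /dev/vg_koji/koji_snap /mnt/snap
rsync -aHAX /mnt/snap/ bnfs01:/srv/koji-backup/
umount /mnt/snap

# Merge the snapshot back so its contents become the primary filesystem again (~30 min);
# the merge proceeds once the origin volume is unmounted/reactivated
lvconvert --merge /dev/vg_koji/koji_snap
```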
So, some lessons/ideas here, in no particular order:
- 12TB means almost anything you decide to do will take a while to finish. On the plus side, that gives you lots of time to think about the next step.
- We should change the default on-error behavior to at least 'read-only' on that volume. Then errors would at least stop further corruption, and at best prevent the need for a lengthy fsck. It's not entirely clear whether the iSCSI errors would have tripped the fs error condition or not, however. (There is a tune2fs sketch of this change after the list.)
- We could do better about more regular backups of this data. A daily snapshot and an rsync from that snapshot to backup storage could save us time the next time a backup sync is needed, and we would also have the snapshot to go back to if needed. (A sketch of such a job is after the list.)
- Down the road, some of the cluster filesystems might be worth investigating and transitioning to. If we can spread the backend storage around and have enough nodes, the failure of any one node might not have as much impact.
- Perhaps we could add monitoring for iSCSI errors so that we notice and react to them more quickly. (A rough check script is sketched after the list.)
- lvm, its snapshots, and the ability to merge a snapshot back in as the primary volume really helped us out here.
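On the error-behavior point, the change itself is a one-liner with tune2fs (the volume path is again a placeholder):

```
# Show the current setting ("Errors behavior: Continue" in our case)
tune2fs -l /dev/vg_koji/koji | grep -i 'errors behavior'

# Remount read-only on filesystem errors instead of continuing
tune2fs -e remount-ro /dev/vg_koji/koji
```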
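A daily snapshot-and-rsync job could be as simple as something like this cron-driven sketch; the names, snapshot size, and backup destination are hypothetical, not what we actually run:

```
#!/bin/bash
# Daily backup: snapshot the volume, rsync the snapshot off-box, then drop the snapshot.
SNAP=koji_backup_$(date +%Y%m%d)

lvcreate -s -L 500G -n "$SNAP" /dev/vg_koji/koji
mkdir -p /mnt/"$SNAP"
mount -o ro /dev/vg_koji/"$SNAP" /mnt/"$SNAP"

rsync -aHAX --delete /mnt/"$SNAP"/ bnfs01:/srv/koji-backup/

umount /mnt/"$SNAP"
lvremove -f /dev/vg_koji/"$SNAP"
```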
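And for the monitoring idea, even a crude check that looks for recent iSCSI connection errors in the kernel log (run from cron or a monitoring system) would have flagged this sooner. This is just a sketch for a systemd-based box, not a check we actually run, and the error patterns are guesses:

```
#!/bin/bash
# Warn if there are recent iSCSI connection errors, or if no session is up at all.
errors=$(journalctl -k --since "10 minutes ago" | grep -ci 'iscsi.*\(conn error\|connection.*failed\)')
sessions=$(iscsiadm -m session 2>/dev/null | wc -l)

if [ "$sessions" -eq 0 ]; then
    echo "CRITICAL: no active iSCSI sessions"
    exit 2
elif [ "$errors" -gt 0 ]; then
    echo "WARNING: $errors recent iSCSI connection errors in kernel log"
    exit 1
fi
echo "OK: $sessions iSCSI session(s), no recent connection errors"
exit 0
```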
Feel free to chime in with other thoughts or ideas. Hopefully it will be quite some time before we have another outage like this one.