Short Text Note by arclight (reply)

Once upon a time - around 2007 or so, just before I left sysadminery to do risk assessment on legacy radioactive waste cleanup - I was upgrading a Blackboard LMS. Hardware load balancer in front of two unreliable web/app servers, Oracle db with RMAN backups, NFS file store backed by iSCSI SAN storage for user files. The web front-ends were intentionally provisioned with low disk because they didn't need it - content was on the NFS server or in the database and all the front-ends needed disk for was swap and log files from Apache and Tomcat. Those were religiously scraped because why would you expect the vendor to rotate their giant useless log files or record to a remote log host when they could just let crap accumulate everywhere until their system fell over?

But I digress. I had automated log cleanup and database backup (with tested restores) and main IT managed the SAN backups. It was that brief window of low usage between semesters when we could run the vendor-supplied binaries to upgrade this expensive and cursed assemblage of Java, Perl, Oracle, and human misery. Read the documentation multiple times to understand the order of upgrade operations, clean and quiesce the system, take a few final backup snapshots and pull the trigger.

The upgrade worked as intended, dutifully _moving_ files from the NFS mount of the large iSCSI drive to the web front-ends, filling the disk, then shitting the bed and falling over leaving the system in an unknown and unrecoverable state. As one does when you are Blackboard, the usurious vandal of LMS vendors.

Surveying the flaming wreckage, I called Bb support to ask for guidance. The support peon was impressed with my calm tone. I responded that being outwardly furious and losing my shit at them was unlikely to recover my system or improve any outcome I cared about. I did ask in my support ticket if this behavior from the update was documented and if there was any way I had missed a critical "do this to avoid incinerating prod" step in the upgrade process.

A few hours later I got a response back from upper tier support that no, I had read and done everything correctly according to their documentation and that this whole fiasco could have been avoided by the use of an _intentionally undocumented_ option to the updater.

Their words: _"intentionally undocumented"_.

Why? Somebody might get confused by the explanation so it was omitted in the interest of ... clarity?

I spent several harrowing hours waiting for the iSCSI restore to complete. I did my best to verify no user content was lost but to this day I don't know if we lost data.

Deleting one symlink in the filesystem would have prevented this problem, a symlink that was required in a previous version of the code for the system to work properly (one actually described in the vendor documentation).

We were a small university and did not have a full replica dev system to test the updater on. Why would we? We explicitly did not do development, we ran vendor-supplied code in production. Dev systems were for developers, which we were not. This wasn't a matter of spinning up some virtuals in the cloud - we bought and managed real hardware and there was no way in hell we could justify doubling our hardware investment to have a test environment just to verify vendor-supplied code worked as advertised. There wasn't a possibility of auditing the updater to detect that it would copy and delete the entirety of user-uploaded content, not without decompiling a big blob of Java. I exercised what diligence I could given the garbage state of the vendor's code and still got royally fucked over.

I owned that. I informed my management chain of the situation and kept them updated with new developments and a revised ETA until the system was stabilized, recovered, and updated. That is how I practiced server ops for a decade (1998-2008). You did your diligence, said a prayer as you pushed the button, and you owned the outcome.

No idea what current practice is. That was almost 20 years ago before devops, virtuals, clouds, and containers replaced real machines and dedicated sysadmins. I would like to believe that outlook and practice carried forward since then but I don't know - I left for greener, safer, more relaxing and fulfilling pastures helping package, transport, and store Cold War era uranium-metal-bearing radioactive sludge, moving it out of crumbling fuel pools at Hanford to interim storage elsewhere at Hanford. Then later projects for Dounreay, Sellafield, US commercial plants, Swedish interim waste storage at CLAB underneath Oskarshamn, fire PRA for plants in the US, Sweden, and Spain, and a whole lot of safety analysis code development and software QA. Somewhere in there I live-tweeted Fukushima melting and exploding. That's possibly the most direct act of nuclear safety I've performed with the goal of keeping people informed well enough to contextualize what was happening so they didn't panic and hurt themselves.
