Worst SysAdmin Accident [closed]


In line with the question about Best sysadmin accident, what's the worst accident you've been involved in? Unlike the previous question, I mean "worst" in the sense of most system damage or actual harm to people.

I'll start with mine:

We have two remote wiring closets that are at the end of a 100-foot corridor which has a metal grate for the floor. After we had Cat6 cable installed, the contractors cleaned up all the debris that dropped through the grating to the concrete 3 feet below. A co-worker and I entered the corridor to check on the progress one day but were distracted and didn't notice that a piece of grating had been moved aside. My buddy stepped into air and his chest slammed into the steel crossbar. He was winded and sore enough to take a couple days off, but luckily the steel beam had rounded edges and the size of the opening was such that he didn't smack his head into it or the floor below.

Obviously we learned that areas where the floor is partially removed need to be flagged.

Solution 1:

Imagine, if you will, living in South Florida during Hurricane Andrew (slightly before the 24x7 craze). All of your servers are securely locked up in a building that requires you to badge in, inside a more secure area that requires an additional scan of your badge. Imagine a nitwit who did not account for needing actual handles on the doors. Imagine a four-million-dollar contract requiring a delivery, the closest electricity being 230 miles north, gas being in short supply, dangerous roads, and a generator that was designed to provide 48 hours of electricity. Laugh if you will at a collection of servers in the back of a truck, stuck on the Mickey Mouse turnpike, stalled for want of gas. Laugh if you will at the total lack of an excuse for how badly it all went from a logistical, sysadmin, and operational standpoint. The best part was listening to the hundreds of UPS units crying simultaneously for life-giving electricity.

Solution 2:

When I worked for Cisco, I used to get customers who had bought $30 wireless cards and were spitting chips when their driver wouldn't install, or people with the cheapest, most basic router Cisco had who would rant and rave over support issues.

This was all put in context one day, when I received a call from one of the world's largest card providers (think Amex, Mastercard, Visa, Diners... in fact it was one of those brands, but I don't know if they would appreciate me mentioning it). I was front-line support; my only job was to assess the scenario, rate it, and put it through to the appropriate support division. This case was the only Priority One case I ever put through.

A man from the card company called up and stated that the link between their east-coast and west-coast US mainframes was down. If an account was created on one mainframe, its transactions were always processed on that mainframe. That was fine as long as you were always near that mainframe. But on this particular day, if you had an account on the east-coast mainframe and you were on the west coast, the transaction would be denied because the link was down.

The standard question when assessing damage was "How much is this costing your business?" The reply, calm and collected, was "About a million dollars every 30 seconds."

It really puts things into context the next time you feel tempted to rant and rave to customer support over your $30 wireless card.

(It should be noted that Cisco had his link up and running within 5 minutes of the call being transferred.)

Solution 3:

It's very common to alias commands like rm or mv to add the '-i' option to avoid mistakes. But this happened in my company a while ago. Someone put this line in root's .bashrc on one of the servers.

alias rm='rm -i'

Then he copied the line and replaced rm with mv... or so he thought:

alias rm='rm -i'
alias mv='rm -i'

The rest is history :)

Well, the thing is that when mv'ing, the 'are you sure' prompt said 'remove' instead of 'move', and yet...
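For what it's worth, here is a rough sketch of the aliases as they were presumably meant to be, plus a small sanity check you could run after editing a .bashrc. The helper name `alias_matches_name` is my own invention, not anything standard, and it only handles simple aliases whose definition is printed in single quotes.

```shell
# Intended aliases: each one wraps the command it is named after.
alias rm='rm -i'
alias mv='mv -i'

# Hypothetical helper: succeeds only if the alias NAME expands to a
# command line whose first word is NAME itself.
alias_matches_name() {
    name=$1
    # `alias NAME` prints the definition (e.g. "alias mv='mv -i'" in bash,
    # "mv='mv -i'" in some other shells); fail if no such alias exists.
    def=$(alias "$name" 2>/dev/null) || return 1
    body=${def#*=}       # drop everything up to and including the first "="
    body=${body#\'}      # drop the opening single quote
    first=${body%% *}    # keep only the first word
    first=${first%\'}    # drop a closing quote if the alias has no arguments
    [ "$first" = "$name" ]
}
```

Running `alias_matches_name mv || echo "mv alias looks wrong"` right after sourcing the file would have flagged the `alias mv='rm -i'` line before anyone used it.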

Solution 4:

We were installing a massive Point of Sale system at a large retailer (over 1000 branches). The central polling server was all custom HP-Unix code, and the test to production migration was handled by a single guy - the IT Director's son.

This guy spent 7.95 hours of his day reading fantasy novels, and the other few minutes running his batch job to migrate nightly builds to production. The system was 3 days from going live at 150 of the branches (our first "real" rollout). Everything was set, and my team had just finished testing the final pieces of code. We committed our changes and moved our images from development to test to be picked up by the IT Director's son the next morning.

I get there at 8:00am and everything is in chaos. Turns out that the son had been instructed that after copying files to production, he was supposed to go into the ./changed folder and type "rm -rf *". Yes, someone actually told him this! Of course, he accidentally did this on the production root drive, which also housed our transactional polling database (which happened to be offline for backups at the time, just our luck).

Result: Our 16 pilot stores had to serve customers out of cigar boxes (in some cases, literally) for 2 days. The IT Director's son was demoted to Server Watcher (he sat in the freezing cold server room and was supposed to watch for red lights ... but he wasn't allowed to touch anything ... they didn't even give him a computer, and they revoked all his logins/email). Our development team pulled an all-nighter rebuilding lost data from backups and retesting/resubmitting code.

Luckily, we made the 150-branch rollout, but it was the worst rollout experience EVER.
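In hindsight, the "cd into ./changed and type rm -rf *" instruction could have been wrapped in something that refuses to run anywhere else. Below is a minimal sketch, not what we actually had. The function name `clean_changed_dir` is made up, and it assumes a `find` that supports `-mindepth` and `-delete` (GNU and BSD find do; I can't vouch for every commercial Unix).

```shell
# Hypothetical helper: empty a directory, but only if it really is a
# directory named "changed". Everything else is refused.
clean_changed_dir() {
    target=$1
    # Refuse an empty argument or something that is not a directory.
    [ -n "$target" ] || { echo "refusing: no directory given" >&2; return 1; }
    [ -d "$target" ] || { echo "refusing: $target is not a directory" >&2; return 1; }
    # Resolve to an absolute physical path so symlinks and ".." cannot fool us.
    resolved=$(cd "$target" && pwd -P) || return 1
    # Never operate on the root directory.
    [ "$resolved" != "/" ] || { echo "refusing: target resolves to /" >&2; return 1; }
    # Only proceed when the last path component is literally "changed".
    [ "${resolved##*/}" = "changed" ] || {
        echo "refusing: $resolved is not a 'changed' directory" >&2
        return 1
    }
    # Delete the contents by absolute path, never relative to the current directory.
    find "$resolved" -mindepth 1 -delete
}
```

The point of the design is that the dangerous command never depends on where the operator happens to be standing: `clean_changed_dir ./changed` either empties that one directory or prints a refusal.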

Solution 5:

I learned to finish typing every command before hitting the Enter key.

A related habit of mine: when I'm not sure about a command, I press Home and type some junk characters in front of it so that the command is not a recognised one.

me@mypc:~$ sdkjfhdsudo mv --too-many --switches-to-be --comfortable --working-with --while-running --an-important-command /here/this /there/that

bash: sdkjfhdsudo: command not found

And then I check the options again, slowly if need be. Does anyone else do such a thing? Of course, you have to make sure you type enough junk characters (5+) to keep the result from being another valid command that does more unpredictable damage.

(Is there a basic flaw in this that I have not figured out or a situation where, given 5+ junk characters, typically in the "asdfghjkl" keys, it does something unpredictable?)
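One caveat I'm fairly sure of: the shell still performs redirections and command substitutions before it finds out that the command doesn't exist, so a junk prefix in front of something like `cmd > important_file` would still truncate the file. A variation on the same idea is to prefix the line with a do-nothing wrapper that just prints its arguments. The `dryrun` name below is hypothetical, and this is only a sketch; it has the same limitation with redirections and `$(...)`.

```shell
# Hypothetical helper: print the command that would run, one bracketed
# token per argument, without executing anything.
dryrun() {
    printf 'would run:'
    for arg in "$@"; do
        printf ' [%s]' "$arg"
    done
    printf '\n'
}
```

Typing `dryrun sudo mv "/here/this file" /there/that` prints `would run: [sudo] [mv] [/here/this file] [/there/that]`, which also shows how the shell split and quoted the arguments. Once the line looks right, press Home again and delete the prefix.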