cdaringe 13 hours ago

Explosion at the factory. No humans were injured, but the R&D and manufacturing labs all halted due to side effects of the explosion. On call back then just meant answering your phone. I foolishly answered. It was wrong of my management to place such heavy responsibility on me to lead our sliver of the recovery. I was fresh out of school and didn't understand, at a low level, how the products, the facilities, or the business operations fundamentally worked, which would have deeply helped. Sure, there were senior engineers around, but they had other tasks in those critical moments as well. It took a while to feel supported. They really needed an experienced engineer running the show, not the most junior one. Anyway, I worked for a month straight (longer days, weekends). A handful of us did. I was actively training for bike racing back then, and I knew my KPIs. Let's just say my fitness tanked after being uber stressed out for a month straight.

noop_joe 14 hours ago

One of the most difficult challenges with incidents is dispelling the initial conjecture. Something bad happens and a lot of theories flood the discussion. Engineers work to prove or disprove those theories, but the story around one of them can take on a life of its own outside the dev team. What ends up happening post-incident is a lot of work not only to show that the problem was the result of XYZ, but also that it definitely wasn't the result of ABC.

I was responsible for wsj.com for a few years. The homepage, articles and section fronts were considered dial-tone services (cannot under any circumstances go down). My job was to lead the transition from the on-prem site to the redesigned cloud site. As you can imagine there were a few hiccups along that journey.

One particular incident came when reporters broke the news of a few unrelated industry computer system failures (including one in finance). Because a financial system was involved, people flocked to wsj.com, and the spike in traffic was so large it knocked us out. Now other news outlets were reporting wsj down. Unfortunately, there was a perception that all of these incidents were part of a coordinated hacking event.

For each minute of service interruption, I would need to spend hours post-incident making sure the causes were understood and verified, and that stakeholders knew what they were.

All in all, the on-call experiences were fine. Sure, people were tired if incidents happened in the middle of the night, but the team was supportive and there was a culture of direct problem solving that didn't add _extra_ stress.

Fun stuff.

ezekg 4 days ago

I wrote about my worst on-call experience here [0]. Back in Feb of this year, I had a unique customer workload take down my SaaS in the middle of the night, 2 nights in a row. It took a long time to find the root cause, but it ended up being an inefficient uniqueness lib I was using for background jobs. This particular customer's workload was queuing up millions of background jobs in Redis at a certain time every day, but each time a job was queued, the entire job set was iterated (synchronously) to assert uniqueness. Obviously, this didn't scale.
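
To make the failure mode concrete, here's a minimal sketch in Python (not the actual library or my code; the names and in-memory structures are stand-ins for what the lib kept in Redis) contrasting the scan-the-whole-set check with a set-based one:

    # Rough sketch of the access pattern only, not the real library code.
    # "unique_key" stands in for whatever digest the lib used to identify a job.

    def enqueue_naive(queued_jobs, job):
        # Every enqueue walks the entire job set to assert uniqueness:
        # O(n) per push, O(n^2) to queue n jobs. Fine at hundreds of jobs,
        # brutal when one customer queues millions at once.
        for existing in queued_jobs:
            if existing["unique_key"] == job["unique_key"]:
                return False  # duplicate, drop it
        queued_jobs.append(job)
        return True

    def enqueue_with_set(queued_jobs, seen_keys, job):
        # Constant-time membership check instead (think a Redis SET or a
        # SETNX-style lock keyed on the job digest).
        if job["unique_key"] in seen_keys:
            return False  # duplicate, drop it
        seen_keys.add(job["unique_key"])
        queued_jobs.append(job)
        return True

The first function mirrors the pattern described above; the second is one obvious way out of it.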

I'd rank these 2 nights as the most stressful times of my career. I wrestled with sleeplessness, hopelessness, imposter syndrome, etc.

[0]: https://keygen.sh/blog/that-one-time-keygen-went-down-for-5-...

aristofun 4 days ago

After half a year in production, a malicious user finally hit my code with an injection that crippled the high-throughput system.

This was surprising for 2 reasons:

1. That nobody had come up with the injection for so long.

2. That the injection was pretty silly and non-obvious (which kinda explains the first point lol). It wasn't your typical SQL/JS/CSS injection; it had something to do with localization and the cumbersome way it was implemented in some Java libraries.