A bit of discussion indicated that the trigger for the CPU spikes both times was our CEO logging in. We re-deployed to get a clean start, permanently banned him from the service, and moved on.
This is like finding a live grenade under your bed and putting it under the rug.
They found a way to reproduce a system-killing bug, and instead of taking the time to understand it, they threw away their test case.
They contained the impact. Root-causing or “understanding” should come after impact mitigation. If needed, find a safe way to reproduce the bug without customer impact.
We reverted the refactoring, deployed, un-banned the CEO, and set about analysis.
I gasped when I saw this:
This is like finding a live grenade under your bed and putting it under the rug.
They found a way to reproduce a system-killing bug, and instead of taking the time to understand it, they threw away their test case.
Yeah, me too, but if you keep reading, they didn’t actually “move on” in the way that it sounds.
They contained the impact. Root-causing or “understanding” should come after impact mitigation. If needed, find a safe way to reproduce the bug without customer impact.