How I Got Permission to Burn It All Down

There’s a class of bug that will make you feel like you’re losing your mind.

It doesn’t happen every time. It doesn’t happen on command. It happens at 2pm on a Tuesday, or right when a client is watching a demo, and then it doesn’t happen again for three days. When you try to trace it, you’re watching five things happen simultaneously, any one of which could be the culprit. The answer changes depending on timing you can’t control.

That was the iOS app I inherited.


The problem

The app handled communications with a piece of external hardware that could send and receive messages at any time. Events could originate from either the hardware or the user, often in rapid succession. The original architecture handled each incoming message by spinning up a new thread. It was a reasonable approach to the problem, given the constraints at the time.

The problem was state. If the hardware sent a message assuming a certain state, and a user action had just changed that state on a different thread, you had a crash. Or worse, silent data corruption. And it was nearly impossible to reproduce reliably. Shift the timing by milliseconds and you’d get a different result. Unit tests couldn’t catch it. Code review couldn’t catch it. It just happened, unpredictably, in production.
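To make the hazard concrete, here is a minimal, purely illustrative sketch (not the actual app code) of the thread-per-message pattern: every "message" gets its own thread, and all of them do an unsynchronized read-then-write on shared state. The window between the read and the write is where updates get silently lost.

```java
import java.util.ArrayList;
import java.util.List;

public class ThreadPerMessage {
    // Shared mutable state, updated from hardware and user threads alike.
    static int sharedCounter = 0;

    public static void main(String[] args) throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            // One new thread per "message", as in the original architecture.
            Thread t = new Thread(() -> {
                int seen = sharedCounter;  // read the state...
                sharedCounter = seen + 1;  // ...write it back; another thread
                                           // may have updated it in between
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();
        // Often prints less than 1000: some updates were silently lost,
        // and the exact count changes from run to run.
        System.out.println("expected 1000, got " + sharedCounter);
    }
}
```

The result depends entirely on scheduler timing, which is exactly why no unit test or code review could pin the real bugs down.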

When the previous team director left, we brought in outside contractors to assess the situation. One of them built me a diagram tracing the call paths a single event could take through the system. It looked like a subway map with no center. There was no reliable way to know, given any starting point, where you would end up.


The contrast

Meanwhile, our Android team had a similar product, built differently. Every incoming message dropped into a queue regardless of source. One thread pulled from that queue and processed messages in order, one at a time. No collisions. No ambiguity.
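The Android-side shape can be sketched in a few lines (class and message names here are my own, for illustration): one `BlockingQueue` as the single inbox, one worker thread draining it. Enqueueing is the only cross-thread operation; everything else happens on the worker, in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SerialPipeline {
    private final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
    final List<String> processed = new ArrayList<>(); // touched only by the worker

    private final Thread worker = new Thread(() -> {
        try {
            while (true) {
                String msg = inbox.take();      // blocks until a message arrives
                if (msg.equals("STOP")) break;  // sentinel to shut down cleanly
                processed.add(msg);             // one message at a time, in order
            }
        } catch (InterruptedException ignored) { }
    });

    public SerialPipeline() { worker.start(); }

    // Safe to call from any thread: enqueueing is the only shared operation.
    public void post(String msg) { inbox.add(msg); }

    public void shutdown() throws InterruptedException {
        inbox.add("STOP");
        worker.join();
    }

    public static void main(String[] args) throws InterruptedException {
        SerialPipeline p = new SerialPipeline();
        p.post("hardware: status=ready");
        p.post("user: tapped connect");
        p.post("hardware: ack");
        p.shutdown();
        p.processed.forEach(System.out::println);
    }
}
```

Because only the worker touches the state, the processing log is linear and reproducible, which is what made the side-by-side debugging comparison so stark.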

I took this contrast to my manager and walked him through tracing a single bug on each platform side by side. On iOS, you had to open multiple threads simultaneously, watch shared state being updated from different directions, and try to determine which update “won” and in what order. If the timing had been slightly different, the answer would have changed too. On Android, you followed the queue, read the log linearly, and wrote a unit test that reproduced it exactly.

He brought in the CTO. We showed him the same comparison.

The choice was clear. One system was consistently crashing. The other was stable. One had bugs you couldn’t reproduce. The other had bugs you could fix.


Making the case

The key was leading with the cost of the existing system rather than the appeal of a new one. The instability wasn’t just a technical problem. It was an ongoing drain on engineering time. Every hour spent trying to reproduce a threading bug was an hour not spent building features. Every contractor brought in to untangle the architecture was money spent with no clear end in sight.

A rewrite carried real risk. But continuing as-is meant accepting a steady, open-ended cost with no path to resolution.

We got four months approved to build an MVP. We adopted a state machine model fed by a message queue: every interaction was traceable, every state transition was explicit, the logic was unit-testable, and the whole flow was debuggable. That became the foundation for an enterprise-level library the team could iterate on, test reliably, and actually reason about.
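The shape of that rewrite can be sketched as follows (states and messages here are hypothetical stand-ins, not the real library): the machine's only input is the next message off the queue, and each transition is a pure function of the current state and that message, so every step can be unit-tested in isolation and every run replayed from its log.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ConnectionMachine {
    enum State { IDLE, CONNECTING, CONNECTED }

    // Pure transition function: (state, message) -> state.
    // Unexpected inputs are handled explicitly (stay in place) instead of
    // becoming timing-dependent crashes.
    static State step(State s, String msg) {
        switch (s) {
            case IDLE:       return msg.equals("user:connect") ? State.CONNECTING : s;
            case CONNECTING: return msg.equals("hw:ack")       ? State.CONNECTED  : s;
            case CONNECTED:  return msg.equals("hw:drop")      ? State.IDLE       : s;
            default:         return s;
        }
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        queue.add("user:connect");
        queue.add("hw:ack");
        queue.add("hw:drop");

        State state = State.IDLE;
        while (!queue.isEmpty()) {
            String msg = queue.poll();
            State next = step(state, msg);
            // A linear, replayable log: one transition per line.
            System.out.println(state + " --" + msg + "--> " + next);
            state = next;
        }
    }
}
```

A test simply calls `step` with a state and a message and asserts the result; no threads, no timing, no flakiness.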


What I learned

Two things stayed with me from this experience.

The most persuasive argument for a hard decision is usually a comparison, not an assertion. I didn’t tell leadership the architecture was bad. I showed them what debugging looked like in each world and let the contrast speak for itself.

The other thing is that determinism is underrated. Engineers spend a lot of time chasing performance and new features, and not always enough time asking whether a system is actually predictable and testable. A system that is a little slower but completely predictable will outlast a faster, chaotic one, especially as the team and codebase grow.