Today I had a lot of time at my disposal as I sat on a bus from Momotombo to León. With Reddit blocked by my mobile data provider here in Nicaragua, I turned to my second most visited website, HackerNews. I found a few articles that scratched that itch, and they made me want to share my thoughts on them.
The first article reminded me of my own career journey at Thoughtworks. It shares an important piece of advice: "Take on responsibility before you're officially given it." I strongly agree with this take. When I started my career as a graduate consultant at Thoughtworks, I thought about what I could do to support my team and my tech lead. The career progression and feedback culture at Thoughtworks supports this kind of behavior, even though the sole expectation of a graduate consultant is to focus on your own growth. I was lucky to work with tech leads who gave me constructive feedback and opened doors for me to gain valuable experience. For example, I started to facilitate retrospectives, tech debt conversations and threat modeling workshops. My goal is to grow into the tech lead I personally would want to work with and learn from.
Graham-Yooll also makes the case for consistency. One-off achievements are great, but managers need the reassurance that you can deliver consistently. When I was promoted to Senior Consultant at Thoughtworks, it wasn't just because I had exceeded expectations on one team. It was because I had demonstrated that performance across multiple teams and projects over time. Once a week I wrote down my achievements, and I also regularly collected feedback from my peers. With the recorded feedback and the list of achievements I was able to show my consistent performance to the company.
The next post was from a founder who wrote about how teams struggle to make sense of observability. In his post "Observability's Past, Present, and Future" he explains why he thinks observability still sucks. While reading what comes across as the setup for an AI product pitch, I thought back to past projects. I was confused, because in my memory the teams I had worked with hadn't struggled as much as Sherwood would have readers believe. Quote: "Suddenly, our old reliability playbook stopped working. We couldn't predict all the edge-cases, let alone write tests for them. Failure modes increasingly emerged from the complex interaction between services. When it came to root-cause analysis, logs and basic metrics were no longer enough."
I wonder what those days of cloud adoption must have been like. I am skeptical of the notion that with microservices the complexity grew to a level where you couldn't test for edge cases anymore. It sounds more like teams shifted their testing strategy to the right (into production) rather than strengthening the foundation of the testing pyramid. I am glad we now have DevOps metrics like change failure rate, because they give us a view not only of the delivery of software but also of the effectiveness of the process and the quality gates that are employed.
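To make that metric concrete, here is a minimal sketch of how a team could compute its change failure rate from a deployment log. The data structure and what counts as a "failure" are my own assumptions for illustration, not something from the article:

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    version: str
    caused_incident: bool  # did this change need a rollback, hotfix or incident?

def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that led to a failure in production (a DORA metric)."""
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d.caused_incident)
    return failures / len(deployments)

# Example: 2 out of 8 deployments needed remediation -> 25% change failure rate
history = [Deployment(f"v1.{i}", caused_incident=i in (3, 6)) for i in range(8)]
print(f"Change failure rate: {change_failure_rate(history):.0%}")
```

Tracked over time, a rising rate is a hint that the quality gates before production are too thin.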
Quote: "Instrumentation takes forever. Dashboards are constantly out of date. Monitors misfire. Alerts lack context. On-call is a permanent tax on engineering productivity. Incident resolution can take hours, sometimes days." Sherwood's point is that with all of the complexity came dashboards, monitors and alarms that should alert the team about issues in the production system. Given that those signals can grow into a myriad of signals, he is saying that observability got worse. Well in my opinion if you have the in the quote described problems, then reevaluate each monitor and alarm. Refactor the triggers and alarms that don't make sense, store dashboards as infrastructure as code and update them regularly. Also document recipes for known issues. I tried the Wheel of Misfortune and I was instantly a fan of the training method using role play to allow teams to learn together in a safe space what certain signals and alarms could mean. Similar to feature toggles it is an anti-pattern to let their numbers grow too large. They need to have value to be allowed to stay around. This goes for monitors and alarms as well. Lastly here comes the catch: "The amount of effort we put into observability does NOT line up with the progress we've made towards its goals: better detection, faster root-cause analysis, and more reliable apps." In my opinion observability is not going to give you more reliable apps. You need a solid vertical slice of quality gates to make sure your apps have zero down time. The testing strategy needs to be capable of catching edge-cases before the commit can reach the production system. When teams employ a testing strategy that follows the Swiss Cheese Model you will have much less problems. I would guess that the new company of Sherwood is working on an AI tool that aims to make sense of the signals for teams. The developers though need to do their homework in order to get the most value from their observability signals.
Lastly, what inspired me to write this post was an article by Ron Jeffries, "Refactoring -- Not on the backlog!". He argues that refactoring shouldn't be a separate backlog item. Instead of asking for dedicated time to clean up messy code, developers should incrementally refactor as they build each new feature, cleaning the code path they need rather than working around it (see the sketch below). This approach is easier to justify, delivers immediate benefits, and makes development progressively faster. It is also known as the "Boy/Girl scout rule": leave the code better than you found it. Jeffries wrote this post to remind us of that rule, and I imagine it reflects what he sees in his daily work. I can imagine the kinds of challenges he deals with. I've seen developers discouraged from refactoring because it doesn't align with 'business value,' yet watched codebases deteriorate until major refactoring became unavoidable. Furthermore, in the age of agentic development it is already so much easier to add code to a codebase, and I can imagine developers stopping refactoring in order to finish more tickets, as the demand for ticket throughput has also increased. There are many other reasons I can think of that would demotivate teams from refactoring the codebase.
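Here is a tiny, made-up illustration of cleaning the code path you need rather than working around it: while adding a new discount type, the developer refactors the conditional they are already touching instead of adding one more branch. All names are hypothetical:

```python
# Before: each new discount type means one more branch to work around.
def price_before(base: float, customer_type: str) -> float:
    if customer_type == "student":
        return base * 0.8
    elif customer_type == "senior":
        return base * 0.85
    return base

# After: while adding the "employee" discount (the new feature), the developer
# refactors the path they are touching into a small lookup table.
DISCOUNTS = {
    "student": 0.20,
    "senior": 0.15,
    "employee": 0.30,  # the new requirement, added alongside the cleanup
}

def price_after(base: float, customer_type: str) -> float:
    return base * (1 - DISCOUNTS.get(customer_type, 0.0))
```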
However, I disagree that we shouldn't create any tickets focused on tech debt. Some tech debt is so large that including it in a feature ticket would dilute focus from the business value being delivered, or the change would be too big to be accepted in a single pull request. What I have seen work quite well was this setup:
- The team had a running product in a production system
- There was a rotating pair of developers, called Tech and Ticket (TnT), monitoring the production system for issues. Both developers pair programmed on those issues, allowing them to learn from each other and give constructive feedback in real time.
- Their primary focus was to resolve production issues
- When there was no production issue they would work on tech debt stories
- The tech debt stories would be created by the developers and business roles together in a bi-weekly grooming session. This way tech debt stories would be prioritized by all stakeholders, potential solutions could be discussed in that meeting and tickets could be marked as ready for development.
The team still applied the "Boy/Girl scout rule", solving smaller issues as they went, and got dedicated time from the business stakeholders to tackle bigger tech debt. Including business stakeholders in these conversations made sure the TnT pair focused on valuable work rather than sheer ticket output.
When I read Jeffries' post I wondered why something this simple would warrant writing about. But then I realized: both Sherwood's observability struggles and Jeffries' refactoring plea stem from the same root cause. Teams let complexity accumulate instead of continuously maintaining quality. Whether it's mountains of poorly configured monitors or tangled codebases, the solution is the same: apply the Scout Rule consistently. Clean up incrementally. Don't let technical systems like monitoring dashboards or code grow into unmaintainable messes that require heroic 'dedicated time' to fix. Just as I learned at Thoughtworks that consistent small improvements lead to promotion, our systems need that same philosophy.
It's imperative we write down these best practices for the next generation. As AI tools become more integrated into development workflows, scraping content like this to train future models, I hope they'll help reinforce these principles rather than make it easier to skip them. The rise of agentic development adds a fascinating new dimension to this conversation, one I'll explore in a future post.
Well I need to get off the bus now. Cheerio!