- Philipp Ludewig

Reading on the Road: Responsibility, Observability, and Refactoring

Today I had a lot of time at my disposal as I sat on a bus from Momotombo to León. With Reddit blocked by my mobile data provider here in Nicaragua, I turned to my second most visited website, HackerNews. Three articles immediately caught my eye and made me reminisce.

The first article is from Graham-Yooll and reminded me of my own career journey at Thoughtworks. It shares an important piece of advice: "Take on responsibility before you're officially given it." I strongly agree with this take. When I started my career as a graduate consultant at Thoughtworks, I quickly took on responsibility and thought about what I could do to support my team and my tech lead. The career progression and feedback culture at Thoughtworks supports such behavior, even though the sole expectation of a graduate consultant is to focus on your own growth. I was lucky to work with tech leads who gave me constructive feedback and opened doors for me to gain valuable experience. For example, I started to facilitate retrospectives, tech debt conversations, and threat modeling workshops. My goal has always been to grow into the tech lead I personally would want to work with.

Graham-Yooll also emphasizes the importance of reliability. One-off achievements are great, but managers need the reassurance that you can deliver consistently. When I was promoted to Senior Consultant at Thoughtworks, it wasn't just because I had exceeded expectations on one team; it was because I had demonstrated that performance across multiple teams and projects over time. Once a week I would write down my achievements, and I regularly collected feedback from my peers. With the recorded feedback and the list of achievements, I was able to demonstrate my consistent performance in the promotion process.

The next post was from a founder who wrote about the struggle teams have making sense of observability. In his post "Observability's Past, Present, and Future" he explains why he thinks observability still sucks. While reading what felt like a pitch for an AI product, I thought about past projects. I was confused, because in my memory the teams I had worked with hadn't struggled as much as Sherwood would have readers believe. Quote: "Suddenly, our old reliability playbook stopped working. We couldn't predict all the edge-cases, let alone write tests for them. Failure modes increasingly emerged from the complex interaction between services. When it came to root-cause analysis, logs and basic metrics were no longer enough."

I wonder what those days of cloud adoption must have been like. The notion that microservices grew complexity to a level where you could no longer test for edge cases doesn't ring true to me. It sounds more like teams shifted their testing strategy to the right (into production) rather than orienting themselves around the testing pyramid. I am glad we now have DevOps metrics like the change failure rate, because they give us a view not only into the development phases of software but also into the effectiveness of the process and the quality gates that are employed.
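To make the metric concrete, here is a toy sketch of the change failure rate: the share of deployments that led to a failure in production. The deployment records are invented for illustration.

```python
# Toy calculation of the change failure rate (a DORA/DevOps metric):
# the fraction of deployments that caused an incident in production.
from dataclasses import dataclass


@dataclass
class Deployment:
    version: str
    caused_incident: bool


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that required a fix or rollback."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_incident)
    return failed / len(deployments)


# Invented sample data, purely for illustration.
deployments = [
    Deployment("1.4.0", caused_incident=False),
    Deployment("1.4.1", caused_incident=True),
    Deployment("1.5.0", caused_incident=False),
    Deployment("1.5.1", caused_incident=False),
]
print(f"Change failure rate: {change_failure_rate(deployments):.0%}")  # 25%
```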

Quote: "Instrumentation takes forever. Dashboards are constantly out of date. Monitors misfire. Alerts lack context. On-call is a permanent tax on engineering productivity. Incident resolution can take hours, sometimes days." Sherwood's point is that with all of the complexity came dashboards, monitors and alarms that should alert the team about issues in the production system. Given that those signals can grow into a myriad of signals, he is saying that observability got worse. Well in my opinion if you have the problems described in the quote, then reevaluate each monitor and alarm. Refactor the triggers and alarms that don't make sense, store dashboards as infrastructure as code and update them regularly. Also document recipes for known issues. I tried the Wheel of Misfortune and I was instantly a fan of the training method using role play to allow teams to learn together in a safe space what certain signals and alarms could mean. Similar to feature toggles it is an anti-pattern to let their numbers grow too large. They need to have value to be allowed to stay around. This goes for monitors and alarms as well. And here comes the catch: "The amount of effort we put into observability does NOT line up with the progress we've made towards its goals: better detection, faster root-cause analysis, and more reliable apps." In my opinion observability is not going to deliver more reliable apps. You need a solid vertical slice of quality gates to make sure your apps have zero down time. The testing strategy needs to be capable of catching edge-cases before the commit can reach the production system. When teams employ a testing strategy that follows the Swiss Cheese Model you will have much less problems. I would guess that Sherwood's new company is working on an AI tool that aims to make sense of the signals for teams. The developers though need to do their homework in order to get the most value from their observability signals.

What finally inspired me to write this post was an article from Ron Jeffries, "Refactoring -- Not on the backlog!". He argues that refactoring shouldn't be a separate backlog item. Instead of asking for dedicated time to clean up messy code, developers should incrementally refactor as they build each new feature, cleaning the code path they need rather than working around it. This approach is easier to justify, delivers immediate benefits, and makes development progressively faster. It is also known as the "Boy/Girl Scout rule": leave the code better than you found it. Jeffries wrote this post to remind us of the rule, and I imagine it reflects what he sees in his daily work. I've seen developers discouraged from refactoring because it doesn't align with 'business value,' yet watched codebases deteriorate until major refactoring became unavoidable. Furthermore, in the age of agentic software development, it is already much easier to add code to a codebase than to clean it up. It therefore seems only logical to me that developers, driven by pressure from their organization, care more about completed tickets than clean source code.
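A contrived before/after sketch of what "cleaning the code path you need" can look like, with invented names: while adding a discount feature, the developer tidies only the function the feature has to pass through instead of bolting the change onto the cryptic original.

```python
# Before: the function the new discount feature would have to pass through.
def calc(items):
    t = 0
    for i in items:
        t = t + i["price"] * i["qty"]
    return t


# After: same behaviour, but the path is named and split up while the
# discount feature is added, leaving the code better than it was found.
def line_total(item: dict) -> float:
    return item["price"] * item["qty"]


def order_total(items: list[dict], discount: float = 0.0) -> float:
    """Sum of line totals with the new percentage discount applied."""
    subtotal = sum(line_total(item) for item in items)
    return subtotal * (1.0 - discount)


print(order_total([{"price": 10.0, "qty": 2}], discount=0.1))  # 18.0
```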

However, I believe we should still create tickets focused on technical debt. The real question is whether technical debt is being worked on continuously. The danger I see, if teams only follow the Scout rule, is that it creates uncertainty on the management side about where the developers' focus lies. In my opinion, the technical and business parts of the team should regularly discuss their technical debt. Every change in the source code brings technical debt in one way or another, and it's better if everyone involved is aware of this. Managing which technical debt is acceptable and which is a burden is the most important thing. DevOps metrics help measure how technical debt affects the software development process. Additionally, it helps to measure the most important architectural characteristics through so-called "fitness functions," a keyword from Evolutionary Architecture.
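As an illustration of what such a fitness function can look like, here is a minimal sketch of an architectural check run as a regular test. It assumes a hypothetical layered layout with a "src/domain" package that must not import from an "infrastructure" package; both names are made up for the example.

```python
# A fitness function run in CI: the domain layer must not depend on the
# infrastructure layer. Package names and layout are assumptions.
import ast
from pathlib import Path

FORBIDDEN_PREFIX = "infrastructure"   # assumed package name
DOMAIN_DIR = Path("src/domain")       # assumed source layout


def imported_modules(source: str) -> set[str]:
    """Collect the top-level module names imported by a Python file."""
    tree = ast.parse(source)
    modules: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules


def test_domain_does_not_depend_on_infrastructure():
    violations = [
        str(path)
        for path in DOMAIN_DIR.rglob("*.py")
        if FORBIDDEN_PREFIX in imported_modules(path.read_text())
    ]
    assert not violations, f"Domain layer imports infrastructure: {violations}"
```

Because it runs with every build, the team notices the moment a change starts eroding the characteristic they agreed to protect, instead of discovering it months later.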

Here's a real-world example:

  • The team had a running product in a production system.
  • A rotating pair of developers monitored the production system for issues, a role called Tech and Ticket (TnT). Both developers pair programmed on the issues, allowing them to learn from each other and give constructive feedback in real time.
  • Their primary focus was to resolve production issues.
  • When there was no production issue, they would work on tech debt stories.
  • The tech debt stories were created and approved by the developers and business roles together in a bi-weekly grooming session. This way, tech debt stories were prioritized by all stakeholders, potential solutions could be discussed, and tickets could be approved.

The team in this example still applied the "Boy/Girl Scout rule" to solve smaller issues and got dedicated time from the business stakeholders to pay down bigger tech debt. Including business stakeholders in these conversations ensured that the TnT pair was allowed to work on technical debt that all parties saw as valuable.

When I read Jeffries' post, I wondered how something as obvious as the Boy/Girl Scout rule could be worth writing about. But then I realized: both Sherwood's observability problems and Jeffries' refactoring plea have the same root cause. Teams let complexity accumulate instead of continuously maintaining quality. Whether it's mountains of poorly configured monitors or tangled codebases, the solution is the same: apply the Scout rule consistently. Clean up incrementally. Don't let technical systems, whether monitoring dashboards or code, grow into unmaintainable messes that require heroic 'dedicated time' to fix. Just as I learned at Thoughtworks that consistent small improvements lead to promotion, our systems need that same philosophy.