The Future Of Reliable Software Systems

Arran McCabe

Since the invention of software, operators have been required to keep systems running and fulfilling their purpose. This practice has remained relatively unchanged since the Industrial Revolution, when engineers would dutifully monitor rows of gauges and dials to ensure their machinery kept running. This makes sense: software systems are industrial machinery, simply operating in the domain of bits rather than atoms. This article explores the limits of traditional operational approaches and where we go from here.

Trends

The complexity and criticality of software systems have consistently increased over time. We've moved well beyond the classic two-tier architectures of the 90s and 00s to a world of microservices, edge computing, and serverless architectures. This increased complexity has been necessary to meet the demands of modern workloads. However, the failure modes of these systems have also become more complex. After all, big complex things fail in big complex ways.

What I believe will add fuel to this fire in the coming years is AI code generation. Tools like GitHub Copilot are radically increasing developer productivity: GitHub claims Copilot decreases task completion time by 55%. This productivity boost will, in turn, dramatically increase the volume of code in production that needs to be operated, a trend we can only presume will accelerate as adoption and efficiency improve.

The current economic climate is also placing pressure on companies to reduce their burn. The zero-interest-rate environment of the past decade enabled teams to offset their operational burden with aggressive hiring, which is no longer feasible. Even if capital were freely available, the number of engineers capable of operating these complex systems is finite.

The convergence of these trends means that operators will need to support vastly more software, to a higher standard, with fewer people. The only solution, in my mind, is smarter tools and automation. Let's explore the incident lifecycle and see where new approaches could help.

Detection

In today's software landscape, we're grappling with a paradox: as system complexity soars and storage costs plummet, we're inundated with data yet struggle to harness its full potential. This is the quintessential 'champagne problem' of modern computing: an abundance of data without an effective means of using it. The central challenge is balancing recall (capturing every relevant signal) against noise (surfacing non-critical information that overwhelms operators), a task that grows harder as systems expand and change rapidly.

Traditional methods, such as selecting a few metrics for monitoring, are becoming obsolete in this dynamic environment. Even advanced AIOps strategies, though a step forward, are hampered by high costs and development demands, and they still leave the tedious task of alert threshold maintenance unresolved.

To bridge this gap, I advocate a hybrid, three-step strategy: Metric Selection, Threshold Selection, and Feedback. Metric Selection automates the identification of the most impactful metrics, using NLP to interpret each metric's meaning from its name, namespace, and metadata. Threshold Selection uses statistical analysis of historical data to set appropriate thresholds. Feedback from operators on the resulting alerts then refines both over time. This methodology excels with infrastructure metrics and shows promise for application-level data.
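The three steps can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method described above: keyword matching stands in for real NLP, a median plus scaled median-absolute-deviation rule stands in for whatever statistical analysis is actually used, and the names `score_metric`, `suggest_threshold`, and `adjust_sensitivity` are hypothetical.

```python
import re
from statistics import median

# Hypothetical vocabulary of terms that hint a metric is worth alerting on.
SIGNAL_WORDS = {"error", "latency", "cpu", "memory", "saturation"}


def score_metric(name):
    """Metric Selection: crude keyword stand-in for NLP on the metric name."""
    tokens = set(re.split(r"[^a-z0-9]+", name.lower()))
    return len(tokens & SIGNAL_WORDS)


def suggest_threshold(history, k=3.0):
    """Threshold Selection: median + k * scaled MAD of historical values."""
    if not history:
        raise ValueError("history must not be empty")
    med = median(history)
    mad = median(abs(x - med) for x in history)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    return med + k * 1.4826 * mad


def adjust_sensitivity(k, feedback, step=0.5, k_min=1.0, k_max=10.0):
    """Feedback: loosen after a false alarm, tighten after a missed incident."""
    if feedback == "false_alarm":
        k += step
    elif feedback == "missed":
        k -= step
    else:
        raise ValueError("feedback must be 'false_alarm' or 'missed'")
    return max(k_min, min(k_max, k))


if __name__ == "__main__":
    cpu_history = [50, 52, 48, 51, 49, 50, 95]  # made-up sample values
    print(score_metric("host.cpu.utilization"))
    print(suggest_threshold(cpu_history))
    print(adjust_sensitivity(3.0, "false_alarm"))
```

A median-based rule is used here because a single historical spike (the 95 in the sample) barely moves the suggested threshold, whereas a mean and standard deviation would be pulled upward by it.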

The payoff of this strategy is multifaceted: it significantly accelerates the detection of operational issues and simplifies their resolution. More importantly, it shifts the paradigm from manual monitoring to an intelligent, automated system. For operators, this means less noise, reduced debugging time, and minimal effort in maintaining alert configurations. In essence, it's about transforming abundant data from a challenge into a strategic asset, enhancing proactive system management and operational efficiency.

Diagnosis and Resolution

The detection of an incident is merely the first step; its diagnosis and resolution are where the real challenge lies. Here, the transformative potential of Large Language Models and generative AI, exemplified by tools like ChatGPT and Bard, is hard to overstate. These developments mark a departure from traditional practices where human engineers were indispensable for planning, reasoning, and reflection. Enter AI Agents, a new breed of semi-autonomous programs capable not just of reasoning and tool use but also of collaborating with humans and each other, heralding a new era in operational efficiency.

Consider a typical scenario: a glitch surfaces at 2 am. Traditionally, this would mean waking a human engineer for assessment. Now, imagine an AI agent stepping in as the first responder, using monitoring data to start an investigation instantly. This agent would not only perform standard checks, like reviewing recent deployments and scanning logs, but do so quickly and consistently. Even in situations requiring human intervention, the context gathered by the AI agent would make the human response far more effective.
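To make the first-responder idea concrete, here is a minimal sketch of the context-gathering step only, with no model call involved. The `Deployment` type, the two-hour lookback window, the log keywords, and the function names are all hypothetical choices for illustration; a real agent would pull this data from a deployment system and a log store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    service: str
    version: str
    deployed_at: datetime


def recent_deployments(deployments, alert_time, window=timedelta(hours=2)):
    """Return deployments that landed within `window` before the alert."""
    return [
        d for d in deployments
        if timedelta(0) <= alert_time - d.deployed_at <= window
    ]


def error_lines(log_lines, keywords=("ERROR", "FATAL")):
    """Return log lines containing any of the given severity keywords."""
    return [line for line in log_lines if any(k in line for k in keywords)]


def build_incident_context(alert_name, alert_time, deployments, log_lines):
    """Assemble a plain-text summary for an agent or an on-call engineer."""
    lines = [f"Alert: {alert_name} at {alert_time.isoformat()}"]
    lines.append("Recent deployments:")
    for d in recent_deployments(deployments, alert_time):
        lines.append(f"  {d.service} {d.version} at {d.deployed_at.isoformat()}")
    lines.append("Error log lines:")
    for line in error_lines(log_lines):
        lines.append(f"  {line}")
    return "\n".join(lines)


if __name__ == "__main__":
    alert_time = datetime(2023, 11, 9, 2, 0)
    deployments = [
        Deployment("checkout", "v42", datetime(2023, 11, 9, 1, 30)),
        Deployment("search", "v7", datetime(2023, 11, 8, 12, 0)),
    ]
    logs = ["INFO started worker", "ERROR timeout calling payments"]
    print(build_incident_context("HighErrorRate", alert_time, deployments, logs))
```

The summary produced this way could be handed to a language model as investigation context or attached to the page sent to a human, so the responder starts with the standard checks already done.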

The economic rationale for this shift is undeniable, especially considering the steep cost of downtime for enterprises. Investing a fraction of this cost in AI model usage is a strategic move, one that promises substantial savings. Privacy and security considerations are, of course, paramount, but they can be addressed through a combination of self-hosting, data redaction, and human oversight.
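As one illustration of the data redaction mentioned above, the following sketch masks email addresses and IPv4 addresses in a log line before it would be sent to a model. The patterns and the `redact` helper are assumptions for this example and are deliberately simple; real redaction would need to cover many more kinds of sensitive data.

```python
import re

# Deliberately simple patterns; they will not catch every sensitive value.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text):
    """Replace email addresses and IPv4 addresses with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return IPV4.sub("[IP]", text)


if __name__ == "__main__":
    print(redact("login failed for alice@example.com from 10.0.0.12"))
```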

As these AI agents evolve, becoming more adept at understanding system nuances through techniques like retrieval-augmented generation (RAG) and knowledge graphs, their effectiveness will only increase. They'll not only assist in incident resolution but will also act proactively, for example by reverting problematic code deployments or adjusting system capacity. This advancement will drastically reduce incident durations, allowing engineers to concentrate on their primary objective: adding value to the customer experience.

Conclusion

Looking ahead, we're entering an era dominated by AI-generated software, necessitating systems capable of near-autonomous operation. Having experienced the life of an on-call engineer, I can confidently say that the integration of AI in software operations isn't just a luxury; it's an imperative.

November 9, 2023
