
Reimagining a new era of incident response as LLMs evolve

As companies in Asia Pacific continue to accelerate their digitalization, they are under enormous pressure to keep everything running in an increasingly complex IT environment. There is arguably more at stake in the region than anywhere else. New Relic’s 2023 Observability Forecast found that Asia Pacific has by far the highest average annual outage costs – more than double that of Europe and nearly 16 times that of North America.

IT teams are responsible not only for finding and resolving incidents as quickly as possible, but also for preventing these costly incidents from recurring. Unsurprisingly, many IT leaders in the region are watching the emergence of AI and the development of large language models (LLMs), and their potential to transform incident response as we know it.

Prevention is the north star of AI incident response, but experience counts

Many teams are already starting to see how AIOps technology can help minimize issues or impacts on the customer experience, for example through proactive anomaly detection, incident correlation to reduce alert noise, and automated analysis of the likely root cause.

The promise of AI in minimizing IT incidents appears to be limitless, with some even going so far as to say that it will ultimately prevent disruptions and outages altogether. However, skipping basic steps along the way, or limiting the hands-on experience that IT teams gain from incident response today, could prove detrimental to the advancement of LLMs.

For many IT teams today, it still takes too long to identify potential problems before they become incidents. Teams often work reactively, fighting one incident after another, and never find the time to implement processes that would let them identify problems before they cause disruption.

To master prevention with the support of LLMs, teams must first go through incident detection and resolution themselves. This step cannot be skipped, because it is the experience gained from finding and resolving incidents that builds the skills needed to implement mitigation strategies and take preventative measures. That experience will enrich both human teams and the ability of LLMs to understand and streamline large data sets and accomplish the diverse tasks within the incident response lifecycle.

Three ways LLMs will transform incident response

The incident response lifecycle can vary from organization to organization and even from team to team. Here are some of the opportunities within critical tasks across the incident response lifecycle:

  • Research: When an incident occurs, an engineer’s first step is to gather information and investigate the problem area. LLMs can play an important role in this process. With access to current and historical data, LLMs can analyze the incident, search past incidents to draw on previous experience, and reason over this data to recommend a possible path forward. By letting LLMs take on the role of researcher, SRE teams can save a significant number of hours of manual work.
  • Troubleshooting and Diagnostics: As LLMs evolve, teams can draw on the same research capability and leverage broader knowledge bases to help investigate an incident, including identifying the runbooks that apply to it. As the knowledge base extends beyond the organization to external knowledge, AI agents can perform automated root cause analysis by iteratively evaluating hypotheses based on local experience and world knowledge. They will be able to emulate human insight and work with human teams to reason and act, closing gaps left in earlier phases and then offering suggestions. For technology teams, the value lies in a shorter mean time to understand the impact and causes of incidents; for the business, it lies in a shorter mean time to resolution.
  • Post-mortem and incident documentation: Engineers collect and summarize information and conduct a post-mortem following an incident. A post-incident review is about analyzing failures to gain insight into why they occurred, how they impacted operations and, most importantly, how to prevent them in the future. This process can take weeks. Through search, summarization and reasoning skills, LLMs can facilitate the initial stages of a post-incident review by collecting, compiling, summarizing and analyzing the data and then recommending remediation strategies, thereby reducing the cognitive load on engineers and saving them a considerable amount of time.
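
To make the research step a little more concrete, here is a minimal Python sketch of searching past incidents for ones that resemble a new incident. Everything in it is an assumption for illustration: the incident records, the `similar_incidents` helper and the use of simple word overlap (Jaccard similarity) are hypothetical, and a real LLM-based system would more likely rely on semantic search over a much larger incident history.

```python
import re

# Hypothetical past-incident records, invented for illustration only.
PAST_INCIDENTS = [
    {"id": "INC-101", "summary": "checkout api latency spike after database failover"},
    {"id": "INC-102", "summary": "login errors caused by expired tls certificate"},
    {"id": "INC-103", "summary": "database connection pool exhausted during traffic peak"},
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of words two sets have in common, from 0.0 (none) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_incidents(description, incidents, top_k=2):
    """Return the ids of the past incidents most similar to the description."""
    query = tokenize(description)
    scored = [(jaccard(query, tokenize(i["summary"])), i["id"]) for i in incidents]
    # Drop incidents with no overlap, then rank by score (ties broken by id).
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [incident_id for _, incident_id in scored[:top_k]]


if __name__ == "__main__":
    print(similar_incidents("database failover caused checkout latency", PAST_INCIDENTS, top_k=1))
```

The ranked matches would then be handed to the engineer, or to an LLM as context, as a starting point for the investigation rather than as a diagnosis.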

As LLMs become more sophisticated, companies and their IT teams can look forward to improvements in the way incidents are managed and ultimately prevented. The caveat is that there are no shortcuts in the process and, more importantly, there is no substitute for the lived experience of human teams.

To carry out tasks effectively through logical reasoning, LLMs need human teams to have extensive lived and documented experience in incident response. Only then will the tools have the expected positive impact on incident response times, resolution times and overall results. The next chapter of incident response will be driven by more efficient ways for companies to respond to, manage and learn from incidents, underpinned by intelligence, automation and human-machine collaboration.

The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends. Photo credit: iStockphoto/ Yuri Altukhov