Using ML to Accelerate Incident Management
October 3, 2023 | Source: https://securityboulevard.com/2023/10/using-ml-to-accelerate-incident-management/

When a security incident occurs, you don’t want to be caught off guard. You want to quickly know about the issue, drill down into the cause and fix it, ideally before the incident affects anything on the user side. However, traditional application performance monitoring and remediation efforts don’t always pinpoint issues correctly and often create a deluge of information to sift through, slowing root-cause analysis.

I recently met with Ajay Singh, CEO of Zebrium, and gathered input from other experts to consider the role machine learning (ML) might play in our new age of AI-driven cybersecurity defensive measures. If adopted correctly, AI and ML could advance these efforts in many ways, such as spotting errors and vulnerabilities, communicating issues to engineers and continually improving defensive postures.

Below, we’ll consider some ways ML and generative AI could accelerate incident management. We’ll see how large language models (LLMs) could be leveraged to monitor event streams and uncover causation, and we’ll consider how this benefits an organization’s overall cybersecurity posture.

The State of Incident Management

The status quo of incident management is that it’s fraught with suboptimal, manual effort, explained Singh. The mean time to detect incidents, whether security breaches, downtime or other errors, can be quite high for DevOps or site reliability engineering (SRE) teams. Of course, users might complain of outages on social media, but depending on third-party outlets for reports is not ideal.

Another element is the complexity of modern software stacks. No one person has a total picture of the landscape and, as there are so many components, the environment can act in unforeseen ways when things go wrong. This also means there is a lot of data to analyze, which could cause alert fatigue. Most observability tools are designed to present an answer when you know what to ask, added Singh, but you often don’t know what you’re looking for or what to ask in the first place.

How ML Can Help

Moving from detection to remediation involves a lot of troubleshooting. Singh described the root cause analysis process as “hunting for the unknown” since it has many uncertainties. Machine learning, he said, could help correlate symptoms with causes in a much more transparent way. Here are some specific ways ML could enhance incident management efforts.

Automated Anomaly Detection

A great strength of recent LLMs is that they excel at analyzing large quantities of data for anomalies. Therefore, you could use a large language model to analyze a sequence of log events to spot issues, said Singh. Since incidents typically produce an unusual event in the log stream, running LLMs over these outputs across the stack could immediately surface problems with infrastructure, the data pipeline, containers, software bugs or even third-party APIs.
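As a rough illustration of the idea (not any vendor’s actual implementation), the sketch below shows one simple way to flag unusual log lines: reduce each line to a template and flag templates that rarely or never appeared in a recent baseline. The function names and threshold are illustrative assumptions.

```python
import re
from collections import Counter

def template_of(line: str) -> str:
    """Reduce a raw log line to a rough template by masking numbers and hex IDs."""
    line = re.sub(r"0x[0-9a-fA-F]+|\b[0-9a-f]{8,}\b", "<ID>", line)
    return re.sub(r"\d+", "<NUM>", line).strip()

def rare_events(baseline_logs: list[str], new_logs: list[str], threshold: int = 1) -> list[str]:
    """Flag lines in new_logs whose template was rarely or never seen in the baseline."""
    seen = Counter(template_of(l) for l in baseline_logs)
    return [l for l in new_logs if seen[template_of(l)] <= threshold]

if __name__ == "__main__":
    baseline = [
        "GET /health 200 12ms",
        "GET /health 200 11ms",
        "worker 42 finished batch 7",
    ]
    incoming = [
        "GET /health 200 13ms",
        "ERROR connection to db-primary refused after 3 retries",
    ]
    for line in rare_events(baseline, incoming):
        print("anomaly candidate:", line)
```

An LLM-based approach would go further than this template matching, but the shape is the same: learn what normal looks like, then surface what deviates from it.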

Generate Root Cause Reports

Another area where LLMs could aid incident management is aggregating data and correlating errors to produce concise root cause analysis reports. For instance, the system might analyze the event streams and discover network corruption caused by chaos engineering. According to Singh, LLMs could analyze the situation and produce a simple, distilled description of what’s going on, along with the relevant keywords.
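A minimal sketch of how such a report might be drafted, assuming an OpenAI-compatible chat API is available; the model name, prompt wording and client usage here are illustrative assumptions rather than any vendor’s actual pipeline:

```python
# Sketch: draft a root cause summary from correlated anomalies.
# Requires the `openai` package and an API key in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_root_cause_report(anomalies: list[str]) -> str:
    prompt = (
        "You are assisting an SRE team. Given these correlated anomalous log "
        "events, write a short root cause hypothesis with relevant keywords:\n\n"
        + "\n".join(f"- {a}" for a in anomalies)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap for whatever is available
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: feed it the anomaly candidates flagged by the detection step.
# print(draft_root_cause_report(["ERROR connection to db-primary refused after 3 retries"]))
```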

Natural Language Descriptions

Leaning on LLMs to quickly spin up natural language descriptions is a significant benefit. Natural language can turn opaque log data into a more human-readable description with actionable remediation tips. This could significantly lower the barriers associated with correctly diagnosing and responding to incidents.

Automatically Trigger Resolutions

Taking this further, AI could automatically trigger the next steps to respond to incidents as they arise. “A common approach to accelerating incident management using machine learning is to use ML technology to diagnose the issue and compare with previously known resolutions that can be applied automatically,” said Wing To, vice president of engineering for value stream delivery platform & DevOps at Digital.ai.
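To make that concrete, here is a hedged sketch of the pattern To describes: a lookup of known resolutions keyed by the diagnosed issue, gated by a human confirmation before anything runs. The issue signatures and actions are hypothetical.

```python
# Hypothetical mapping from diagnosed issue signatures to known remediation
# actions, with a human confirmation step before anything is triggered.
KNOWN_RESOLUTIONS = {
    "db-connection-refused": "kubectl rollout restart deployment/db-proxy",
    "disk-pressure": "scale up the node pool by one instance",
}

def propose_resolution(diagnosis: str) -> str | None:
    return KNOWN_RESOLUTIONS.get(diagnosis)

def remediate(diagnosis: str) -> None:
    action = propose_resolution(diagnosis)
    if action is None:
        print(f"No known resolution for '{diagnosis}'; escalate to the on-call engineer.")
        return
    answer = input(f"Apply known resolution '{action}'? [y/N] ")
    if answer.lower() == "y":
        print(f"Triggering: {action}")  # in practice, call a runbook or automation API
```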

Predictive Use Cases

While the above actions are reactive, it’s worth noting that machine learning could also be used to shift security awareness left and avoid incidents in the first place. “ML and AI can be applied in a predictive way to understand the risk of a change failing in production so organizations can decide if a change should be made,” said To. “Similarly, when applications are in production, ML and AI can identify emerging patterns that predict if a major incident is about to occur again, enabling organizations to proactively address the situation before it becomes an incident.”
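Purely as an illustration of the predictive angle, the sketch below trains a tiny change-failure-risk model on made-up historical change records using scikit-learn; the features, data and risk interpretation are all assumptions.

```python
# Illustrative only: a tiny change-failure-risk model trained on made-up
# historical change records.
from sklearn.linear_model import LogisticRegression

# [lines_changed, files_touched, off_hours_deploy (0/1)] per past change
X = [
    [20, 2, 0], [500, 30, 1], [15, 1, 0], [800, 45, 1],
    [60, 5, 0], [300, 25, 1], [10, 1, 0], [650, 40, 1],
]
y = [0, 1, 0, 1, 0, 1, 0, 1]  # 1 = the change caused an incident

model = LogisticRegression().fit(X, y)

candidate_change = [[420, 18, 1]]
risk = model.predict_proba(candidate_change)[0][1]
print(f"Estimated failure risk: {risk:.0%}")
```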

How to Implement ML for Incident Management

While certain anomalies are common across different organizations, Singh recommended training a bespoke ML model on your own data to deliver the best results. This could be accomplished using a tool like Zebrium, for instance, which would first hook into your data sources, analyze your Helm charts, talk to the underlying APIs and learn the application’s ins and outs. ML is adaptable enough to uncover and understand varied log formats.

Next, you’d train the algorithm to create a baseline of typical behavior by ingesting large amounts of data. This could take anywhere from hours to days, said Singh. You’ll want to give the system enough time to correlate, say, an unusual event in service A with an error in service B. And since the modern software environment isn’t static, this training should be ongoing as the baseline shifts.
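One simplified way to picture that correlation step, assuming anomalous events are already tagged with a timestamp and a service name, is to group events that land in the same short time window and involve more than one service; the field names and window size below are assumptions.

```python
# Rough sketch of windowed correlation: group anomalous events from different
# services that occur within the same short time window.
from collections import defaultdict

WINDOW_SECONDS = 120

def correlate(events: list[dict]) -> list[list[dict]]:
    """events: dicts with 'ts' (epoch seconds), 'service', 'message'."""
    groups = defaultdict(list)
    for e in sorted(events, key=lambda e: e["ts"]):
        groups[e["ts"] // WINDOW_SECONDS].append(e)
    # keep only windows where more than one service misbehaved
    return [g for g in groups.values() if len({e["service"] for e in g}) > 1]

events = [
    {"ts": 1000, "service": "A", "message": "unusual retry spike"},
    {"ts": 1060, "service": "B", "message": "ERROR upstream timeout"},
    {"ts": 9000, "service": "C", "message": "routine cache eviction"},
]
for group in correlate(events):
    print("correlated incident window:", group)
```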

From there, the AI could begin to proactively catch issues and deliver root cause reports. But it may still take time to continually improve your signal-to-noise ratio, said Singh. The first goal is to find errors; the next is to establish correlation and causation. Beyond that, AI could be used to automatically jumpstart remediation or trigger a runbook, but you usually need a human in the loop to ensure a safe response, said Singh.

Benefits of Using ML

Perhaps the foremost benefit of using machine learning in this fashion is faster incident detection and resolution. Mean time to remediation (MTTR) could be significantly reduced through automated analysis that surfaces correlations and cuts through the confusion of a flood of alerts. “MTTR reduction is where machine learning can help a huge amount,” said Singh.

Another potential benefit is inserting more automation into the remediation process. “The human’s job is never going to go away,” said Singh. However, anything repetitive with known responses, such as brute force attacks, could be tied to automated procedures. “If developers can take care of routine issues, then they can focus on designing better tools and workflows,” said Singh. Still, if the remediation flow isn’t wholly automated, LLMs could suggest helpful next steps such as upgrading, restarting or scaling up instances.

Drawbacks

Of course, no technology is a silver bullet, and using AI within incident management comes with security concerns of its own. The most immediate is the risk of leaking personally identifiable information (PII), which can end up in logs. Additional investment in LLM security will therefore be required to prevent such leaks.
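A minimal redaction pass along these lines might look like the sketch below; the regex patterns are illustrative and far from exhaustive, so a real deployment would need broader coverage and review.

```python
# Illustrative redaction pass over log lines before they are sent to an LLM.
# These regexes only cover a few obvious PII shapes.
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(line: str) -> str:
    for label, pattern in PATTERNS.items():
        line = pattern.sub(f"<{label.upper()}>", line)
    return line

print(redact("login failed for jane.doe@example.com from 203.0.113.7"))
# -> login failed for <EMAIL> from <IPV4>
```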

Another potential limitation of this practice is the difficulty of incorporating new ML-driven insights into pre-existing monitoring workflows that are more human-centric or built around other tools. For example, how would new solutions mesh with PagerDuty alerts in Slack or workflows built on Grafana data? It will take another step to integrate ML insights into the new security ‘war room,’ admitted Singh. “When new AI or ML technology enters the marketplace, to take advantage of it, humans need to change their behaviors slightly,” he said.

To get around this limitation, Singh advocates using programmable, ML-driven incident management platforms. Cloud-native, API-accessible AI will help extend these security insights to other realms and open the door to more integrations between vendors.
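As a small example of that kind of integration, assuming a standard Slack incoming webhook (PagerDuty and similar tools expose comparable event APIs), an ML-generated summary could be pushed into an existing channel like this; the webhook URL is a placeholder.

```python
# Sketch: push an ML-generated root cause summary into an existing Slack
# channel via an incoming webhook.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def post_to_slack(summary: str) -> None:
    payload = json.dumps({"text": f":rotating_light: Root cause candidate:\n{summary}"})
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget; check the response in real code

# post_to_slack("db-primary refused connections; correlated with failed deploy of auth-service")
```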

Final Thoughts on Getting Started

Many procedures in the software development lifecycle are operational in nature, and AI could undoubtedly accelerate them; incident management is no exception.

Petr Baudis, CTO of Rossum, said incident management combines incident identification, incident analysis and resolution, and prevention of future incidents. “Machine learning can dramatically help in all these phases,” he said. Yet Baudis cautioned teams to take the right approach: “The first step is to think, ‘How can machine learning give my team leverage?’ rather than ‘How can machine learning automate my team’s work?'”

As we’ve already discussed, security alarms can produce a lot of noise, and finding the signals that matter can be challenging. Therefore, Baudis encourages teams to choose ML tools that excel at anomaly detection. “Once the team gathers in a war room to investigate the incident, they need excellent observability capabilities to trace the symptoms to a root cause (the signal) while ignoring flurries of related but distracting alerts that start overwhelming the system (the noise).”

The hope is that ML can highlight true signals out of the noise, even in highly complex software ecosystems. To enact this, he recommends considering ML features within your existing observability tools that actually “learn” through pattern matching. “Our favorite tool at Rossum to assist with incident management is Lacework,” added Baudis. “Their Polygraph engine is an excellent example of applying machine learning to find signals in the noise.”
