Incident Response Metrics

Having led a number of Incident Response teams focused on APTs over the last couple of years, I’ve seen my fair share of interesting things. APT incidents are, by nature, a much lower-volume occurrence than non-targeted malware incidents, but the decisions you make have a much larger impact. While many security folks do not believe that you need metrics data surrounding incidents, I disagree. IR metrics help leadership understand the real story, drive process improvement, and assist in directing spend by providing insight into what your future defense, detection, and response capabilities should look like. The following are some of the key operational metrics I find most important.

Dwell Time – Dwell Time is the interval between the point at which a host is compromised and the point at which that compromise is detected. For many organizations that are just standing up IR capabilities to deal with APT, Dwell Time may be astronomical. You have most likely been owned for years, and as your capabilities to detect compromises come online, you may see Dwell Times of one year, two years, or even longer. It is important to understand that in the beginning this metric will not mean much. However, as time advances and your IR program matures, Dwell Time should begin to trend downwards. Ultimately, with a clean environment and proper detection in place, your hope is to get Dwell Time down to weeks, days, hours, or even minutes. A continued decrease in Dwell Time correlates directly with a successful and mature IR program. Also note that this metric may take a year or more to start showing success, as an IR program does not mature overnight.
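As a rough illustration of how the downward trend could be tracked, the sketch below computes Dwell Time per incident and takes the median per detection quarter. The record layout, the function names, and the sample dates are all assumptions made for the example; in practice the compromise time is itself an estimate derived from forensic analysis.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def dwell_days(compromised_at, detected_at):
    """Dwell Time in days: detection time minus (estimated) compromise time."""
    return (detected_at - compromised_at).total_seconds() / 86400


def median_dwell_by_quarter(records):
    """Group incidents by the quarter in which they were detected and
    return {quarter_label: median dwell time in days}, oldest quarter first."""
    buckets = defaultdict(list)
    for compromised_at, detected_at in records:
        quarter = f"{detected_at.year}-Q{(detected_at.month - 1) // 3 + 1}"
        buckets[quarter].append(dwell_days(compromised_at, detected_at))
    return {q: median(days) for q, days in sorted(buckets.items())}


# Hypothetical records: (estimated compromise time, detection time).
records = [
    (datetime(2010, 6, 1), datetime(2012, 2, 1)),
    (datetime(2011, 1, 15), datetime(2012, 3, 10)),
    (datetime(2012, 3, 1), datetime(2012, 5, 20)),
    (datetime(2012, 8, 1), datetime(2012, 8, 9)),
]

for quarter, days in median_dwell_by_quarter(records).items():
    print(quarter, round(days, 1))
```

The median is used here rather than the mean so that a single multi-year compromise discovered late does not mask an otherwise improving trend; that is a design choice for the sketch, not a rule.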


Containment Time – Containment Time is another key metric that should be measured. It boils down to how quickly you can mitigate the risk a compromised asset poses to every other asset on your network. There are a lot of different camps on this topic. Some believe that keeping hosts online and understanding the full extent of the compromise before doing a mass contain is the optimal solution. Others believe that playing ‘whack-a-mole’ and containing assets the moment you identify them is the optimal solution. Still others believe the answer lies somewhere in the middle. Regardless of philosophy, measuring your Containment Time will show you how well your organization is maturing. Notice also in the below chart that I have added an SLA line. Having goals for your organization is important regardless of your philosophy, and the metrics I’m outlining will allow you to easily measure against them. Also note the outliers called out below. Those are the cases you need to look at individually to determine the root cause of the missed SLA. Many times you’ll find that something went amiss and/or that the process needs changes to ensure the problem is not repeated.
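The SLA comparison and the outlier list can be produced with very little code. The following is a minimal sketch, assuming Containment Time is measured from detection to containment and using a made-up 24-hour SLA; the record layout, host names, and function names are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical SLA; pick whatever target fits your containment philosophy.
CONTAINMENT_SLA = timedelta(hours=24)


def containment_time(detected_at, contained_at):
    """Assumed definition: time from detection until the host is contained."""
    return contained_at - detected_at


def sla_report(records, sla=CONTAINMENT_SLA):
    """records: iterable of (host, detected_at, contained_at).

    Returns (fraction of hosts contained within the SLA,
             list of (host, duration) that missed it, longest first)."""
    durations = [(host, containment_time(d, c)) for host, d, c in records]
    if not durations:
        return None, []
    misses = sorted(
        (m for m in durations if m[1] > sla),
        key=lambda m: m[1],
        reverse=True,
    )
    return 1 - len(misses) / len(durations), misses


# Hypothetical sample data.
records = [
    ("host-a", datetime(2012, 5, 1, 9), datetime(2012, 5, 1, 15)),
    ("host-b", datetime(2012, 5, 2, 9), datetime(2012, 5, 3, 15)),
    ("host-c", datetime(2012, 5, 3, 9), datetime(2012, 5, 6, 9)),
    ("host-d", datetime(2012, 5, 4, 9), datetime(2012, 5, 4, 10)),
]

within, misses = sla_report(records)
print(f"within SLA: {within:.0%}")
for host, duration in misses:
    print("review root cause:", host, duration)
```

The list of misses is the part worth a human's time: each entry is a candidate for an individual root-cause review.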


With regard to the timeline of an event, you start to have something along the lines of this graphic, which illustrates a linear approach to measuring Incident Response using the critical metrics leadership is generally most concerned with.


Collect Time & Volume – Regardless of the philosophy you follow for containing incidents, you will need to gather forensic evidence to help understand the extent of the compromise. In my experience, evidence collection has the longest cycle time in the entire response process (after detection) and will directly impact your ability to flesh out the details as quickly as possible. Without tracking your Collect Time, as well as particular data about each collection attempt, you will not know where you can improve. For example, have you ever tried to collect data from a user in Indonesia? How about someone who is on vacation for two weeks? What about over VPN in South Africa? Without both measuring the time to collect data and diving into the details of the outliers, you will not understand where you can attack the process and remove waste. Additionally, understanding the largest bottleneck in the process will allow you to appropriately set management expectations as to when more details will be forthcoming.
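One way to find where collection stalls is to tag each collection attempt with a circumstance and compare volume and median time per tag. This is only a sketch: the categories, the hours, and the function name are invented for illustration.

```python
from collections import defaultdict
from statistics import median


def collect_summary(attempts):
    """attempts: iterable of (category, hours_to_collect).

    Returns [(category, attempt_count, median_hours)], slowest category first."""
    buckets = defaultdict(list)
    for category, hours in attempts:
        buckets[category].append(hours)
    rows = [(cat, len(hrs), median(hrs)) for cat, hrs in buckets.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical collection attempts: (circumstance, hours from request to evidence in hand).
attempts = [
    ("office LAN", 2),
    ("office LAN", 4),
    ("remote over VPN", 30),
    ("user on vacation", 300),
    ("remote over VPN", 50),
    ("office LAN", 3),
]

for category, count, hours in collect_summary(attempts):
    print(f"{category}: {count} attempts, median {hours} h")
```

Sorting slowest first puts the largest bottleneck at the top, which is also the number to quote when setting expectations with management.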


Analysis Time & Volume – One of the coolest parts of IR is the analysis of forensic data. Analysis, while rooted in science, is still a bit of an art form; however, this does not mean that it can’t be measured. There are many steps and tasks that can be automated to drive down Cycle Time and, once again, without understanding where you are starting from, you can’t measure your improvement. Additionally, one thing you may notice when you track this is an uptick in Cycle Time during an incident. This may lead you to seek out more resources if your organization’s appetite dictates a faster response. I’ll also add that, given the resource-intensive nature of performing analysis, buying beefier hardware for your analysts will almost certainly decrease Analysis Time. Automating parts of the analysis process is another area where you may see Cycle Time decrease. Also, as a tidbit of wisdom, one of the most brilliant analysts I’ve known has told me, “It takes a lot more time to prove a negative than to prove a positive.” Keep that in mind, and do not leverage this metric to put pressure on the analysts performing the work; they want to know the answer just as quickly as leadership does!


When considering the above measurements in aggregate, one method I’ve seen for displaying them visually is a timeline. While I’m not a huge fan of this method, it is useful for a one-off look at where you started and where you currently are.


Detection – Detection is probably one of the most critical pieces to measure in the organization because, unlike the previously mentioned steps in IR, detection usually involves very expensive tools that can quickly drain a budget. For example, if you had a chart like the below, what could you learn from it? You may want to understand why Tripwire isn’t performing like you thought, and perhaps take a look at MIR as well. How much are you spending on those products per year? Are we using our tools? Are we tuning all of our tools properly? Do we need more training? Should those dollars be better spent elsewhere? Why is IDS so successful? Can the tools handle the indicators our Intel teams are gathering? Etc…
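The data behind a chart like this is just a tally of which tool first detected each confirmed incident, optionally set against what each tool costs. Here is a small sketch; the tool names come from the discussion above, but every count and dollar figure is made up, and the function names are hypothetical.

```python
from collections import Counter


def detection_share(sources):
    """sources: one tool name per confirmed incident (the tool that detected it).

    Returns [(tool, count, percent_of_detections)], most productive first."""
    counts = Counter(sources)
    total = sum(counts.values())
    return [(tool, n, 100 * n / total) for tool, n in counts.most_common()]


def cost_per_detection(sources, annual_cost):
    """annual_cost: {tool: yearly spend}. Tools with no detections map to None."""
    counts = Counter(sources)
    return {
        tool: (cost / counts[tool] if counts[tool] else None)
        for tool, cost in annual_cost.items()
    }


# Hypothetical year of confirmed detections and hypothetical yearly spend.
sources = ["IDS"] * 6 + ["AV"] * 3 + ["MIR"]
annual_cost = {"IDS": 120000, "AV": 60000, "MIR": 90000, "Tripwire": 80000}

for tool, count, percent in detection_share(sources):
    print(f"{tool}: {count} detections ({percent:.0f}%)")
print(cost_per_detection(sources, annual_cost))
```

A tool that shows up with `None` has produced no confirmed detections for its spend, which is exactly the kind of result that should prompt the tuning, training, and budget questions.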


False Positive Rate for Detection – In Incident Response, we do a lot of what I call “turning over rocks,” looking for anomalies in the environment, and these hits are driven by our detection tools. What is important to know is how good our tools are at finding things, and that they aren’t simply creating unnecessary work for us. Looking at the chart below you can begin asking a lot of questions, especially when coupled with the previous chart. Perhaps AV isn’t a great detection source, perhaps MIR needs better-written IOCs, and perhaps we need to invest more in IDS and/or learn from the successes there. There is a great post on Precision by @chrissanders88 that goes into more detail around False/True Positives and False/True Negatives, which is definitely worth a read.
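For alert triage, the practical number is the share of a tool's investigated hits that turned out to be nothing, which is one minus precision. (The textbook false positive rate divides by false positives plus true negatives, and true negatives are rarely countable for alerts.) The sketch below assumes each investigated hit is recorded as a true or false positive; the per-tool tallies and the function name are hypothetical.

```python
def alert_quality(true_positives, false_positives):
    """Precision and false-positive share for one tool's investigated hits.

    precision            = TP / (TP + FP)
    false_positive_share = FP / (TP + FP)  (equivalently, 1 - precision)
    Returns None when the tool produced no hits at all."""
    total = true_positives + false_positives
    if total == 0:
        return None
    return {
        "precision": true_positives / total,
        "false_positive_share": false_positives / total,
    }


# Hypothetical tallies per tool: (true positives, false positives).
tallies = {"IDS": (45, 15), "AV": (5, 95), "MIR": (12, 28)}

for tool, (tp, fp) in tallies.items():
    quality = alert_quality(tp, fp)
    print(f"{tool}: precision {quality['precision']:.0%}, "
          f"false positives {quality['false_positive_share']:.0%}")
```

Read alongside detection volume, this separates a tool that finds little but is usually right from one that finds a lot mostly by burying analysts in noise.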


In conclusion, Incident Response metrics are imperative to the success of any security (or IT) organization, and the above are just a handful of useful ways to leverage them. Metrics allow leadership to make decisions based on data and facts, and remove emotion and anecdotes from critical decision-making processes. By implementing an IR program built around the metrics mentioned above, the IR leader will be able to show the maturation of the organization and its processes, pinpoint areas of focus for process improvement, and identify gaps in prevention, detection, and operational capabilities. Without building a metrics program and basing decisions on it, you delay improving your capabilities and open up the possibility of misdirected spending. When dealing with APT, those decisions can have a substantial impact.