A good incident response plan defines key roles, specific playbooks and well-rehearsed actions, all backed by clear and precise communication. The best incident response plans are written in pencil: regularly revised and deeply familiar to those who will be on the frontline of the response.
There are a few crucial aspects to any incident response plan, the first being assessment. Take stock of what has happened. Is this a false positive? Has the system been compromised, or is this just an attempt? You need to understand what has happened, and what is still happening, to know what to do about it. At a technical level, this often comes down to logs. The need for robust logging across all systems is a very common “finding” when companies that have recently suffered a cyber attack review the incident in retrospect. You might think you do good logging. Put it to the test.
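One way to put logging to the test is to compare your asset inventory against the systems that have actually sent logs recently. The sketch below is illustrative only: the host names, the `last_seen` mapping and the one-hour threshold are assumptions, and in practice the timestamps would come from whatever log platform you use.

```python
from datetime import datetime, timedelta, timezone

def find_logging_gaps(inventory, last_seen, now, max_silence=timedelta(hours=1)):
    """Return inventory hosts with no log events within max_silence, sorted by name."""
    gaps = []
    for host in inventory:
        seen = last_seen.get(host)
        # A host counts as a gap if it has never logged or has gone quiet.
        if seen is None or now - seen > max_silence:
            gaps.append(host)
    return sorted(gaps)

# Hypothetical data: three hosts in the inventory, only two have ever sent logs.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
inventory = ["web-01", "db-01", "vpn-01"]
last_seen = {
    "web-01": now - timedelta(minutes=5),  # healthy
    "db-01": now - timedelta(hours=6),     # gone quiet
}

print(find_logging_gaps(inventory, last_seen, now))  # ['db-01', 'vpn-01']
```

Any host this reports is one where, during an incident, you would have little or nothing to assess.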
The next element of an incident response plan is containment. Just as when responding to a public health emergency, you must understand the origin and isolate the infection. If an attacker has breached your systems, identify the access points they used and remove them from the internet. Block communication with the attacker’s command-and-control servers. This is often easier said than done, but this step of the process will separate mature organisations from the poorly prepared. Mistakes happen, and as we’ve seen over the past few years, phishing and social engineering tactics remain highly effective, even against security-focused organisations such as major identity providers. The key is how quickly you are able to stem the tide of damage.
The third element of a good incident response plan is building the team that will carry out the response. Sometimes these are dedicated staff with incident response (IR) as their sole responsibility, and other times these are staff who are authorised to take on an IR role if and when incidents occur. Both are just fine, as long as the team is prepared and well-practised in their roles. One critical part of this is the identification of an Incident Response Lead, a single manager who is the sole decision-maker for the incident and is empowered by technical leadership to act quickly. This removes any question about who has the authority to make difficult decisions, which often need to be made in very time-critical situations when normal reporting lines won’t work.
Once you have your team in place under a dedicated incident management leader, and you have assessed the situation and contained the outbreak, next comes disaster recovery. I advise the organisations I work with to automate as many backup and recovery processes as possible. For example, every week have an automated task that takes a web application, builds a new version from source code, spins up the associated infrastructure, copies the database from a backup, and then launches, with traffic pointed to the new system. From a software development perspective, this is where infrastructure as code comes into play. You want these plans largely written in code and run on a regular basis, with execution reports and (hopefully) success metrics to share with senior technical leadership.
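The weekly rebuild described above can be sketched as a small drill runner that executes each recovery step in order, times it, and produces an execution report. This is a minimal sketch, not a real deployment pipeline: the step functions (`build_from_source`, `provision_infrastructure`, `restore_database`, `switch_traffic`) are hypothetical placeholders that would call your own build and infrastructure-as-code tooling.

```python
import time
from dataclasses import dataclass

@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: str = ""

def run_drill(steps):
    """Run (name, action) steps in order, stopping at the first failure."""
    results = []
    for name, action in steps:
        start = time.monotonic()
        try:
            action()
            results.append(StepResult(name, True, time.monotonic() - start))
        except Exception as exc:
            results.append(StepResult(name, False, time.monotonic() - start, str(exc)))
            break  # later steps depend on earlier ones, so stop here
    return results

def drill_succeeded(results, expected_steps):
    """A drill passes only if every expected step ran and succeeded."""
    return len(results) == expected_steps and all(r.ok for r in results)

# Hypothetical placeholder steps; replace the bodies with real tooling calls.
def build_from_source(): pass
def provision_infrastructure(): pass
def restore_database(): pass
def switch_traffic(): pass

steps = [
    ("build from source", build_from_source),
    ("provision infrastructure", provision_infrastructure),
    ("restore database from backup", restore_database),
    ("point traffic at new system", switch_traffic),
]

report = run_drill(steps)
for r in report:
    print(f"{r.name}: {'ok' if r.ok else 'FAILED'} ({r.seconds:.2f}s) {r.error}")
print("drill passed:", drill_succeeded(report, len(steps)))
```

The per-step timings double as a rough measure of how long a real recovery would take, which is the kind of metric worth sharing with senior technical leadership.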
One of the most often overlooked parts of an incident response plan is practising it. Too often, organisations build plans but do so with teams looking only at disaster recovery or risk management. Those plans are then kept in a silo, away from the teams that actually need to implement them when an incident arises. And let’s face it, with the increasing frequency of attacks and instances of compromise, we need to shift our mindset from “if a compromise happens” to “when a compromise happens”. Acknowledging that a compromise will happen is the first step towards a stronger security posture and a more resilient organisation.
The next step is building a plan that you can practise, and you do need to practise it. Start with a small plan covering the basics: How are you handling backups? Are these kept in a separate system from your general IT and production environments? How quickly can you restore from these backups when the need arises? Have you tested that?
When it comes to measuring an organisation’s maturity in dealing with incidents, there’s a really useful lesson to draw on from the discipline of site reliability engineering. There is a concept called ‘chaos engineering’, which is effectively deliberate fault testing of systems to understand their breaking points and build them back better. Engineers will test the resilience of clusters of virtual machines or containerised services by intentionally knocking some of them over, through unexpected load or sometimes a simulated security error, and seeing how well the overall web of systems is able to rebalance and repair itself. We can apply this same principle to incident response. Using this model, have a team other than the one responsible for disaster recovery come up with some measured bits of “chaos” to inject into your tech, and see how well your plan works in practice. These exercises are most informative when you take a “no-notice” approach, meaning the teams doing the incident response don’t know this is a drill until after they have finished the work. This may sound stressful, but if you’ve already completed successful practice runs, this is a logical next step that will greatly improve your organisation’s ability to respond to the unknown.
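To make the chaos engineering idea concrete, here is a toy simulation of the technique: knock over some nodes at random, then check whether the services they hosted can be rebalanced onto the survivors. Everything here is an assumption for illustration (the node and service names, the fixed per-node capacity, the greedy least-loaded placement). Real chaos tooling acts on live infrastructure rather than an in-memory dictionary.

```python
import random

def inject_chaos(nodes, kill_count, rng):
    """Pick kill_count nodes at random to knock over."""
    return set(rng.sample(sorted(nodes), kill_count))

def rebalance(placement, alive, capacity):
    """Move services off dead nodes onto the least-loaded surviving nodes.

    Returns (new_placement, unplaced), where unplaced lists the services
    that could not fit anywhere.
    """
    load = {node: 0 for node in alive}
    new_placement, orphans = {}, []
    for service, node in sorted(placement.items()):
        if node in load:
            new_placement[service] = node
            load[node] += 1
        else:
            orphans.append(service)
    unplaced = []
    for service in orphans:
        candidates = [n for n in sorted(load) if load[n] < capacity]
        if not candidates:
            unplaced.append(service)
            continue
        target = min(candidates, key=lambda n: load[n])
        new_placement[service] = target
        load[target] += 1
    return new_placement, unplaced

# Hypothetical cluster: four nodes, six services, three services per node at most.
nodes = {"n1", "n2", "n3", "n4"}
placement = {"s1": "n1", "s2": "n2", "s3": "n3", "s4": "n4", "s5": "n1", "s6": "n2"}

rng = random.Random(0)  # seeded so the drill is repeatable
killed = inject_chaos(nodes, kill_count=1, rng=rng)
survivors = nodes - killed
new_placement, unplaced = rebalance(placement, survivors, capacity=3)
print("killed:", sorted(killed))
print("services left without a home:", unplaced)
```

With one node lost, the remaining capacity is enough and nothing is left unplaced. Raising `kill_count` to 3 leaves a single node with room for three services, so three of the six cannot be placed. That is the breaking point the exercise is designed to find.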
Getting leadership involved in disaster response practice sessions can make all the difference when it comes to support from the C-suite. It is important to involve non-technical leaders in disaster planning and incident response work by conducting tabletop exercises or actual response plan testing. This direct involvement from the board enables them to understand the gravity of the situation, endorse the plan of action and make sure the programme gets the institutional support and funding that it needs to be successful when cyber incidents arise.
Elliott Wilkes is chief technology officer at Advanced Cyber Defence Systems

