Monday, October 20, 2008

OODA at play

Imagine the following scenario:

You've identified a system communicating with a botnet C&C over IRC. This happens to be a system that should never be communicating via IRC. It's a webserver. It's running multiple vhosts and has multiple IP addresses. The IRC connection is active. You check in with the system administrator and inform him of the situation. You discover that the webserver is merely serving public data. It doesn't process or store sensitive information. It's a good case for root cause analysis, eradication and rebuilding.

The system administrator calls you back and says it looks like SSH binaries have been replaced on the system. The administrator happens to be running cfengine and informs you that a large number of systems have had SSH binaries replaced. What was a run-of-the-mill investigation and analysis just blew up and turned into an incident for which there is no playbook. Friends, this is a triage situation.

Triage is less about solving the problem than it is about prioritizing systems and stopping the bleeding, to buy time to properly assess the situation and react appropriately. The problem with triage is business continuity. Triage situations would be much easier if we could identify all of the affected systems, contain systems based on priority and threat, and move to more thorough response and analysis. Unfortunately we can't do that. The systems that need to be contained, more often than not, can't be contained because they are critical to operations, meaning they can't be shut down.

Returning to the incident at hand: over 50 systems have had SSH binaries replaced. At this point we need to triage the situation. Were we dealing with human beings, this would be a mass casualty incident (MCI), and a methodology called START (Simple Triage and Rapid Treatment) would be applied. When dealing with human beings in an MCI, the priority goes to the most critical patient who can't survive long without immediate treatment. The job of the people performing triage is to assess only. No care is provided except opening airways and tending to patients who are bleeding severely. A good starting point is here. People get classified into the following categories:

1) Immediate (red tag): life-threatening injuries that need treatment right away
2) Delayed (yellow tag): serious injuries that can wait a short time
3) Minor (green tag): the "walking wounded"
4) Deceased or expectant (black tag)

There's a lot that can be taken from this type of real triage in a mass casualty situation and applied to Incident Response when dealing with a lot of systems.

What kind of systems do we typically come across? Let's use the incident I mentioned above. Assume 50 systems. Assume the attacker is actively attacking and compromising systems. There are obvious limitations to physically visiting each system. So what can we do? Assess the situation from the network. In a few easy steps we can triage the situation. With 50 systems, it's rare that you would find different attackers and different methodologies being used against you. So we make an assumptive hypothesis based on the following premise: cfengine detected SSH binary replacement on 50 systems, therefore the attack signature will be similar across systems. In addition, we can assume that very few remote systems will be used in such an attack. So what can be done to triage?

We can quickly divide the systems in to the following categories:

4) Systems that can't be blocked at the perimeter
3) Systems that can't be taken offline (network or power)
2) Systems that can be blocked at the perimeter (internally critical systems)
1) Systems that can be taken offline (network or power)
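
The list above can be read as a small decision procedure. Here is a minimal Python sketch of that reading, assuming each system can be described by two yes/no answers (can it go offline, can it be blocked at the perimeter) and that systems in categories 3 and 4 are handled together, in place. The class names, field names and hostnames are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    TAKE_OFFLINE = 1        # category 1
    BLOCK_AT_PERIMETER = 2  # category 2
    RESPOND_IN_PLACE = 3    # categories 3 and 4 (assumption: handled together)


@dataclass
class System:
    name: str                     # hypothetical hostname
    can_take_offline: bool        # network or power can be pulled
    can_block_at_perimeter: bool  # a perimeter block will not break operations


def triage(system: System) -> Action:
    """Pick the quickest containment action available for one system."""
    if system.can_take_offline:
        return Action.TAKE_OFFLINE
    if system.can_block_at_perimeter:
        return Action.BLOCK_AT_PERIMETER
    return Action.RESPOND_IN_PLACE


def work_queue(systems):
    """Order systems so the quickest containment wins come first."""
    return sorted(systems, key=lambda s: triage(s).value)
```

Feeding the full list of affected hosts to work_queue gives the administrators their "take it offline now" batch at the front and leaves the hard cases for the IR staff at the back.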

Now you might be asking why priority 1 is a system that can be taken offline rather than one that can't. The idea is simple. If I can take it offline, then I should do so by whatever means are necessary. If I can't take the system offline, the task of response is more advanced. Assign the system administrators or other tech staff the role of identifying and containing the systems that can be quickly contained. The idea being that if you are hemorrhaging from 50 holes and can close 30 of them, then you've cut down the tedious work by 60%. Get them under control and off of the immediate concern list.

If I can block a host at the perimeter, then I should do so, quickly. This is a solution that can work to directly cut off the attacker; however, with so many systems, there is no way to guarantee the effectiveness of this type of action. An indirect attack is still very possible. Sometimes, though, you just have to make a decision and adapt.

If I can't take a system offline, and I can't block it at the perimeter then I need to respond quickly and carefully. These are the business continuity cases that hamper triage and response. So what can be done to triage them? Remember we're buying time, not solving the problem 100%.

If we work from our assumptive hypothesis, we can enable a perimeter block to stop the remote sites from being accessed by any of the compromised or soon-to-be-compromised systems.
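
The hypothesis that only a few remote systems are involved is something flow data can check before the block goes in. A rough Python sketch, assuming the flow records have already been reduced to (source IP, destination IP) pairs and that the internal address space is a single network; the function name and its parameters are illustrative, not taken from any particular tool.

```python
from collections import defaultdict
import ipaddress


def shared_remote_hosts(flows, compromised, internal_net, min_hosts=2):
    """Return (remote_ip, count) pairs for remote addresses contacted by at
    least min_hosts compromised systems, most widely shared first.

    flows        -- iterable of (src_ip, dst_ip) strings
    compromised  -- set of internal IPs already known to be compromised
    internal_net -- internal address space, e.g. "10.0.0.0/8" (assumption)
    """
    net = ipaddress.ip_network(internal_net)
    seen = defaultdict(set)
    for src, dst in flows:
        # Only count traffic from known-bad hosts to addresses outside our network.
        if src in compromised and ipaddress.ip_address(dst) not in net:
            seen[dst].add(src)
    hits = [(dst, len(srcs)) for dst, srcs in seen.items() if len(srcs) >= min_hosts]
    return sorted(hits, key=lambda item: (-item[1], item[0]))
```

If the hypothesis holds, the result is a short list and each entry is a candidate for the perimeter block. A long list is a sign that the premise needs revisiting.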

Have you noticed the OODA loops?

As systems are being contained - via network blocks and physical containment - more compromised systems begin actively attacking. Port scanning begins on internal hosts. Initial triage, while containing 60% of systems, left an opening. Once again, division of forces is key to success. With two IR staff, one can work on active containment while the other works to gather more intelligence.

F-Response is a fantastic intelligence gathering tool in this case. Using it, a remote system can be analyzed in real time. Connecting to a system and being able to identify the files created, modified or accessed during the attack lends itself to a more rapid action cycle. Combined with traffic captures and live access to disk-based data, we can break into the OODA loop of the attacker. We can predict what the attacker's tools will be used for and what files get replaced. We can predict what the attacker will do at each step and can develop a rapid active response to stop him before he begins. With the situation unfolding and new information coming in, further containing the systems that couldn't be taken offline or blocked at the perimeter becomes simple. With a tool like cfengine, a few commands can remove the active threat and we can continue working the problem.
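
To make the "files touched during the attack" idea concrete, here is a small Python sketch. It assumes the remote volume is already available read-only at some local path and that you have a rough start and end time for the attack from the traffic captures. It is not an F-Response feature, just a plain directory walk, and the function name is made up.

```python
import os


def files_touched_between(root, start, end):
    """Return (path, mtime, ctime) for files under root whose modification
    time or inode-change time falls within [start, end] (epoch seconds),
    oldest modification first.

    Note: st_ctime is the inode-change time on Unix, not the creation time.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # unreadable entry, skip it and keep going
            if start <= st.st_mtime <= end or start <= st.st_ctime <= end:
                hits.append((path, st.st_mtime, st.st_ctime))
    return sorted(hits, key=lambda h: h[1])
```

Timestamps can be forged, so treat the output as a lead to follow, not as proof. Replaced binaries and dropped tools tend to cluster in time, and that cluster is what lets you anticipate the attacker's next step on the other hosts.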

As the situation is contained, a signature is developed and active monitoring is implemented to watch for other systems showing signs of intrusion.
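
One simple host-side check that fits this incident is comparing the SSH binaries against known-good checksums. The sketch below is an illustration only, not the signature developed during the response: it assumes you have a trusted baseline of SHA-256 digests (from a clean build or package repository), and the function names are made up.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large binaries do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replaced_binaries(baseline):
    """Compare files against a known-good baseline.

    baseline -- dict mapping path to expected SHA-256 hex digest (assumption:
                the baseline itself comes from a trusted source)
    Returns a list of (path, actual_digest) for mismatches, or
    (path, "unreadable") when the file cannot be opened.
    """
    suspicious = []
    for path, expected in baseline.items():
        try:
            actual = sha256_of(path)
        except OSError:
            suspicious.append((path, "unreadable"))
            continue
        if actual != expected:
            suspicious.append((path, actual))
    return suspicious
```

Run on a live, possibly compromised host, the result is only as trustworthy as the host itself, so a check like this is best run against the disk from a clean analysis system.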