Thursday, July 23, 2009

Lessons learned - a menagerie

While writing up a paper the other night I got inspired to share some things...some lessons learned from incidents over the past year. Here's to hoping this helps or entertains.

Communication needs to be accurate and timely

When your IRT is in the middle of a widespread incident and you need to notify the organization at large, the information must be accurate. Tech support - your boots on the ground - needs accurate information to take remediation steps at the micro level. This information must also be communicated in a timely manner. At least two communications need to go out within the first 24 hours: the first to alert the organization, and the second to provide a status update.

SITREPS are valuable

When you or your IRT are dealing with an incident, it is vital to provide Situation Reports or SITREPS to your client and management. The frequency and depth of these SITREPS can be determined by the scope and severity of the incident.
A simple chart like this helps:
Tier 1 Incident - SITREP every 1-4 hours
Tier 2 Incident - SITREP every 8 hours
Tier 3 Incident - SITREP every 24 hours

SITREPS should contain the following information:
Who is doing what, where they are doing it, and when it will be done
Assessment of the situation
Updates on old news
Updates on new news
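
For anyone who wants to template this, here is a minimal sketch in Python of the chart and the checklist above. The names (Sitrep, SITREP_INTERVAL_HOURS, next_due, render) are made up for illustration, and using the upper bound of 4 hours for Tier 1 is an assumption; adjust it to whatever your IRT has agreed on.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

# Maximum hours between SITREPS per tier, taken from the chart above.
# Tier 1 is listed as 1-4 hours; the upper bound is assumed here.
SITREP_INTERVAL_HOURS = {1: 4, 2: 8, 3: 24}


@dataclass
class Sitrep:
    tier: int
    issued_at: datetime
    assignments: List[str]  # who is doing what, where, and when it will be done
    assessment: str         # assessment of the situation
    updates_old: List[str] = field(default_factory=list)  # updates on old news
    updates_new: List[str] = field(default_factory=list)  # updates on new news

    def next_due(self) -> datetime:
        """Latest time the next SITREP should go out for this tier."""
        return self.issued_at + timedelta(hours=SITREP_INTERVAL_HOURS[self.tier])

    def render(self) -> str:
        """Plain-text SITREP suitable for pasting into an email."""
        lines = [f"SITREP (Tier {self.tier}) issued {self.issued_at:%Y-%m-%d %H:%M}"]
        lines.append("Assignments:")
        lines += [f"  - {item}" for item in self.assignments]
        lines.append(f"Assessment: {self.assessment}")
        lines.append("Updates on old news:")
        lines += [f"  - {item}" for item in self.updates_old]
        lines.append("Updates on new news:")
        lines += [f"  - {item}" for item in self.updates_new]
        lines.append(f"Next SITREP due: {self.next_due():%Y-%m-%d %H:%M}")
        return "\n".join(lines)
```

Fill it in with your own incident details and send the output of render(); next_due() tells you when the following report is owed.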

Partnerships work well in a distributed environment

When you are the incident manager and you do not have full authority over a distributed environment, you must partner with the people in charge of the distributed environment. This is the only way to be successful in a crisis situation. The incident must become everyone's problem, and its seriousness must be communicated effectively.

Tech support and end users are like eyewitnesses

70% of what they tell you will be incomplete, misinformed or just plain wrong.

There will always be information that would have been helpful yesterday

Incidents do not always go perfectly. You will never have the full picture when you need it. Gather what information you can, assess the collected information, and make a decision. Adaptability is one of the key traits of a good incident responder.

Stop trying to prevent the last incident and focus on the next incident

Oftentimes after a significant incident, an organization will enter a tailspin trying to solve the last incident. Numerous resources will be poured into making sure 'it never happens again'. The reality of the matter is that it will happen again, just not in the same way. This is why incident follow-up is important. After an incident, you do need to address the root cause, but you also need to look forward to the next incident and begin preparation. As a former coach once said, "don't stand there and admire the ball after you shoot, keep moving."

In 30 years of computing the security industry has never solved a problem

Every time I go out on a call I am reminded of this nasty little truth. The security industry has never solved a problem. Imagine taking an exam with 8 non-trivial proofs. You are expected to complete them in 30 minutes. This is an almost impossible task. My money is on an incomplete exam and mistakes in the proofs you have attempted. Due to the constant evolution in the technology world, problems never get solved and history repeats itself frequently. It is because of this that Incident Responders should keep current, and pay attention to history.

Don't be afraid to say you don't know

This one is tough for a lot of people to digest. People seem to want the wrong answer instead of a non-committal one. There is nothing wrong with not knowing everything. Better to not know and find out, than to appear to know and show yourself to be wrong later.

Due Diligence is not the same as Investigation

If you are approached by a client and they engage you to perform a task to do their due diligence, it is not the same as investigating a matter to search for the truth. Those that want due diligence are simply looking to CYA. Those that truly want an investigation will be in search of root cause, impact, and conclusion.

Routine Investigations only exist in news articles

Every investigation this past year has been different. The only routine things about an investigation are the tools and process used. Nothing takes 5 minutes, and getting to point B is never a straight line. Commit your tools and process to memory and train yourself and your team. This way, when the investigation changes course, you can adapt easily.

Establish working relationships with key vendors you rely on, and customers that rely on you

Incident response is a two-way street. If you have a product that your organization relies on to conduct operations, ensure you have a strong working relationship with its vendor. Meet with all vendors at least once per year, if not more. This pays off for both sides and keeps both sides informed of needs and opportunities. In a time of need, you will want that vendor on the phone assisting you with their product. Likewise, if you are serving a client, you want to have a good relationship. Visit your clients when there is not a crisis. This lowers stress and fosters trust and respect.

Don't hold on too tight and remember to breathe

When functioning at a high operational tempo for extended periods of time, you will experience burnout. As a result, efficiency and productivity decrease drastically. Know yourself well enough to know when it's time to decompress and give yourself some breathing room. If you manage a team, take your team out for drinks and laughs once in a while, send people to training, give them comp time. Do anything and everything to keep yourself and your team operating at peak performance levels.

Incident detection should not overwhelm analysis capabilities

When you draft budgets or seek funding for projects that involve incident detection, remember that every detected incident requires resources to respond to it and, ultimately, to analyze the data. When detection overwhelms your ability to analyze incidents, you experience backlogs and rash decision making. Remember that an analysis takes approximately 20-40 hours on average, and a good analysis cannot be rushed. Keep analysis requirements in mind any time you are looking to improve your detection. Great, you detected an incident. Can you respond to it and analyze it?
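
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The function names are made up for illustration. The default of 30 hours per analysis is just the midpoint of the 20-40 hour range above, and the 40 analyst-hours per week is an assumption that ignores meetings, travel, and everything else that eats an analyst's time.

```python
def analyses_per_week(analysts: int,
                      hours_per_analysis: float = 30.0,
                      analyst_hours_per_week: float = 40.0) -> float:
    """How many analyses the team can complete in a week."""
    return analysts * analyst_hours_per_week / hours_per_analysis


def weekly_backlog_growth(incidents_per_week: float,
                          analysts: int,
                          hours_per_analysis: float = 30.0,
                          analyst_hours_per_week: float = 40.0) -> float:
    """Incidents added to the backlog each week (0.0 if the team keeps up)."""
    capacity = analyses_per_week(analysts, hours_per_analysis, analyst_hours_per_week)
    return max(0.0, incidents_per_week - capacity)
```

Under those assumptions, three analysts can finish about four analyses a week, so a detection project that surfaces five incidents a week grows the backlog by one incident every week. Run the numbers with your own staffing before you ask for more sensors.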