Apollo Root Cause Analysis

IT incident management: Continuously improve response to problems

What are the consequences of putting a patch on an IT incident rather than conducting a formal root cause analysis?  On the other end of the spectrum, what are the consequences of formally analyzing every little thing?  Time, money and other resources are at stake...along with your ability to meet Information Technology Infrastructure Library (ITIL) and Service Level Agreement (SLA) expectations.

Catastrophic failures with enterprise-wide impacts are obvious and are easily identified as problems requiring formal investigation.  Fundamentally, though, many IT organizations struggle to define the difference between some types of incidents and problems, and grapple with how to classify and respond to these unwanted service interruptions appropriately.

The key is to focus on escalating incidents that are preventing your organization from achieving specific goals or targets.  Once you have outlined your first draft of threshold criteria (as explained in our previous IT blog):

  • Continuously improve your approach to RCA over time. Setting and working with threshold criteria in problem management is a fluid process.
  • Take a look at the organization's history. Forecast what your group might face in the coming year. If the threshold for a given goal is set at level Z, how many major incidents would you expect to be triggered in a year if Z defined a major incident? Ask yourself whether your team can handle that number of investigations per year. If not, you may need to raise the threshold criteria to bring them into balance with your team's capacity to conduct RCAs.
  • Review threshold criteria periodically. Adjust up or down and add new ones as needed. Threshold criteria should stay in sync with changing organizational activities and situations.
  • Consider human bandwidth. Scale the threshold criteria to match your team's bandwidth to complete formal investigations on time. If the threshold is set to look good on paper and satisfy company "politics," the number of actual investigations triggered by that threshold may overwhelm your capacity to investigate each problem.
  • Don't sacrifice quality. The quality of the investigation is often sacrificed to meet the requirement to close out an investigation by a certain deadline. This situation typically leads to ineffective solutions and the recurrence of problems. Evaluate whether it's worth reconsidering deadlines for the sake of RCA quality and problem elimination.
  • Adjust your threshold criteria as you experience fewer problems. In the first few years of building your problem management process, you may generate a high number of RCAs in any one year. But as you use a robust RCA tool to prevent some of those problems from recurring, you start eliminating systemic issues, which reduces future risk. The number of RCAs your group faces is then likely to decrease as your program matures. We'll cover systemic issues in more depth in a future blog.
  • Demonstrate the effect that formal problem management is having on the organization's bottom line, for instance a 50% reduction in events of type A over the last year. We'll cover the bottom line more in a future IT blog.
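
The history-based forecast described above (how many major incidents would level Z have triggered, and can the team handle that many?) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the impact measure (downtime in minutes), the sample history, the candidate thresholds, the yearly RCA capacity and the function names are all hypothetical, not part of any Apollo tool or ITIL definition.

```python
from typing import Optional, Sequence


def expected_major_incidents(history: Sequence[float], threshold: float) -> int:
    """Count past incidents whose impact met or exceeded the threshold."""
    return sum(1 for impact in history if impact >= threshold)


def lowest_sustainable_threshold(
    history: Sequence[float],
    candidates: Sequence[float],
    capacity: int,
) -> Optional[float]:
    """Return the lowest candidate threshold that would have triggered
    no more investigations than the team can complete in a year.

    Returns None if even the highest candidate exceeds capacity.
    """
    for threshold in sorted(candidates):
        if expected_major_incidents(history, threshold) <= capacity:
            return threshold
    return None


# Hypothetical data: downtime in minutes for each incident last year.
last_year_downtime = [5, 12, 45, 90, 30, 240, 15, 60, 120, 8, 75, 200]
candidate_thresholds = [15, 30, 60, 120]  # assumed options for "level Z"
rca_capacity_per_year = 4  # assumed number of formal RCAs the team can finish

for z in candidate_thresholds:
    count = expected_major_incidents(last_year_downtime, z)
    print(f"Z = {z} min -> {count} investigations")

chosen = lowest_sustainable_threshold(
    last_year_downtime, candidate_thresholds, rca_capacity_per_year
)
print("Lowest sustainable threshold:", chosen)
```

Re-running the same count on each year's incident log is one simple way to carry out the periodic review: as recurring problems are eliminated, a lower threshold may fit within the same capacity.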

Share your experiences and best practices.  We're building an online community to support professional development and continuous improvement.

Posted 31 March 2010 by Brian Hughes, Apollo Vice President
