IT problem management: When does an interruption in IT service deserve formal investigation?
When I recently presented at the itSMF Fusion conference about our work with Boeing, and at the ProjectWorld / Business Analyst World conference with Bell Canada, IT professionals often asked me, "How do we differentiate between an 'incident' and a 'problem'[1] that requires a formal root cause analysis?"
Many emerging ITIL programs either don't have a formal root cause tool in place or take an ad hoc approach to identifying problems. Either way, most don't know how to separate incidents from problems in order to right-size the investigation response, so that time and money are not spent needlessly.
It is most important to ensure that your root cause analyses (RCAs) are focused on the events that have -- or could have -- the greatest impact on the most important functions of your organization. That way, your RCA effort is aligned with the organization's business goals. To differentiate between incidents and problems -- or incidents and major incidents -- draw a line in the sand where the cost of the unwanted event outweighs the cost of a formal investigation. Set threshold criteria that specifically outline the kind of response that is appropriate for different scenarios, make it easy for anyone, including help desk staff, to make the decision, and remember to clearly outline the rationale to justify your criteria.
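As a rough illustration of that line in the sand, the comparison can be written as a single check. This is only a sketch: the function name and every dollar figure below are hypothetical, and a real estimate would come from your own business data.

```python
def warrants_formal_investigation(event_cost: float, investigation_cost: float) -> bool:
    """Return True when the estimated cost of the unwanted event
    (actual or potential) exceeds the cost of a formal investigation."""
    return event_cost > investigation_cost


# Hypothetical figures, for illustration only.
COST_PER_OUTAGE_HOUR = 2500.0    # assumed business cost of one hour of downtime
INVESTIGATION_COST = 40 * 120.0  # assumed 40 analyst hours at 120 per hour

four_hour_outage = 4 * COST_PER_OUTAGE_HOUR
half_hour_outage = 0.5 * COST_PER_OUTAGE_HOUR

print(warrants_formal_investigation(four_hour_outage, INVESTIGATION_COST))
print(warrants_formal_investigation(half_hour_outage, INVESTIGATION_COST))
```

With these assumed numbers, the four-hour outage costs more than the investigation and the half-hour outage does not, so only the first would cross the line.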
One way to create threshold criteria that define what Problem Managers should formally investigate would be to set up a table.
- In column 1, list business goals, targets, key performance indicators and objectives. These can be set up at all levels of the organization including corporate, business unit and project.
- In column 2, list how IT supports these goals or targets, such as services provided.
- In column 3, list how an IT-related event could compromise that goal or target. Specify the magnitude and direction of an event, such as "unplanned server outage more than one hour in duration during business hours." This third column essentially becomes the threshold criteria used to separate incidents from problems.
Engage people in the organization to help you define what level of service interruption or deviation could compromise the various business goals. Be aware that each person's definition of what constitutes a problem or major incident will be different.
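The three-column table described above could be captured in a small data structure so that help desk staff, or a ticketing tool, can apply it consistently. The sketch below is one possible shape, not a prescribed format: the class name, field names, and the sample business goal are hypothetical, and the single row reuses the "unplanned server outage more than one hour in duration during business hours" example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRow:
    business_goal: str         # column 1: goal, target, or KPI
    it_support: str            # column 2: how IT supports that goal
    event_type: str            # column 3: kind of IT-related event
    min_duration_hours: float  # column 3: magnitude that compromises the goal
    business_hours_only: bool  # column 3: whether timing matters


# Hypothetical example row; real rows come from your own organization.
THRESHOLD_TABLE = [
    ThresholdRow(
        business_goal="Process customer orders the same day",
        it_support="Hosting for the order-entry application",
        event_type="unplanned server outage",
        min_duration_hours=1.0,
        business_hours_only=True,
    ),
]


def classify(event_type: str, duration_hours: float,
             during_business_hours: bool, table=THRESHOLD_TABLE) -> str:
    """Return "problem" if any row's threshold is exceeded, else "incident"."""
    for row in table:
        if row.event_type != event_type:
            continue
        if row.business_hours_only and not during_business_hours:
            continue
        if duration_hours > row.min_duration_hours:
            return "problem"
    return "incident"


print(classify("unplanned server outage", 1.5, True))
print(classify("unplanned server outage", 0.5, True))
```

Keeping the business goal and the IT service in each row preserves the rationale next to the criterion, which makes it easier to justify the threshold when someone challenges it.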
Another way to approach problem threshold criteria, especially in larger organizations, would be to:
- Look at your service level agreement (SLA) and identify all the IT performance outputs - all the "IT will do X" statements.
- Outline "when X does not happen, then Y will happen." Determine what kind of response is appropriate when Y happens.
In order to do this successfully, you may need to more specifically define the general terms that are contained in many service agreements. For example:
- "Significant number of people affected." How many people?
- "Percentage of total tasks that can no longer be performed by individuals." What percentage? What tasks?
- "Significant impact on the delivery of customer product." What is a significant impact? One day late?
- "Significant risk to safety, law, rule, or policy compliance." How do you define significant risk? One un-licensed copy of a software program on a single desktop?
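One way to pin down general terms like these is to record an agreed number for each and check observed events against them. The sketch below assumes made-up key names and limits. None of these values come from a real service agreement, and your organization would need to negotiate its own.

```python
# Hypothetical, agreed-upon definitions for vague SLA terms.
SLA_DEFINITIONS = {
    "people_affected": 50,          # "significant number of people affected"
    "tasks_blocked_percent": 25.0,  # "percentage of total tasks" no longer possible
    "delivery_delay_days": 1,       # "significant impact" on customer delivery
    "unlicensed_copies": 1,         # "significant risk" to licence compliance
}


def breached_terms(observed: dict, definitions: dict = SLA_DEFINITIONS) -> list:
    """Return the sorted names of all terms whose observed value
    meets or exceeds the agreed limit. Missing values count as zero."""
    return sorted(
        name
        for name, limit in definitions.items()
        if observed.get(name, 0) >= limit
    )


# Example event: 120 people affected, 10% of tasks blocked.
event = {"people_affected": 120, "tasks_blocked_percent": 10.0}
print(breached_terms(event))
```

An empty result suggests the event stays an incident, while any breached term gives the help desk a concrete, documented reason to escalate.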
How does your organization decide when to perform a root cause analysis?
Question: What threshold criteria do you have in place?
Question: What best practices did you follow to create them?
Question: Who leads the effort? What role do systems and business analysts play?
Share your experiences and best practices. We're building an online community to support professional development and continuous improvement.
Posted by Mark Hall | Canada Account Manager, Instructor, & Investigator | 12/2/2009
[1] Within the incident management process, an incident is defined as an event that interrupts or reduces the Quality of Service. An incident may affect a single employee (i.e., user) but it does not affect the overall business in a significant way. Within the problem management process, a problem is defined as a series of incidents that adversely affects the business because it affects a group of employees.
