Troubleshooting and solving problems in Operations – A pragmatic approach

By definition, those of us in IT operations and technical support spend a significant part of our time solving problems for our users and customers. No great revelation there.

Don’t confuse my use of problem with the capital-P Problem as defined in ITIL. What I mean by problem is anything IT-related that is adversely impacting a customer.

Does every lowercase-p IT problem benefit from following a systematic approach? My experience leads me to believe so. So did some smarty pants MIT folks. See the 1995 paper "Systematic versus Intuitive Problem Solving on the Shop Floor: Does it Matter?" The context of this paper is manufacturing, but the concepts discussed are universal.

There is no one right approach. Every method will miss steps, have steps in the wrong order, or have steps impossible to complete for some problem domains. The important thing is to establish and evolve an approach that works for you and your team. Be creative and open-minded. Matching your methodology to the creativity, skills, and experience of the team is more important to problem solving than the details of the methodology itself. There is no substitute for developing expertise in the environment, platform, or application.

One possible set of steps in one possible order:

Describe the problem and gather initial data from the user
Triage and prioritize the problem with the team
Establish a repeatable scenario that results in the problem happening
Engage subject matter experts
Gather and share detailed information from the IT systems
Identify and evaluate possible causes of the problem
Determine and evaluate the possible solutions to the most likely causes
Develop an action plan to apply the possible solutions
Implement the first preferred solution in the action plan
Repeat as needed

Describe the problem and gather initial data

Identify the application or service that the user is attempting to use and the specific steps the user takes
Describe how the problem impacts the user and the circumstances in which it happens. Be as specific as possible
Define what should happen if the problem doesn’t exist
Establish the problem scope. Does the issue happen for just one user, all users, users at a specific site, etc.
Determine how the problem is impacting the business

Most organizations probably have SOPs for initial data gathering from the user. If your organization doesn’t, establish some.
Examples:

When did the problem first happen
On what device does the problem happen
What network is the device attached to
If the user has access to multiple devices does the problem happen on all the devices
Is the issue with a specific application or specific application component
…
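One way to make questions like these stick is to capture them in a structured record so that unanswered items are obvious before the ticket moves on. The following is only a minimal sketch: the class name, field names, and sample values are made up for illustration and cover just the questions listed above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IntakeRecord:
    """Initial data gathered from the user. Field names are illustrative."""
    first_seen: Optional[str] = None     # when the problem first happened
    device: Optional[str] = None         # device on which the problem happens
    network: Optional[str] = None        # network the device is attached to
    all_devices: Optional[bool] = None   # does it happen on all of the user's devices
    application: Optional[str] = None    # application or component involved


def missing_answers(record: IntakeRecord) -> list[str]:
    """Return the names of the questions that have not been answered yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Hypothetical sample values, not taken from any real incident.
record = IntakeRecord(first_seen="Monday 8am", device="laptop", application="payroll")
print(missing_answers(record))  # → ['network', 'all_devices']
```

Whoever takes the call can see at a glance what still needs to be asked.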

Triage the problem

Developing the ability to triage an IT problem is a critical skill for IT service support teams.

Despite the best efforts of an IT team, there will always be recurring problems that, for whatever reason, are never completely removed from the environment or the platform. Or perhaps it is a user training issue. If the issue is one of these, triage is the point in the process where it should be positively recognized. So don’t waste time. Get the issue to the right team to execute the known solution, or get the right information to the customer and ensure the customer moves past the problem.

Establish a repeatable scenario

This is a critical step. Every troubleshooting procedure should include this.

Examples:

Clear Chrome browser cache. Use Chrome to browse to a particular website URL and observe the 500 server error message
Unlock user account on Friday at 5pm. User attempts first log in on the following Monday at 8am and account is already locked
User establishes RDP session to server and attempts to launch application from the taskbar. Application fails to launch.

Understanding how to recreate the problem is essential to gathering detailed data from IT system logs and developing and testing a solution.
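Where the scenario can be scripted, it helps to run it several times and confirm that it fails every time. Here is a rough sketch based on the 500 server error example. The function names are my own, and `fetch_status` is a placeholder for whatever actually performs the request and returns an HTTP status code; the stand-in below simply returns 500 so the sketch runs without a network.

```python
from collections import Counter
from typing import Callable


def run_scenario(fetch_status: Callable[[], int], attempts: int = 5) -> Counter:
    """Run the same scenario several times and tally the status codes seen."""
    return Counter(fetch_status() for _ in range(attempts))


def is_repeatable(results: Counter, expected_status: int = 500) -> bool:
    """True only if every attempt produced the expected failure."""
    return set(results) == {expected_status}


def fake_fetch() -> int:
    """Stand-in for a real HTTP request (assumption: always fails with 500)."""
    return 500


results = run_scenario(fake_fetch, attempts=3)
print(results[500], is_repeatable(results))  # → 3 True
```

A mixed tally, say two 500s and one 200, tells you the scenario is not yet repeatable and that some condition is still unidentified.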

Engage subject matter experts

We have all joined troubleshooting calls only to sit unneeded as the problem was worked to a solution by SMEs for other components. That’s always possible. It’s up to whoever performs the triage and initial information gathering to determine who should be engaged. If the problem is impacting important business functions, err on the side of caution. Now is the time to start keeping a timeline of the troubleshooting steps.
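The timeline does not need to be fancy. A shared document works fine, but if you would rather script it, something like the following would do. This is a sketch only; the class, its fields, and the sample entries are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Timeline:
    """Append-only record of who did what, and what happened, in UTC."""
    entries: list = field(default_factory=list)

    def record(self, who: str, action: str, result: str = "") -> None:
        # Timestamp each step at the moment it is recorded.
        self.entries.append((datetime.now(timezone.utc), who, action, result))

    def render(self) -> str:
        # One line per step, oldest first, ready to paste into the incident record.
        return "\n".join(
            f"{ts:%Y-%m-%d %H:%M:%SZ} | {who} | {action} | {result}"
            for ts, who, action, result in self.entries
        )


timeline = Timeline()
timeline.record("network SME", "joined call")
timeline.record("app SME", "restarted service", "problem persists")
print(timeline.render())
```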

Gather and share detailed information from the IT systems

Change data for the impacted systems. Recent upgrades, patches, configuration changes, etc
Recent incident records related to the impacted systems or impacted user
Log data that corresponds to the time frame of the problem. Easier said than done, I know. With luck, rerunning the repeatable scenario will create fresh error log entries
Other data as the SMEs recommend
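Pulling the log lines for the right time frame is the kind of chore worth scripting. The sketch below assumes each log line starts with an ISO 8601 timestamp without a time zone, followed by a space. Real log formats vary widely, so treat the parsing as a placeholder. The sample lines are invented.

```python
from datetime import datetime, timedelta


def lines_in_window(lines, start: datetime, end: datetime):
    """Yield log lines whose leading timestamp falls within [start, end]."""
    for line in lines:
        stamp, _, _ = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # no leading timestamp (for example, a continuation line)
        if start <= when <= end:
            yield line


# Invented sample data for illustration.
log = [
    "2024-01-08T07:58:12 INFO login ok",
    "2024-01-08T08:00:03 ERROR account locked",
    "    continuation line without a timestamp",
    "2024-01-08T08:30:00 INFO heartbeat",
]
problem_time = datetime(2024, 1, 8, 8, 0)
window = timedelta(minutes=5)
for line in lines_in_window(log, problem_time - window, problem_time + window):
    print(line)
```

Note that this simple version drops continuation lines, which may matter if the errors you care about span multiple lines.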

Determine and evaluate the possible solutions to the most likely causes

This is the creative part. The more potential solutions the team generates, the more likely success becomes. Experienced judgment is required to determine how long to spend on this step.

Suggested rules for the discussion:

Everyone needs to be able to say what they think
Listen to all ideas
Shared ideas belong to the team, which discusses them until it can either prove them valid or reject them
Establish a set of actionable decisions for what to do, by whom, and by when
Discuss negative impacts of the solutions
Discuss how to back out the solutions

Nothing new here. These meeting rules were made popular by Toyota.

Not all solutions are created equal. For example, restoring service might be accomplished by restoring a server or application from a last known good recovery point. But that solution could cause loss of data that might create unacceptable damage to the business.

Some solutions take longer than others. If restoring service is critical, the team should consider short-term actions to restore service even if it means the problem is likely to recur.

An example: a site-to-site VPN fails. The SME knows how to immediately rebuild the tunnel, and also knows that it will stay up for 8 hours and then drop again. The service is critical right now to some ongoing activity, so the plan could be to (1) restore service now, (2) automate the tunnel rebuild actions, and (3) perform detailed investigations to determine root cause and solution outside of the time constraints of a critical response. Preventing problem recurrence may be secondary to timely restoration of service.
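Step (2) of the VPN example could be as simple as a watchdog that checks the tunnel on an interval and rebuilds it when it is down. The sketch below is deliberately generic: `tunnel_is_up` and `rebuild_tunnel` are hypothetical callables standing in for whatever commands or API calls your VPN platform actually provides, and the stand-ins in the example only simulate a tunnel that drops once.

```python
import time
from typing import Callable


def keep_tunnel_up(
    tunnel_is_up: Callable[[], bool],
    rebuild_tunnel: Callable[[], None],
    checks: int,
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Check the tunnel `checks` times, rebuilding it whenever it is down.

    Returns the number of rebuilds performed.
    """
    rebuilds = 0
    for i in range(checks):
        if not tunnel_is_up():
            rebuild_tunnel()
            rebuilds += 1
        if i < checks - 1:
            sleep(interval_seconds)  # wait before the next check
    return rebuilds


# Simulated tunnel states: up, down, up. No real VPN is touched here.
states = iter([True, False, True])
count = keep_tunnel_up(
    tunnel_is_up=lambda: next(states),
    rebuild_tunnel=lambda: print("rebuilding tunnel"),
    checks=3,
    sleep=lambda seconds: None,  # skip real waiting in the example
)
print(count)  # → 1
```

The rebuild count is worth adding to the timeline, since how often the tunnel drops is evidence for the root cause investigation in step (3).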

This part of the process may require input from business stakeholders. Don’t be shy about asking for this.

Implement the first preferred solution in the action plan

Rubber meets road. A repeatable scenario is required for this step.

Keep up the timeline documentation
Verify that the problem has been corrected
If the problem persists determine if the corrective action should be backed out
Implement the next preferred solution in the plan if needed
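The steps above amount to a loop: apply a solution, rerun the repeatable scenario, and move on to the next solution if the problem is still there. Here is a minimal sketch with made-up names. One simplification to note: it always backs out a solution that did not work, whereas in practice that is a judgment call for the team.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Solution:
    name: str
    apply: Callable[[], None]
    back_out: Callable[[], None]


def work_the_plan(
    plan: list[Solution], problem_reproduces: Callable[[], bool]
) -> Optional[str]:
    """Try each solution in order; return the name of the first one that works."""
    for solution in plan:
        solution.apply()
        if not problem_reproduces():  # rerun the repeatable scenario
            return solution.name
        solution.back_out()  # simplification: always back out a failed attempt
    return None  # plan exhausted; time to revisit causes and solutions


# Simulated environment: only the second solution fixes the problem.
state = {"fixed": False}
plan = [
    Solution("restart service", apply=lambda: None, back_out=lambda: None),
    Solution(
        "roll back patch",
        apply=lambda: state.update(fixed=True),
        back_out=lambda: state.update(fixed=False),
    ),
]
print(work_the_plan(plan, problem_reproduces=lambda: not state["fixed"]))  # → roll back patch
```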

Repeat as needed

Failure is not an option.

Celebrate success

This last bit isn’t strictly part of the troubleshooting activity but is important. Acknowledging resolution of a problem should be as formal or informal as appropriate.