Troubleshooting and solving problems in Operations – A pragmatic approach

By definition those of us in IT operations and technical support spend a significant part of our time solving problems for our users and customers. No great revelation there.

Don’t confuse my use of “problem” with the capital-P Problem as defined in ITIL. What I mean by problem is anything IT-related that is adversely impacting a customer.

Does every lower-case-p IT problem benefit from following a systematic approach? My experience leads me to believe so. So did some smarty-pants MIT folks. See the 1995 paper “Systematic versus Intuitive Problem Solving on the Shop Floor: Does it Matter?” The context of this paper is manufacturing, but the concepts discussed are universal.

There is no one right approach. Every method will miss steps, have steps in the wrong order, or have steps impossible to complete for some problem domains. The important thing is to establish and evolve an approach that works for you and your team. Be creative and open-minded. Matching your methodology to the creativity, skills, and experience of the team is more important to problem solving than the details of the methodology itself. There is no substitute for developing expertise in the environment, platform, or application.

One possible set of steps in one possible order:

  1. Describe the problem and gather initial data from the user
  2. Triage and prioritize the problem with the team
  3. Establish a repeatable scenario that results in the problem happening
  4. Engage subject matter experts
  5. Gather and share detailed information from the IT systems
  6. Identify and evaluate possible causes of the problem
  7. Determine and evaluate the possible solutions to the most likely causes
  8. Develop an action plan to apply the possible solutions
  9. Implement the first preferred solution in the action plan
  10. Repeat as needed

Describe the problem and gather initial data

  • Identify the application or service that the user is attempting to use and the specific steps the user takes
  • How the problem impacts the user and the circumstances in which it happens. Be as specific as possible
  • Define what should happen if the problem doesn’t exist
  • Establish the problem scope. Does the issue happen for just one user, all users, users at a specific site, etc
  • How is this problem impacting the business

Most organizations probably have SOPs for initial data gathering from the user. If your organization doesn’t, establish some.
Examples:

  • When did the problem first happen
  • On what device does the problem happen
  • What network is the device attached to
  • If the user has access to multiple devices does the problem happen on all the devices
  • Is the issue with a specific application or specific application component
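
The intake questions above can be captured as a simple record so that nothing gets skipped. The Python sketch below is only an illustration: the class name IntakeRecord and every field name are invented for this example and are not taken from any standard or ticketing tool, so adapt them to your own SOPs.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import List, Optional


@dataclass
class IntakeRecord:
    """Initial data gathered from the user (all field names are illustrative)."""
    application: str = ""          # application or service the user is trying to use
    steps_taken: str = ""          # the specific steps the user takes
    expected_behavior: str = ""    # what should happen if the problem didn't exist
    scope: str = ""                # one user, all users, users at one site, ...
    business_impact: str = ""      # how the problem is impacting the business
    first_seen: Optional[datetime] = None
    device: str = ""
    network: str = ""
    happens_on_all_devices: Optional[bool] = None

    def missing_fields(self) -> List[str]:
        """Return the names of questions that have not been answered yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) in ("", None)]


# Example with made-up values: three questions answered, the rest still open.
record = IntakeRecord(application="Payroll web app", device="Laptop", scope="One user")
print(record.missing_fields())
```

A check like missing_fields() is a cheap way to see what still needs to be asked before the problem goes to triage.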

Triage the problem

Developing the ability to triage an IT problem is a critical skill for IT service support teams.

  • Despite the best efforts of an IT team, there will always be recurring problems that, for whatever reason, are not completely removed from the environment or the platform. Or perhaps it is a user training issue. If the issue is one of these, it will be positively recognized at this point in the process. So don’t waste time. Get the issue to the right team to execute the known solution, or get the right information to the customer and ensure the customer moves past the problem.

Establish a repeatable scenario

This is a critical step. Every troubleshooting procedure should include this.

Examples:

  • Clear Chrome browser cache. Use Chrome to browse to a particular website URL and observe the 500 server error message
  • Unlock user account on Friday at 5pm. User attempts first log in on the following Monday at 8am and account is already locked
  • User establishes RDP session to server and attempts to launch application from the taskbar. Application fails to launch.

Understanding how to recreate the problem is essential to gathering detailed data from IT system logs and developing and testing a solution.
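
Where the scenario is something a script can drive, writing it down as code makes it repeatable by anyone on the team. The sketch below is one possible take on the first example (a URL that returns a 500 error), using only the Python standard library. The function names are made up for illustration, and a script has no browser cache, so sending a Cache-Control header is only a rough stand-in for clearing the Chrome cache.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to url.

    Network-level failures (DNS, refused connection) raise URLError and are
    deliberately not caught: they are a different problem from a 500 response.
    """
    request = urllib.request.Request(url, headers={"Cache-Control": "no-cache"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is what we want.
        return err.code


def reproduces_problem(
    url: str,
    attempts: int = 3,
    fetch: Callable[[str], int] = http_status,
) -> bool:
    """Return True only if every attempt yields a 5xx status."""
    return all(500 <= fetch(url) < 600 for _ in range(attempts))
```

Calling reproduces_problem() with the affected URL before and after a fix gives a simple yes/no answer. The fetch parameter exists so the check can be exercised without touching the network.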

Engage subject matter experts

We have all joined troubleshooting calls only to sit unneeded as the problem was worked to a solution by SMEs for other components. That’s always possible. It’s up to whoever performs the triage and initial information gathering to determine who should be engaged. If the problem is impacting important business functions, err on the side of caution. Now is the time to start keeping a timeline of the troubleshooting steps.
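
A timeline can be as simple as a shared document, but if the team prefers something scriptable, a minimal sketch in Python might look like this. The class and field names are assumptions made for the example, not part of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class TimelineEntry:
    when: datetime
    who: str
    action: str
    result: str = ""


@dataclass
class Timeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, who: str, action: str, result: str = "") -> TimelineEntry:
        """Append a timestamped entry (UTC, so people at every site agree on the order)."""
        entry = TimelineEntry(datetime.now(timezone.utc), who, action, result)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return the timeline as plain text, one entry per line."""
        lines = []
        for e in self.entries:
            line = f"{e.when:%Y-%m-%d %H:%M:%SZ}  {e.who}: {e.action}"
            if e.result:
                line += f" -> {e.result}"
            lines.append(line)
        return "\n".join(lines)


# Example with made-up entries.
timeline = Timeline()
timeline.record("alice", "Restarted application service", "no change")
timeline.record("bob", "Joined call as database SME")
print(timeline.render())
```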

Gather and share detailed information from the IT systems

  • Change data for the impacted systems. Recent upgrades, patches, configuration changes, etc
  • Recent incident records related to the impacted systems or impacted user
  • Log data that corresponds to the time frame of the problem. Easier said than done, I know. Use the repeatable scenario to create new error log entries, hopefully
  • Other data as the SMEs recommend
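
Pulling the log lines that match the time frame of the problem is often the fiddly part. Here is a small Python sketch of the idea, assuming each log line starts with an ISO 8601 timestamp followed by a space. That format is an assumption, as are the sample lines, which are invented for the example. Real logs will need their own parsing.

```python
from datetime import datetime, timedelta
from typing import Iterable, Iterator


def lines_in_window(
    lines: Iterable[str], start: datetime, end: datetime
) -> Iterator[str]:
    """Yield log lines whose leading timestamp falls between start and end."""
    for line in lines:
        stamp, _, _ = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            # No parseable leading timestamp (e.g. a stack trace continuation
            # line). This sketch skips such lines; you may prefer to keep them.
            continue
        if start <= when <= end:
            yield line


# Invented sample data: the scenario was reproduced at 08:00:00.
sample = [
    "2024-01-15T07:59:58 INFO request started",
    "2024-01-15T08:00:03 ERROR upstream returned 500",
    "2024-01-15T08:07:41 INFO request started",
]
reproduced_at = datetime(2024, 1, 15, 8, 0, 0)
window = timedelta(minutes=1)
hits = list(lines_in_window(sample, reproduced_at - window, reproduced_at + window))
print("\n".join(hits))
```

Noting the exact time you ran the repeatable scenario is what makes a narrow window like this possible.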

Determine and evaluate the possible solutions to the most likely causes

This is the creative part. The more potential solutions the team comes up with, the more likely it is to succeed. Experienced judgment is required to determine how long to spend on this step.

Suggested rules for the discussion:

  • Everyone needs to be able to say what they think
  • Listen to all ideas
  • Shared ideas belong to the team, which discusses them until it can either prove them valid or reject them
  • Establish a set of actionable decisions for what to do, by whom, and by when
  • Discuss negative impacts to the solutions
  • Discuss how to back out the solutions

Nothing new here. These meeting rules were made popular by Toyota.

Not all solutions are created equal. For example, restoring service might be accomplished by restoring a server or application from a last known good recovery point. But that solution could cause loss of data that might create unacceptable damage to the business. Some solutions take longer than others. If restoring service is critical, the team should consider short-term actions to restore service even if it means the problem is likely to recur.

An example: a site-to-site VPN fails. The SME knows how to immediately rebuild the tunnel, and knows that it will stay up for 8 hours and then drop again. The service is critical right now to some ongoing activity, so the plan could be to (1) restore service now, (2) automate the tunnel rebuild actions, and (3) perform detailed investigations to determine root cause and solution outside of the time constraints of a critical response. Preventing problem recurrence may be secondary to timely restoration of service.

This part of the process may require input from business stakeholders. Don’t be shy about asking for this.
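
For the VPN example above, step (2) of the plan, automating the tunnel rebuild, could be sketched roughly as follows in Python. Everything here is hypothetical: tunnel_is_up and rebuild_tunnel stand in for whatever commands or API calls your VPN platform actually provides, and the loop is bounded so the stopgap cannot quietly run forever.

```python
import time
from typing import Callable


def keep_tunnel_up(
    tunnel_is_up: Callable[[], bool],
    rebuild_tunnel: Callable[[], None],
    checks: int,
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Check the tunnel `checks` times, rebuilding it whenever it is down.

    Returns the number of rebuilds performed, which is worth recording in
    the timeline as evidence for the root cause investigation.
    """
    rebuilds = 0
    for _ in range(checks):
        if not tunnel_is_up():
            rebuild_tunnel()
            rebuilds += 1
        sleep(interval_seconds)
    return rebuilds
```

This is a stopgap, not a fix. It buys time for step (3) and should be removed once the root cause is resolved.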

Implement the first preferred solution in the action plan

Rubber meets road. A repeatable scenario is required for this step.

  • Keep up the timeline documentation
  • Verify that the problem has been corrected
  • If the problem persists determine if the corrective action should be backed out
  • Implement the next preferred solution in the plan if needed
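
The loop described above (apply, verify, back out, move to the next solution) can be sketched in a few lines of Python. This is a simplification built on assumptions: the Solution class and function names are invented for the example, and the sketch always backs out a solution that did not work, whereas in practice that is a judgment call for the team.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Solution:
    name: str
    apply: Callable[[], None]
    back_out: Callable[[], None]


def work_the_plan(
    plan: List[Solution],
    problem_persists: Callable[[], bool],
    log: Callable[[str], None] = print,
) -> Optional[str]:
    """Try each solution in order of preference.

    problem_persists should run the repeatable scenario and return True if
    the problem still happens. Returns the name of the solution that worked,
    or None if the plan is exhausted.
    """
    for solution in plan:
        log(f"Applying: {solution.name}")
        solution.apply()
        if not problem_persists():
            log(f"Resolved by: {solution.name}")
            return solution.name
        log(f"Problem persists, backing out: {solution.name}")
        solution.back_out()
    return None
```

Passing the timeline's record function as log keeps the documentation up to date as a side effect of doing the work.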

Repeat as needed

Failure is not an option.

Celebrate Success

This last bit isn’t strictly part of the troubleshooting activity but is important. Acknowledging resolution of a problem should be as formal or informal as appropriate.