Investigation process for IT incidents

It is inevitable that things will fail. And when working with IT infrastructure, it is just as inevitable that some failures will lead to complex investigations.

This happens not only because we have to manage a huge number of services and servers, but also because applications have grown more complex over time.

Supporting such an environment takes some knowledge of microservices, clouds, complex network environments, high-performance databases, hardware, and operating system and file system tuning, as well as of the applications and services in use: middleware, front end, back end, algorithms.

We need a bit of everything, so that we have a broad view of the environment we support and can direct and question the technical teams performing the analysis and investigation.

Investigation flow

We go through the following phases while investigating an incident:

Understanding the incident
Understanding the incident impacts
Gathering evidence of the incident
Investigating the correct start time of the incident
Collecting information from equipment and infrastructure
Having a meeting room dedicated to the incident
Having the right teams in the meeting room
Formulating hypotheses about the root cause (see the Ishikawa diagram)
Testing those hypotheses
Implementing the solution
Documenting all actions taken and their results
Updating the knowledge base for future use
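
Several of the phases above (finding the correct start time, tracking hypotheses, documenting actions) can be supported by a small incident record. The following is only a minimal sketch in Python, not a prescribed tool: the names Incident, Hypothesis, Status and estimate_start_time, and the idea of a fixed lookback window, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Tuple


class Status(Enum):
    OPEN = "open"            # not tested yet
    CONFIRMED = "confirmed"  # test supported the hypothesis
    REJECTED = "rejected"    # test ruled the hypothesis out


@dataclass
class Hypothesis:
    # Hypothetical structure: one candidate root cause and its test status.
    description: str
    status: Status = Status.OPEN


@dataclass
class Incident:
    # Hypothetical structure: the record the investigation team keeps updated.
    summary: str
    reported_at: datetime
    hypotheses: List[Hypothesis] = field(default_factory=list)
    actions: List[Tuple[datetime, str]] = field(default_factory=list)

    def log_action(self, when: datetime, what: str) -> None:
        """Record an action taken, so it can be documented afterwards."""
        self.actions.append((when, what))

    def open_hypotheses(self) -> List[Hypothesis]:
        """Return the hypotheses that still have to be tested."""
        return [h for h in self.hypotheses if h.status is Status.OPEN]


def estimate_start_time(
    reported_at: datetime,
    error_times: List[datetime],
    lookback: timedelta,
) -> datetime:
    """Estimate when the incident really started.

    Assumption: the earliest error seen within `lookback` before the report
    marks the start. If no such error exists, fall back to the report time.
    """
    window_start = reported_at - lookback
    candidates = [t for t in error_times if window_start <= t <= reported_at]
    return min(candidates, default=reported_at)
```

In practice the error timestamps would come from the logs and monitoring data collected during the investigation, and the lookback window is a judgment call: too short and the real start is missed, too long and unrelated old errors are picked up.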