Investigation process for IT incidents
It is inevitable that things will fail, and in IT infrastructure those failures often lead to complex investigations.
This happens not only because we have to manage a huge number of services and servers, but also because applications have become more complex over time.
Investigating them calls for some knowledge of microservices, clouds, complex network environments, high-performance databases, hardware, and operating system and file system tuning. It also calls for knowledge of the applications and services in use: middleware, front end, back end, and algorithms.
We need a bit of everything so that we have a broad view of the environment we support and can direct and question the technical teams that perform the analysis and investigation.
Investigation flow
We go through several phases while investigating an incident:
- Understanding the incident
- Understanding the incident impacts
- Gathering evidence of the incident
- Investigating the correct start time of the incident
- Collecting information from equipment and infrastructure
- Having a meeting room dedicated to the incident
- Having the right teams in the meeting room
- Formulating hypotheses about the root cause (see the Ishikawa diagram at Root causes – visual frameworks)
- Testing those hypotheses
- Implementing the solution
- Documenting all actions taken and their results
- Updating the knowledge base for future use
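
As an illustration of one phase in the list, investigating the correct start time of the incident, here is a minimal Python sketch. It assumes the collected evidence is available as (timestamp, level) pairs and treats the first burst of errors before the reported time as the likely start. The function name estimate_start_time, the record format, and the lookback, window, and threshold values are all hypothetical choices made for this example, not a prescribed method.

```python
from datetime import datetime, timedelta


def estimate_start_time(records, reported_at,
                        lookback=timedelta(hours=2),
                        window=timedelta(minutes=5),
                        threshold=3):
    """Return the first error timestamp that opens a burst of at least
    `threshold` errors within `window`, searching up to `lookback` before
    the reported time. Returns None when no such burst exists."""
    earliest = reported_at - lookback
    # Keep only error timestamps inside the search period, oldest first.
    errors = sorted(
        ts for ts, level in records
        if level == "ERROR" and earliest <= ts <= reported_at
    )
    for i, start in enumerate(errors):
        # Count errors that fall within `window` of this candidate start.
        in_window = [ts for ts in errors[i:] if ts - start <= window]
        if len(in_window) >= threshold:
            return start
    return None


# Hypothetical data: one isolated error, then a burst before the report.
reported_at = datetime(2024, 1, 10, 10, 0)
records = [
    (datetime(2024, 1, 10, 9, 0), "ERROR"),
    (datetime(2024, 1, 10, 9, 30), "INFO"),
    (datetime(2024, 1, 10, 9, 40), "ERROR"),
    (datetime(2024, 1, 10, 9, 41), "ERROR"),
    (datetime(2024, 1, 10, 9, 43), "ERROR"),
    (datetime(2024, 1, 10, 9, 58), "ERROR"),
]
print(estimate_start_time(records, reported_at))
```

With this sample data the isolated error at 09:00 is skipped and the sketch prints 2024-01-10 09:40:00, twenty minutes before the incident was reported. In a real investigation the result is only a starting hypothesis that still has to be checked against the other evidence gathered.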