secture & code

Observability in Microservices

Working with microservices brings us many advantages, so many that we sometimes forget about the challenges it can bring.

From this arises the need to have more control and visibility of how our system behaves at all times.

Why is observability so important in microservices?

We maintain consistency in our system.
We can solve problems and errors more efficiently.
We gain greater visibility and traceability of what is happening in our system.
We can obtain the status of our system in real time, at any moment.

Within observability there are 3 fundamental pillars (logs, metrics and traces), complemented by alerts:

Logs: They store information about the events produced in our services and tell us what is happening in real time. They give us detailed information about the flow of the system, which helps us identify errors.
Metrics: They store performance data for each service, such as CPU or memory usage. With them we can identify anomalies and performance problems in our system.
Traces: They show the flow of requests and responses between our services. Traces are important because they let us find the source of an error much more efficiently. We can see in which specific service something is happening (whether it is an error or not). Even if the error surfaced in another service, we can identify where it originally came from.
Alerts: They are essential in observability, since they keep us informed at all times of what is happening. They are configurable and can be adapted to our system.

Below we will see some of the possible patterns that can be applied in our system:

1. Health Check API

This first pattern is based on implementing a series of endpoints that return the current status of our service, as well as the status of the external services it depends on, such as RabbitMQ or Redis.

This is useful when our service is still running but cannot process certain operations, for example when the connection to our database has been lost.

We will have an endpoint that returns the status of our database, which we will query periodically to keep track of the state of the service. The responses of these endpoints can be displayed in a dashboard.
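As a minimal sketch of the idea, assuming each dependency can be probed with a small function that raises when it is unreachable, a health report could be assembled like this. The names `run_health_checks`, `check_database` and `check_redis` are hypothetical, and the two checks are stand-ins rather than real client calls.

```python
from typing import Callable, Dict

# Each check is a zero-argument callable that raises an exception when the
# dependency is unreachable. The names used here are illustrative only.
Check = Callable[[], None]


def run_health_checks(checks: Dict[str, Check]) -> dict:
    """Run every dependency check and build a health report."""
    dependencies = {}
    for name, check in checks.items():
        try:
            check()
            dependencies[name] = {"status": "up"}
        except Exception as exc:  # a failing check must not crash the endpoint
            dependencies[name] = {"status": "down", "error": str(exc)}

    healthy = all(d["status"] == "up" for d in dependencies.values())
    return {
        "status": "up" if healthy else "down",
        # 200 when every dependency is reachable, 503 otherwise
        "http_status": 200 if healthy else 503,
        "dependencies": dependencies,
    }


def check_database() -> None:
    # Hypothetical stand-in: a real check would run a cheap query such as SELECT 1.
    raise ConnectionError("database connection lost")


def check_redis() -> None:
    # Hypothetical stand-in: a real check would send a PING to the server.
    return None


report = run_health_checks({"database": check_database, "redis": check_redis})
print(report["status"], report["http_status"])
```

A web framework handler would typically return `report` as the response body and `report["http_status"]` as the status code, so that a monitor or dashboard polling the endpoint can tell at a glance whether the service is usable.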

2. Log Aggregation

In this case our services are responsible for generating logs (based on our needs), which are sent to and stored in a centralized logging service.

The stored logs may contain information about what is happening in our service, errors produced in it, warnings, etc.
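Centralized logging services are usually easier to query when every service emits logs in the same structured format. The sketch below uses only Python's standard `logging` module to write one JSON object per line. The field names and the service name `orders-service` are assumptions chosen for illustration, and the in-memory buffer stands in for whatever transport ships the logs to the central service.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(service: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger


buffer = io.StringIO()  # stands in for stdout or a log shipper
logger = build_logger("orders-service", buffer)
logger.warning("payment provider responded slowly")
logger.error("order %s could not be saved", "A-123")
print(buffer.getvalue())
```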

From these logs, and using the logging service, we can create alerts based on their contents. For example, we can generate an alert when X logs with a specific error message have been produced.
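Logging services normally let us configure this kind of rule directly, but the logic behind it can be sketched in a few lines. The example below assumes a sliding time window, which the text does not specify, and the class name `ErrorCountAlert` and the sample messages are hypothetical.

```python
from collections import deque
from typing import Deque


class ErrorCountAlert:
    """Fire when `threshold` matching logs arrive within `window_seconds`."""

    def __init__(self, pattern: str, threshold: int, window_seconds: float) -> None:
        self.pattern = pattern
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits: Deque[float] = deque()

    def observe(self, message: str, timestamp: float) -> bool:
        """Record one log line; return True if the alert should fire."""
        if self.pattern not in message:
            return False
        self._hits.append(timestamp)
        # Drop matches that have fallen out of the sliding window.
        while self._hits and timestamp - self._hits[0] > self.window_seconds:
            self._hits.popleft()
        return len(self._hits) >= self.threshold


alert = ErrorCountAlert("database timeout", threshold=3, window_seconds=60)
events = [
    (0, "database timeout while saving order"),
    (10, "user logged in"),
    (20, "database timeout while saving order"),
    (90, "database timeout while saving order"),
    (100, "database timeout while loading cart"),
    (110, "database timeout while loading cart"),
]
fired = [ts for ts, msg in events if alert.observe(msg, ts)]
print(fired)  # → [110]
```

The first two matches are more than 60 seconds older than the third, so they expire before the threshold is reached. The alert fires only once three matches fall inside the same window.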

Some of the logging services that we could use are Sumo Logic or Datadog.

3. Distributed tracing

With this pattern we have information about all the interactions between services, so we can obtain a detailed trace of the complete flow.

For each request that enters our system we must generate a unique ID. We send this ID in the requests we make and the events we publish, so that we can identify them as a single transaction across the different services.

With this ID we can search for the related logs in our logging service and see the whole process in a structured way.
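The ID generation and propagation described above can be sketched as follows. The header name `X-Correlation-ID` and the helper names are assumptions made for this example, and the two "services" are simulated in a single process to keep it runnable.

```python
import contextvars
import uuid
from typing import Dict

# Assumed header name; any name works as long as every service agrees on it.
CORRELATION_HEADER = "X-Correlation-ID"

# Holds the ID of the request currently being processed.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)


def start_request(incoming_headers: Dict[str, str]) -> str:
    """Reuse the caller's ID if present, otherwise create a new one."""
    correlation_id = incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    _correlation_id.set(correlation_id)
    return correlation_id


def outgoing_headers() -> Dict[str, str]:
    """Headers to attach to every downstream request or published event."""
    return {CORRELATION_HEADER: _correlation_id.get() or str(uuid.uuid4())}


def log(message: str) -> str:
    """Prefix a log line with the current ID so it can be searched later."""
    line = f"[{_correlation_id.get()}] {message}"
    print(line)
    return line


# Service A receives a request without an ID and creates one.
first_id = start_request({})
log("service A: order received")
headers_to_b = outgoing_headers()

# Service B receives the forwarded header and continues the same transaction.
second_id = start_request(headers_to_b)
log("service B: stock reserved")
```

Both log lines carry the same ID, so a search for that value in the logging service returns the full path of the transaction across services.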

4. Exception tracking

In this case the exceptions thrown from our services will be reported to an error tracking service, such as Sentry.

These exceptions will be displayed in detail in the tracking service, showing where the error was generated, the parameters received, the number of times it has occurred, etc.

Using this information, we can generate alerts for these errors through the tracking service.

For example, if we receive a 500 error when making a request to one of our endpoints, we capture this error and send an alert, either via email or Slack. In our error tracking service we can see when and where the error was generated, the environment in which it occurred, and the number of times it occurred.
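To illustrate what such a service records, here is a small in-memory stand-in written with the standard library only. It is not the Sentry API. The `ErrorTracker` class, the grouping rule (exception type plus the function that raised it) and the `get_order` handler are all hypothetical.

```python
import traceback
from datetime import datetime, timezone
from typing import Any, Dict, Optional


class ErrorTracker:
    """In-memory stand-in for an error tracking service."""

    def __init__(self, environment: str) -> None:
        self.environment = environment
        self.issues: Dict[str, Dict[str, Any]] = {}

    def capture(self, exc: BaseException, params: Optional[dict] = None) -> Dict[str, Any]:
        """Store the exception, grouping repeats under one issue."""
        frames = traceback.extract_tb(exc.__traceback__)
        # The innermost frame is where the exception was raised.
        function = frames[-1].name if frames else "unknown"
        line = frames[-1].lineno if frames else None
        fingerprint = f"{type(exc).__name__} in {function}"
        now = datetime.now(timezone.utc).isoformat()
        issue = self.issues.setdefault(
            fingerprint,
            {
                "message": str(exc),
                "function": function,
                "line": line,
                "environment": self.environment,
                "first_seen": now,
                "count": 0,
            },
        )
        issue["count"] += 1
        issue["last_seen"] = now
        issue["last_params"] = params or {}
        return issue


tracker = ErrorTracker(environment="production")


def get_order(order_id: str) -> dict:
    # Hypothetical handler that always fails, to simulate a 500 error.
    raise KeyError(f"order {order_id} not found")


def handle_request(order_id: str) -> int:
    """Return an HTTP-like status code, reporting unexpected failures."""
    try:
        get_order(order_id)
        return 200
    except Exception as exc:
        tracker.capture(exc, params={"order_id": order_id})
        return 500


statuses = [handle_request("A-1"), handle_request("A-2")]
print(statuses, tracker.issues)
```

Grouping by a fingerprint is what lets the tracker report "this error occurred N times" instead of listing every occurrence separately. An alert (email, Slack) could be triggered from `capture` when an issue is new or when its count crosses a limit.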

5. Application metrics

This pattern is in charge of collecting all the metrics of our services and gathering them in a common platform. The possible metrics collected are:

Infrastructure level: CPU, memory, disk, etc.
Application level: Latency of a request, number of requests, etc.
User level: Application loading times.

All these metrics, as with the aforementioned patterns, will be stored in a centralized system, such as New Relic or AWS CloudWatch.

As in the rest of the patterns, we can generate customized alerts from these metrics.
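As an illustration of application-level metrics, the sketch below counts requests and measures their latency in memory. It is an assumption-laden toy: the `MetricsRegistry` class and the metric names are hypothetical, and a real setup would export these values to the central platform instead of keeping them in the process.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean
from typing import Dict, List


class MetricsRegistry:
    """Collect counters and timings for one service."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = defaultdict(int)
        self.timings: Dict[str, List[float]] = defaultdict(list)

    def increment(self, name: str, value: int = 1) -> None:
        self.counters[name] += value

    @contextmanager
    def timer(self, name: str):
        """Measure how long the wrapped block takes, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)

    def snapshot(self) -> dict:
        """Summary that could be pushed to a metrics platform."""
        return {
            "counters": dict(self.counters),
            "latency_avg_seconds": {k: mean(v) for k, v in self.timings.items()},
            "latency_max_seconds": {k: max(v) for k, v in self.timings.items()},
        }


metrics = MetricsRegistry()


def handle_request(path: str) -> str:
    metrics.increment("http_requests_total")
    with metrics.timer("http_request_duration_seconds"):
        time.sleep(0.01)  # simulated work
    return "ok"


for _ in range(3):
    handle_request("/orders")

print(metrics.snapshot())
```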

6. Audit logging

We will store all user activity with our service in a database.

We will be able to know how our application is being used, in order to identify possible performance problems or even security issues.

All the information related to the user's activity will be stored, such as the user's identity, the action they performed, etc.
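A minimal sketch of such an audit table, using SQLite from the Python standard library, could look like the following. The table name, columns and helper functions are assumptions for this example, and the in-memory database is used only so that the snippet runs on its own.

```python
import json
import sqlite3
from datetime import datetime, timezone

connection = sqlite3.connect(":memory:")  # in-memory database for the example only
connection.execute(
    """
    CREATE TABLE audit_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        occurred_at TEXT NOT NULL,
        user_id TEXT NOT NULL,
        action TEXT NOT NULL,
        details TEXT NOT NULL
    )
    """
)


def record_action(user_id: str, action: str, details: dict) -> None:
    """Append one audit entry; entries are never updated or deleted."""
    connection.execute(
        "INSERT INTO audit_log (occurred_at, user_id, action, details) VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), user_id, action, json.dumps(details)),
    )
    connection.commit()


def actions_for(user_id: str) -> list:
    """Return the actions of one user in the order they happened."""
    rows = connection.execute(
        "SELECT action, details FROM audit_log WHERE user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()
    return [(action, json.loads(details)) for action, details in rows]


record_action("user-42", "login", {"ip": "203.0.113.7"})
record_action("user-42", "update_profile", {"field": "email"})
record_action("user-7", "login", {"ip": "198.51.100.3"})
print(actions_for("user-42"))
```

Keeping the log append-only is a common design choice for audit data, since its value for security reviews depends on entries not being altered after the fact.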

Conclusion

Implementing observability in our system is a long-term investment to achieve a stable and maintainable system.

It gives us a way to detect errors without needing deep knowledge of the application.
It makes it easier to document and explain any incident.
It lets us be proactive in the face of potential performance problems.

It is true that its implementation can be costly, but on balance it brings us more advantages than disadvantages.