Monitoring your microservices with openTelemetry

Where is the problem?

With the complexity and size of microservice systems, which include hundreds of small services, keeping an overview can be quite challenging. Usually one monolitic application can be scanned for potential errors, as failures in the system directly reference this one application. With microservices the area to search in could be narrowed down, but never to an extend adressing one specific service being the root cause. Therefore, to keep track of runtime problems and the fullfillment of nonfunctional reqirements corresponding to specififc microservices advanced monitoring needs emerge.

There are different proposals for collecting this telemetry data consisting of logs, traces and metrics with each proposal having their own benefits and disadvantages leading to a heterogenous landscape of these solutions special for each microservice. As these are configured on their own and little to no replacability options without nudging big change efforts are given, flexibility is limited blocking possibly better solutions. Concluding, overall and in the microservice teams there is not a central repository for looking up your microservices telemetry data but again a numerous amount of different hardly replacable solutions for different microservices making it difficult to gain a central overview e.g. via Grafana.

The value for the customer

Monitoring in microservice settings generally benefits nonfunctional requirements such as performance, availability or security through transparent insights, while enabling proactive but also reactive handling of issues. By having all the telemetry data centrally stored and providing exchangeable solutions for the different types of data, no restrictions to design decisions is made and the customer can gain a fast overview over a great amount of microservices.

Introducing OpenTelemetry

OpenTelemetry forms the combination of OpenTracing and OpenCensus collecting different telemetry data (metrics, logs, traces) for understanding the applications behavior and performance. Because of the standardization for processing telemetry data, OpenTelemetry acts as a central collector for whole application landscapes monitoring data, which is able to communicate with replacable logging-, tracing- or metrics-backends without changing configurations in the applications code.

Addressing your business

OpenTelemetry can be used in various use cases, but is best suited in the microservice context. Therefore, any business apllication relying on microservices would be a suited domain for applying an OpenTelemetry solution.

OpenTelemetry architecture

The above example shows a reference architecture of multiple hosts (environments/applications).

Traces and logs are automatically collected at each JAX-RS service. Other additionally defined metrics, traces or logs are also collected.
These applications send data directly to a otel-agent configured to use fewer resources. An otel-agent is a collector instance running with the application or on the same host as the application.
The agent then forwards the data to a collector that receives data from multiple agents. Collectors on this layer typically are allowed to use more resources and queue more data.
The collector then sends the data to the appropriate backend, in the solution provided by Jaeger, Zipkin or VictoriaMetrics. The VictoriaMetrics backend serves metrics while Jaeger and Zipkin are two alternatives for tracing.
Additionally all the telemetry data can be visualized in a tool like Grafana.

The OpenTelemetry collector is an instance that makes it possible to receive telemetry data, optionally transform it and send the data on. Receiving, transforming and sending data is done via pipelines. A pipeline consists of a set of receivers, a set of optional processors and a set of exporters.

A reveiver transfers the telemetry data into the collector, which then can be processed on a processor. Finally, an exporter can send the data to a corresponding backend/destination. Further information can be found here

List of available receivers, processors and exporters:

In addition, extensions (Health Check, Performance Profiler, zPages) are provided that can be added to the collector to extend the primary functionality of the collector. These do not require direct access to telemetry data and enable additional functionality outside the usual pipeline.

Deployment

You can find a template for an OpenTelemetry Kubernetes deployment in the GitHub repository of this article with deployment files for an OpenTelemetry agent and collector as well as Jaeger, VictoriaMetrics, and Grafana. For an example of how to use these files, see the devon4j Quarkus reference application.

Related Architectures and Alternatives

openTracing (-OpenTracing has been Archived-)
openCensus

_{Last updated 2023-11-20 10:40:30 UTC}