01 December 2024
In this tutorial I want to consolidate some practical ideas regarding OpenTelemetry and how to use it with Spring Boot.
This tutorial is composed of four sections.
By the end of the tutorial, you should be able to implement the following architecture:
As the official documentation states, OpenTelemetry is
To keep things short, monitoring is the process of collecting, processing and analyzing data to track the state of an (information) system. Observability then goes one step further: actually understanding the information that is being collected and doing something with it, like defining alerts for a given system.
To achieve both goals it is necessary to collect three dimensions of data, specifically:
Hence, when the state of a given system is altered in some way, we have an Event that correlates with, and ideally generates data across, these three dimensions.
Developers recognize from experience that monitoring and observability are important, whether to evaluate the actual state of a system or to do post-mortem analysis after a disaster. Hence, it is natural to think that observability has been implemented in various ways. For example, if we think of a system built with Java, we have at least the following collection points:
This variety in turn imposes a great amount of complexity when instrumenting our systems, because the resulting information (a) comes in different formats, (b) from technology that is difficult to implement, often with (c) solutions that are too tied to a given provider or, in the worst cases, (d) technologies that only work with certain languages/frameworks.
And that's the magic of the OpenTelemetry proposal: by creating a working group under the CNCF umbrella, the project is able to provide useful things like:
In short, OpenTelemetry is the convergence of various great monitoring ideas that overlapping software communities can implement to ease the burden of monitoring implementations.
For me, the easiest way to think about OpenTelemetry concepts is as a data pipeline; in this data pipeline you need to
As OpenTelemetry has grown in popularity, various vendors have implemented support for it. To mention a few:
Self-hosted platforms
Cloud platforms
Hence, for development purposes, it is always useful to know how to bootstrap a quick observability stack able to receive and display OpenTelemetry data.
For this purpose we will use the following elements:
And of course the OpenTelemetry Collector. This example is based on various Grafana examples, with a little bit of tweaking to demonstrate the different ways of collecting, processing and sending data to backends.
As stated previously, the OpenTelemetry Collector acts as an intermediary that receives/pulls information from data sources, processes this information and forwards it to destinations like analysis platforms or even other collectors. The collector is able to do this either with compliant workloads or via plugins that talk to the workloads using proprietary formats.
As the plugin collection can be increased or decreased, vendors have created their own distributions of the OpenTelemetry Collector. For reference, these are the ones I've used successfully in the real world:
You can find a complete list directly on the OpenTelemetry website.
For this demonstration, we will create a data pipeline using the contrib version of the reference implementation, which provides a good amount of receivers, exporters and processors. In our case, the Otel configuration is designed to:
otel-config.yml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
attributes:
actions:
- key: team
action: insert
value: vorozco
exporters:
debug:
prometheus:
endpoint: "0.0.0.0:8889"
otlphttp:
endpoint: http://loki:3100/otlp
otlp:
endpoint: tempo:4317
tls:
insecure: true
service:
extensions: [zpages]
pipelines:
metrics:
receivers: [otlp]
processors: [attributes]
exporters: [debug,prometheus]
traces:
receivers: [otlp]
exporters: [debug, otlp]
logs:
receivers: [otlp]
exporters: [debug, otlphttp]
extensions:
zpages:
endpoint: "0.0.0.0:55679"
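Before wiring the collector into the rest of the stack, it can be handy to sanity-check this file. As a quick sketch (assuming the same contrib image and file name used in the compose file shown later), something along these lines should work, since the collector binary ships a validate subcommand:
podman run --rm -v $(pwd)/otel-config.yml:/etc/otel-config.yml otel/opentelemetry-collector-contrib:latest validate --config=/etc/otel-config.yml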
Prometheus is a well-known analysis platform that, among other things, offers dimensional data and performant time-series storage.
By default it works as a metrics scraper: workloads expose an HTTP endpoint offering data in the Prometheus format. For our example, we configured the Otel collector to expose the workload metrics on port 8889 so Prometheus can scrape them.
prometheus:
  endpoint: "0.0.0.0:8889"
Then, we need to configure Prometheus to scrape the metrics from the Otel host. You will notice two targets: the one we defined for the workload metrics (8889) and another exposing the collector's own internal metrics (8888).
prometheus.yml
scrape_configs:
- job_name: "otel"
scrape_interval: 10s
static_configs:
- targets: ["otel:8889"]
- targets: ["otel:8888"]
It is worth highlighting that Prometheus also offers a way to ingest information instead of scraping it, and official support for OpenTelemetry ingestion is arriving in newer versions.
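As a rough sketch of what that looks like (assuming a Prometheus version where the OTLP receiver is available behind a feature flag), the collector could push metrics directly to Prometheus instead of being scraped:
# Prometheus must be started with the feature flag, e.g.:
#   prometheus --enable-feature=otlp-write-receiver --config.file=/etc/prometheus/prometheus.yml
# Then an additional exporter in otel-config.yml can target the OTLP write endpoint:
otlphttp/prometheus:
  endpoint: http://prometheus:9090/api/v1/otlp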
As described on its website, Loki is a solution specifically for log aggregation, heavily inspired by Prometheus, with the particular design decision of NOT parsing or indexing the log contents in any way, leaving that responsibility to the query system.
For local environments, the project offers a configuration that covers most development purposes. The following configuration is an adaptation that preserves the bare minimum to work with temporary files and in-memory storage.
loki.yaml
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
common:
instance_addr: 127.0.0.1
path_prefix: /tmp/loki
storage:
filesystem:
chunks_directory: /tmp/loki/chunks
rules_directory: /tmp/loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
query_range:
results_cache:
cache:
embedded_cache:
enabled: true
max_size_mb: 100
schema_config:
configs:
- from: 2020-10-24
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
ruler:
alertmanager_url: http://localhost:9093
limits_config:
allow_structured_metadata: true
Then, we configure an exporter to deliver the data to the loki host using the otlphttp format.
otlphttp:
endpoint: http://loki:3100/otlp
In a similar fashion to Loki, Tempo is an open source project created by Grafana that aims to provide a distributed tracing backend. On a personal note, besides performance, it shines for me because it is compatible not only with OpenTelemetry; it can also ingest data in the Zipkin and Jaeger formats.
For local environments, the project offers a configuration that covers most development purposes. The following configuration is an adaptation that removes the metrics generation and simplifies the setup; however, with this we lose the service graph feature.
tempo.yaml
stream_over_http_enabled: true
server:
http_listen_port: 3200
log_level: info
query_frontend:
search:
duration_slo: 5s
throughput_bytes_slo: 1.073741824e+09
metadata_slo:
duration_slo: 5s
throughput_bytes_slo: 1.073741824e+09
trace_by_id:
duration_slo: 5s
distributor:
receivers:
otlp:
protocols:
http:
grpc:
ingester:
max_block_duration: 5m # cut the headblock when this much time passes. this is being set for demo purposes and should probably be left alone normally
compactor:
compaction:
block_retention: 1h # overall Tempo trace retention. set for demo purposes
storage:
trace:
backend: local # backend configuration to use
wal:
path: /var/tempo/wal # where to store the wal locally
local:
path: /var/tempo/blocks
Then, we configure an exporter to deliver the data to the Tempo host using the otlp/grpc format.
otlp:
endpoint: tempo:4317
tls:
insecure: true
Loki, Tempo and (to some extent) Prometheus are data stores, but we still need to show this data to the user. Here is where Grafana enters the scene.
Grafana offers a good selection of analysis tools, plugins, dashboards, alarms, connectors and a great community that empowers observability. Besides great compatibility with Prometheus, it of course integrates perfectly with the rest of Grafana's offerings.
To configure Grafana you just need to plug in compatible datasources; the rest of the work happens in the web UI.
grafana.yml
apiVersion: 1
datasources:
- name: Otel-Grafana-Example
type: prometheus
url: http://prometheus:9090
editable: true
- name: Loki
type: loki
access: proxy
orgId: 1
url: http://loki:3100
basicAuth: false
isDefault: true
version: 1
editable: false
- name: Tempo
type: tempo
access: proxy
orgId: 1
url: http://tempo:3200
basicAuth: false
version: 1
editable: false
uid: tempo
At this point you may have noticed that I've referred to the backends using single names; this is because I intend to set these names using a Podman Compose deployment.
otel-compose.yml
version: '3'
services:
otel:
container_name: otel
image: otel/opentelemetry-collector-contrib:latest
command: [--config=/etc/otel-config.yml]
volumes:
- ./otel-config.yml:/etc/otel-config.yml
ports:
- "4318:4318"
- "4317:4317"
- "55679:55679"
prometheus:
container_name: prometheus
image: prom/prometheus
command: [--config.file=/etc/prometheus/prometheus.yml]
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9091:9090"
grafana:
container_name: grafana
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
image: grafana/grafana
volumes:
- ./grafana.yml:/etc/grafana/provisioning/datasources/default.yml
ports:
- "3000:3000"
loki:
container_name: loki
image: grafana/loki:3.2.0
command: -config.file=/etc/loki/local-config.yaml
volumes:
- ./loki.yaml:/etc/loki/local-config.yaml
ports:
- "3100"
tempo:
container_name: tempo
image: grafana/tempo:latest
command: [ "-config.file=/etc/tempo.yaml" ]
volumes:
- ./tempo.yaml:/etc/tempo.yaml
ports:
- "4317" # otlp grpc
- "4318"
At this point the compose description is pretty much self-explanatory, but I would like to highlight some things:
Once the configuration is ready, you can launch the whole stack using the compose file:
cd podman
podman compose -f otel-compose.yml up
If the configuration is ok, you should have five containers running without errors.
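A quick way to double-check is listing the containers by the names declared in the compose file:
podman ps --format "{{.Names}} {{.Status}}"
# expected: otel, prometheus, grafana, loki and tempo, all of them Up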
As part of my daily activities I was in charge of a major implementation of all these concepts. Hence, it was natural for me to create a proof of concept, which you can find on my GitHub.
For demonstration purposes we have two services with different HTTP endpoints:
springboot-demo:8080
- Useful to demonstrate local and database tracing, performance, logs and OpenTelemetry instrumentation
  - /books - A books CRUD using Spring Data
  - /fibo - A naive Fibonacci implementation that generates CPU load and delays
  - /log - Generates log messages using the different SLF4J levels
springboot-client-demo:8081
- Useful to demonstrate tracing capabilities, Micrometer instrumentation and Micrometer Tracing instrumentation
  - /trace-demo - A quick OpenFeign client that invokes the books GetAll endpoint
Given the popularity of OpenTelemetry, developers can also expect multiple instrumentation options.
First of all, the OpenTelemetry project offers a framework-agnostic instrumentation that uses bytecode manipulation; for this instrumentation to work you need to attach a Java agent to the JVM (via the -javaagent flag). In my experience this instrumentation is preferred if you don't control the workload or if your platform does not offer OpenTelemetry support at all.
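As a minimal sketch of how the agent is attached (the jar names, service name and endpoint below are placeholders, not part of the POC):
# Download opentelemetry-javaagent.jar once, then attach it to any JVM workload
java -javaagent:./opentelemetry-javaagent.jar \
     -Dotel.service.name=my-service \
     -Dotel.exporter.otlp.endpoint=http://localhost:4318 \
     -jar my-service.jar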
However, instrumentation of workloads can become really specific (e.g. instrumentation of a database pool given a particular IoC mechanism). For this, the Java world provides a good ecosystem, for example:
And of course Spring Boot.
Spring Boot is a special case with TWO major instrumentation options
Both options use Spring concepts like decorators and interceptors to capture and send information to the destinations. The only rule is to create the clients/services/objects in the Spring way (hence via Spring IoC).
I've used both successfully and my heavily opinionated conclusion is the following:
As these libraries are mutually exclusive, if the decision were mine, I would pick OpenTelemetry's Spring Boot starter. It offers logs support out of the box and also a bridge for Micrometer metrics.
As always, it is also good to consider the official documentation.
Otel instrumentation with the Spring Boot starter is activated in three steps:
<dependencyManagement>
<dependencies>
<dependency>
<groupId>io.opentelemetry.instrumentation</groupId>
<artifactId>opentelemetry-instrumentation-bom</artifactId>
<version>2.10.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
...
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.opentelemetry.instrumentation</groupId>
<artifactId>opentelemetry-spring-boot-starter</artifactId>
</dependency>
There is a set of optional libraries and adapters that you can configure if your workloads have already diverged from the "Spring way".
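As a purely illustrative sketch of custom instrumentation (it assumes you also add the opentelemetry-instrumentation-annotations dependency, and the service/method names are hypothetical), the starter also honors the @WithSpan annotation to create spans around methods of Spring-managed beans:
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;
import org.springframework.stereotype.Service;

@Service
public class FiboService {

    // Creates a span named "fibo-calculation" when invoked through the Spring proxy,
    // recording "n" as a span attribute
    @WithSpan("fibo-calculation")
    public long fibo(@SpanAttribute("n") int n) {
        return (n <= 1) ? n : fibo(n - 1) + fibo(n - 2);
    }
}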
You need to activate (or not) each dimension of observability (metrics, traces and logs). You can also fine-tune the exporting parameters like ports, URLs or export intervals, either by using Spring properties or environment variables:
#Configure exporters
otel.logs.exporter=otlp
otel.metrics.exporter=otlp
otel.traces.exporter=otlp
#Configure metrics generation
otel.metric.export.interval=5000 #Export metrics every five seconds
otel.instrumentation.micrometer.enabled=true #Enable Micrometer metrics bridge
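The same configuration can be expressed with environment variables, since the SDK autoconfiguration maps otel.* properties to upper-case, underscore-separated names; a minimal sketch:
export OTEL_LOGS_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=otlp
export OTEL_TRACES_EXPORTER=otlp
export OTEL_METRIC_EXPORT_INTERVAL=5000
export OTEL_INSTRUMENTATION_MICROMETER_ENABLED=true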
Again, this instrumentation does not support logs exporting. Also, it is a good idea to check the latest documentation for Micrometer and Micrometer Tracing.
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-otlp</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
management.otlp.tracing.endpoint=http://localhost:4318/v1/traces
management.otlp.tracing.timeout=10s
management.tracing.sampling.probability=1
management.otlp.metrics.export.url=http://localhost:4318/v1/metrics
management.otlp.metrics.export.step=5s
management.opentelemetry.resource-attributes."service-name"=${spring.application.name}
The POC provides the following structure
├── podman # Podman compose config files
├── springboot-client-demo #Spring Boot Client instrumented with Actuator, Micrometer and MicroMeter tracing
└── springboot-demo #Spring Boot service instrumented with OpenTelemetry Spring Boot Starter
cd podman
podman compose -f otel-compose.yml up
This will provide you with an instance of Grafana on port 3000.
Then, it is time to boot the first service! You only need Java 21 on the active shell:
cd springboot-demo
mvn spring-boot:run
If the workload is properly configured, you will see output like the following on the OpenTelemetry container's standard output, which basically confirms that you are successfully reporting data:
[otel] | 2024-12-01T22:10:07.730Z info Logs {"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 24}
[otel] | 2024-12-01T22:10:10.671Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 64, "data points": 90}
[otel] | 2024-12-01T22:10:10.672Z info Traces {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 5}
[otel] | 2024-12-01T22:10:15.691Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 65, "data points": 93}
[otel] | 2024-12-01T22:10:15.833Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 65, "data points": 93}
[otel] | 2024-12-01T22:10:15.835Z info Logs {"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 5}
The data is being reported over the OpenTelemetry ports (4317 and 4318), which are mapped from Podman to the host. By default the telemetry libraries report to localhost, but this can be configured for other scenarios like FaaS or Kubernetes.
Also, you can verify the reporting status in zPages:
Finally, let's do the same with the Spring Boot client:
cd springboot-client-demo
mvn spring-boot:run
As described in the previous section, I created a set of interactions to:
Generate CPU load using the naive Fibonacci implementation
curl http://localhost:8080/fibo\?n\=45
Generate logs at different levels
curl http://localhost:8080/log
Persist data using a CRUD
curl -X POST --location "http://localhost:8080/books" \
-H "Content-Type: application/json" \
-d '{
"author": "Miguel Angel Asturias",
"title": "El señor presidente",
"isbn": "978-84-376-0494-7",
"publisher": "Editorial planeta"
}'
And then retrieve the data using a secondary service
curl http://localhost:8081/trace-demo
This asciicast shows the interaction:
Once the data is accessible by Grafana, what to do with it is up to you. Again, you could:
The quickest way to verify whether the data is reported correctly is to check directly in Grafana Explore.
First, we can check some metrics like system_cpu_usage and filter by service name. In this case I used springboot-demo, which has the CPU demo using naive Fibonacci. I can even filter by my own tag (which was added by the Otel attributes processor):
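For reference, a PromQL query along these lines should return the data (assuming the attribute inserted by the processor is exposed as the team label, which is the default mapping):
system_cpu_usage{team="vorozco"}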
In the same way, logs are already stored in Loki:
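A sample LogQL query to pull them back (assuming Loki's default OTLP mapping, where the service.name resource attribute is promoted to the service_name label):
{service_name="springboot-demo"}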
Finally, we could check the whole trace, including both services and the interaction with the H2 RDBMS:
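The equivalent search with TraceQL from the Tempo datasource would look like this (again assuming the demo's service name):
{ resource.service.name = "springboot-demo" }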