
Prometheus Monitoring Guide Part 2 - PromQL and Recording Rules

PromQL

Prometheus provides a functional query language called PromQL (Prometheus Query Language) that lets the user select and aggregate time series data in real time. The result of an expression can either be shown as a graph, viewed as tabular data in Prometheus’s expression browser, or consumed by external systems via the HTTP API.

Data Types

An expression or sub-expression can evaluate to one of four types:

  • Instant vector - a set of time series containing a single sample for each time series, all sharing the same timestamp (prometheus_http_requests_total)
  • Range vector - a set of time series containing a range of data points over time for each time series (prometheus_http_requests_total[5m])
  • Scalar - a simple numeric floating point value
  • String - a simple string value (currently unused)

Depending on the use-case (e.g. when graphing vs. displaying the output of an expression), only some of these types are legal as the result from a user-specified expression. For example, an expression that returns an instant vector is the only type that can be directly graphed.

Selectors and Matchers

In the simplest form, only a metric name is specified. This results in an instant vector containing elements for all time series that have this metric name:

http_requests_total

It is possible to filter these time series further by appending a comma separated list of label matchers in curly braces ({}).

This example selects only those time series with the http_requests_total metric name that also have the job label set to prometheus and their group label set to canary:

http_requests_total{job="prometheus",group="canary"}

The following label matching operators exist:

  • = - select labels that are exactly equal to the provided string
  • != - select labels that are not equal to the provided string
  • =~ - select labels that regex-match the provided string
  • !~ - select labels that do not regex-match the provided string
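
Multiple matchers can be combined in a single selector. As an illustration (the environment and method labels here are just examples, not from the sections above), the following selects staging, testing and development series while excluding GET requests:

http_requests_total{environment=~"staging|testing|development", method!="GET"}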

Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant. A time duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element.

In this example, we select all the values we have recorded within the last 5 minutes for all time series that have the metric name http_requests_total and a job label set to prometheus:

http_requests_total{job="prometheus"}[5m]

Operators

Prometheus’s query language supports basic logical and arithmetic operators. For operations between two instant vectors, the matching behavior can be modified.

  • binary arithmetic operators are defined between scalar/scalar, vector/scalar, and vector/vector value pairs (+, -, *, /, %, ^)
  • comparison operators are defined between scalar/scalar, vector/scalar, and vector/vector value pairs. By default they filter. Their behaviour can be modified by providing bool after the operator, which will return 0 or 1 for the value rather than filtering (==, !=, >, <, >=, <=)
  • operations between vectors attempt to find a matching element in the right-hand side vector for each entry in the left-hand side
    • by default Prometheus matches elements in both vectors that have exactly the same set of labels; the ignoring keyword can exclude certain labels from matching, while on restricts matching to the given labels
    • method_code:http_errors:rate5m{code="500"} / ignoring(code) method:http_requests:rate5m
  • aggregation operators can be used to aggregate the elements of a single instant vector, resulting in a new vector of fewer elements with aggregated values (sum, min, max, avg, count, topk, quantile)
    • if the metric http_requests_total had time series that fan out by application, instance, and group labels, we could calculate the total number of seen HTTP requests per application and group over all instances via: sum without (instance) (http_requests_total)
  • rate calculates the per-second average rate of increase over the given time period (takes in a range vector and outputs an instant vector)
  • https://prometheus.io/docs/prometheus/latest/querying/functions
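
As a quick illustration of the bool modifier, using the built-in up metric, the following returns a 1 or 0 per target rather than filtering out series, which makes it easy to count down targets:

# 1 if the target is down, 0 if it is up (no series are filtered out)
up == bool 0

# number of targets currently down
sum(up == bool 0)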

Examples

Return all time series with the metric http_requests_total:

http_requests_total

Return all time series with the metric http_requests_total and the given job and handler labels:

http_requests_total{job="apiserver", handler="/api/comments"}

Return a whole range of time (in this case 5 minutes) for the same vector, making it a range vector (not graphable):

http_requests_total{job="apiserver", handler="/api/comments"}[5m]

Return the 5-minute rate of the http_requests_total metric for the past 30 minutes, with a resolution of 1 minute:

rate(http_requests_total[5m])[30m:1m]

Return sum of 5-minute rate over all instances by job name:

sum by (job) (
  rate(http_requests_total[5m])
)

Return the unused memory in MiB for every instance:

If we have two different metrics with the same dimensional labels, we can apply binary operators to them and elements on both sides with the same label set will get matched and propagated to the output:

(instance_memory_limit_bytes - instance_memory_usage_bytes) / 1024 / 1024

The same expression, but summed by application, could be written like this:

sum by (app, proc) (
  instance_memory_limit_bytes - instance_memory_usage_bytes
) / 1024 / 1024

Return the top 3 CPU users grouped by application (app) and process type (proc):

topk(3, sum by (app, proc) (rate(instance_cpu_time_ns[5m])))

Return the count of the total number of running instances per application:

count by (app) (instance_cpu_time_ns)

Recording Rules

Prometheus supports two types of rules which may be configured and then evaluated at regular intervals: recording rules and alerting rules. To include rules in Prometheus, create a file containing the necessary rule statements and have Prometheus load the file via the rule_files field in the config.

  • recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series
  • querying the precomputed result will then often be much faster than executing the original expression every time it is needed
  • this is especially useful for dashboards which need to query the same expression repeatedly every time they refresh

Recording and alerting rules exist in a rule group. Rules within a group are run sequentially at a regular interval, with the same evaluation time. The names of recording rules must be valid metric names. The names of alerting rules must be valid label values.

Rule Definitions

  • Recording rules should be of the general form level:metric:operation
    • level = the aggregation level of the metric and labels of the rule output
    • metric = the metric name under evaluation
    • operation = list of operations applied to the metric under evaluation
# rules/myrules.yml
groups:
  - name: example # The name of the group. Must be unique within a file.
    rules:
      # The name of the time series to output to. Must be a valid metric name.
      - record: job:http_inprogress_requests:sum
        # The PromQL expression to evaluate. Every evaluation cycle this is
        # evaluated at the current time, and the result recorded as a new set of
        # time series with the metric name as given by 'record'.
        expr: sum by (job) (http_inprogress_requests)

The rule file paths need to be added to the main Prometheus config in order for the rules to be evaluated periodically, as defined by evaluation_interval:

rule_files:
  - "rules/myrules.yml"

Checking Rule Syntax

To quickly check whether a rule file is syntactically correct without starting a Prometheus server, you can use Prometheus’s promtool command-line utility tool:

promtool check rules /path/to/example.rules.yml
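
The same tool can also validate the main configuration file (which in turn checks any rule files it references):

promtool check config prometheus.yml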

HTTP API

The API provides endpoints for running instant and range queries, and for viewing targets, rules, configuration and more:

  • localhost:9090/api/v1/query?query=up
  • localhost:9090/api/v1/query?query=http_requests_total[1m]
  • localhost:9090/api/v1/targets?state=active
  • localhost:9090/api/v1/rules?type=alert
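
For example, assuming a local Prometheus on the default port, an instant query and a range query can be issued with curl (the timestamps below are purely illustrative):

curl 'http://localhost:9090/api/v1/query?query=up'

curl 'http://localhost:9090/api/v1/query_range?query=rate(http_requests_total[5m])&start=2021-01-01T00:00:00Z&end=2021-01-01T01:00:00Z&step=15s'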

https://prometheus.io/docs/prometheus/latest/querying/api/


Prometheus Monitoring Guide Part 1 - Installation and Instrumentation

An open-source systems monitoring and alerting toolkit. Now a standalone open source project and maintained independently of any company. Analyse how your applications and infrastructure are performing based on the metrics they publish. Particularly suitable for large distributed systems with ephemeral instances.

  • metrics are stored in a multi-dimensional data model instead of hierarchical, where each measurement consists of a name and a number of key/value pairs
    • http.server.requests.count(uri="/endpoint", method="GET") 12 instead of http.server.requests.GET.endpoint=12
    • backed by its own custom time-series database built specifically for metrics
    • provides its own query language PromQL as a read-only and flexible query language for aggregation on time series data
  • no reliance on distributed storage; single server nodes are autonomous
  • time series collection happens via a pull model over HTTP
  • targets are discovered via service discovery or static configuration
  • multiple modes of graphing and dashboarding support
  • Alternative to Graphite, InfluxDB
  • Ecosystem provides a number of pre-built exporters that expose metrics ready for Prometheus to scrape

Architecture

Prometheus Architecture

  • the main Prometheus server consists of a time series database storing each of the captured measurements
    • alongside a scraper which pulls samples from external applications, hosts and platforms
    • and an HTTP server which allows operations to be performed on the tsdb (e.g. querying via PromQL)
  • Prometheus is a single-instance component; all data is stored on local node storage
    • if you need to scale, the recommendation is to spin up multiple separate Prometheus instances with different/replicated targets
  • operates in a pull model, whereby Prometheus is set up to periodically scrape the metrics from all target application instances
    • it therefore has to know about the location of all active instances via service discovery
    • this makes it easier to tell if a target is down, and you can manually visit a target and inspect its health with a web browser
    • the application itself has no knowledge of Prometheus apart from an endpoint exposing the latest metrics snapshot (/metrics)
  • the pull model can be problematic for short-lived/batch operations which may not be alive long enough to be scraped
    • the Pushgateway component can be used as a middle-man - it receives pushed metrics from jobs and exposes them for Prometheus to scrape (see the example after this list)
  • service discovery integration into Kubernetes, AWS, Azure to understand the current landscape of all target nodes
  • after metrics are scraped and stored in the tsdb, they can be made available to users through the web UI/Grafana/API
  • the AlertManager component can be used with a set of rules querying the metrics to generate system alerts
    • performs de-duplication of alerts, throttling, silences etc.
    • forwards to email, PagerDuty, Slack etc.
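
As a quick sketch of the Pushgateway flow mentioned above (assuming a Pushgateway running locally on its default port 9091, with the metric and job names purely illustrative), a batch job can push a metric before exiting and Prometheus then scrapes it from the gateway:

echo "batch_job_duration_seconds 45.2" | curl --data-binary @- http://localhost:9091/metrics/job/nightly_batch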

Installation

  • provided as a single Go binary from https://prometheus.io/download/ so can be executed directly
    • ./prometheus
    • by default looks for prometheus.yml config file in the same directory (by default will only scrape Prometheus itself)
  • Prometheus Web UI available at http://localhost:9090
    • allows you to view current targets, configuration and run queries. For more complex visualizations use Grafana
  • or can be executed through Docker
    • docker run -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

Configuration

  • all Prometheus config provided in prometheus.yml file
  • Prometheus can reload its configuration at runtime. If the new configuration is not well-formed, the changes will not be applied
    • reload is triggered by sending a SIGHUP to the Prometheus process (kill -HUP <pid>)
    • or sending an HTTP POST request to the /-/reload endpoint when the --web.enable-lifecycle flag is enabled
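
For example, assuming the lifecycle endpoint has been enabled, a reload can be triggered with:

curl -X POST http://localhost:9090/-/reload

A minimal prometheus.yml covering the global settings and a single scrape job looks like the following: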
global: # applied to all targets unless overridden
  scrape_interval: 15s # how often to scrape each target
  evaluation_interval: 15s # how often to evaluate predefined rules

scrape_configs: # a set of jobs denoting where to scrape metrics from
  - job_name: "prometheus" # a group of targets, also added as a label to each measurement
    metrics_path: '/metrics' # default
    static_configs: # a hardcoded host/port or could be service discovery
      - targets: ["localhost:9090"] # uses metrics path to pull metrics, by default http

Service Discovery

  • static_configs does not scale to more dynamic environments where instances are added/removed frequently
  • Prometheus can integrate with service discovery mechanisms to automatically update its view of running instances
    • when new instances are added Prometheus will begin scraping, when lost from discovery the time series will also be removed
    • built-in integrations with Consul, Azure, AWS or file based if custom mechanism required
  • JSON/YAML file can be published by the platform specifying all targets to scrape from. Prometheus uses it to automatically update targets

Prometheus Service Discovery

- job_name: 'myapp'
  file_sd_configs:
    - files:
        - 'file_sd.json'
      refresh_interval: 5m # default

Where the published file file_sd.json contains all the current live targets and any attached labels:

[
  {
    "targets": [
      "<host>"
    ],
    "labels": {
      "<labelname>": "<labelvalue>"
    }
  }
]

Relabelling

Prometheus needs to know what to scrape, and that’s where service discovery and relabel_configs come in. Relabel configs allow you to select which targets you want scraped, and what the target labels will be. So if you want to say scrape this type of machine but not that one, use relabel_configs.

metric_relabel_configs by contrast are applied after the scrape has happened, but before the data is ingested by the storage system. So if there are some expensive metrics you want to drop, or labels coming from the scrape itself (e.g. from the /metrics page) that you want to manipulate that’s where metric_relabel_configs applies.

So as a simple rule of thumb: relabel_configs happens before the scrape, metric_relabel_configs happens after the scrape.

  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app]
    regex: nginx
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    regex: web
    action: keep
  - source_labels: [__meta_ec2_public_ip] # rewrite private ip to public from service discovery
    regex: '(.*)'
    replacement: '${1}:9100'
    target_label: __address__    

You can perform the following action operations:

  • keep: Keep a matched target or series, drop all others
  • drop: Drop a matched target or series, keep all others
  • replace: Replace or rename a matched label with a new one defined by the target_label and replacement parameters
  • labelkeep: Match the regex against all label names, drop all labels that don’t match (ignores source_labels and applies to all label names)
  • labeldrop: Match the regex against all label names, drop all labels that match (ignores source_labels and applies to all label names)

metric_relabel_configs can be used to drop unnecessary time-series before ingestion:

- job_name: cadvisor
  metric_relabel_configs:
  - source_labels: [container_label_JenkinsId]
    regex: '.+'
    action: drop
  - source_labels: [__name__]
    regex: '(container_tasks_state|container_memory_failures_total)'
    action: drop

https://grafana.com/docs/grafana-cloud/billing-and-usage/prometheus/usage-reduction/

Instrumentation

  • there are two ways in which application metrics can be exposed for Prometheus
    • use a client library directly in the application to create and expose a Prometheus endpoint (usually /metrics)
    • use an intermediate proxy exporter instance instrumenting the target application and converting to the Prometheus metrics format

The endpoint returns metrics in the Prometheus text exposition format, for example:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

Conventions and Practices

  • metric names should start with a letter and can be followed by any number of letters, numbers and underscores
  • metrics must have unique names, client libraries should report an error if you try to register the same one twice
  • should have a suffix describing the unit, in plural form (e.g. _bytes or _total)
  • should represent the same logical thing-being-measured across all label dimensions
  • every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values

Exporters

There are a number of libraries and servers which help in exporting existing metrics from third-party systems as Prometheus metrics. This is useful for cases where it is not feasible to instrument a given system with Prometheus metrics directly (e.g. the Linux kernel), as the source cannot be modified.

  • https://prometheus.io/docs/instrumenting/exporters/
  • an exporter is a separate process dedicated entirely to pulling metrics from a target system and exposing them as Prometheus metrics
    • “proxy service” converting the target interface into one that can be scraped by Prometheus
  • Common exporters (some official):
    • Node Exporter - hardware and OS metrics exposed by Unix kernels, CPU load, memory, I/O, network
    • MySQL Exporter - database metrics, queries run, timings, pool sizes
    • Blackbox Exporter - probing of endpoints over HTTP, DNS, TCP, ICMP
    • Kafka, Kafka Lag, Nginx, Postgres, Jenkins, AWS, Graphite, JMX
  • for cases in which you have access to modify the application code, instrumentation must be added in order to expose Prometheus metrics
    • can use existing application frameworks to expose default common metrics (Spring Actuator)
    • use the client libraries to add custom metrics to be exposed (Go, Java, Python, Ruby) - see the sketch after this list
  • other metrics libraries offer a facade over the definition of metrics and allow pluggable Prometheus exporter to be added instead (Micrometer)
    • don’t have to use Prometheus client library directly for increased flexibility in overall monitoring solution
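
For illustration, here is a minimal sketch (not from the original post) of direct instrumentation with the official Prometheus Java client, assuming the simpleclient and simpleclient_httpserver dependencies are available; the metric and class names are made up:

import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class InstrumentedApp {

    // a custom counter, registered against the default collector registry
    static final Counter requestsTotal = Counter.build()
            .name("myapp_requests_total")
            .help("Total number of requests handled.")
            .labelNames("status")
            .register();

    public static void main(String[] args) throws Exception {
        // expose all registered metrics at http://localhost:8000/metrics for Prometheus to scrape
        HTTPServer server = new HTTPServer(8000);

        // increment the counter as work is processed
        requestsTotal.labels("200").inc();
    }
}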

Metric Types

The client libraries offer four core metric types. These are currently only differentiated in the client libraries (to enable APIs tailored to the usage of the specific types) and in the wire protocol. The Prometheus server does not yet make use of the type information and flattens all data into untyped time series.

Counter

  • a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart
  • for example, you can use a counter to represent the number of requests served, tasks completed, or errors
  • do not use a counter to expose a value that can decrease. For example, do not use a counter for the number of currently running processes; instead use a gauge

Gauge

  • a metric that represents a single numerical value that can arbitrarily go up and down
  • gauges are typically used for measured values like temperatures or current memory usage, but also “counts” that can go up and down, like the number of concurrent requests

Histogram

  • samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values
  • a histogram with a base metric name of <basename> exposes multiple time series during a scrape:
    • cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
    • the total sum of all observed values, exposed as <basename>_sum
    • the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)
  • use the histogram_quantile() function to calculate quantiles from histograms or even aggregations of histograms across instances
  • when operating on buckets, remember that the histogram is cumulative
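
For example, to calculate the 95th percentile request duration over the last 5 minutes, aggregated across all instances (using the http_request_duration_seconds histogram from the exposition format example above):

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))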

Summary

  • similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window
  • a summary with a base metric name of <basename> exposes multiple time series during a scrape:
    • streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as <basename>{quantile="<φ>"}
    • the total sum of all observed values, exposed as <basename>_sum
    • the count of events that have been observed, exposed as <basename>_count
  • https://prometheus.io/docs/practices/histograms/
    • if you need to aggregate, choose histograms.
    • otherwise, choose a histogram if you have an idea of the range and distribution of values that will be observed. Choose a summary if you need an accurate quantile, no matter what the range and distribution of the values is.

How to Dynamically Change Log Levels at Runtime with Spring Boot

As we can all probably agree, logging is a vital part of any application. It provides visibility into what our component is doing at any point in time and is an invaluable tool when trying to debug issues. When faced with a problem in production, likely one of the first things that will be done is to check the logs for any errors and to hopefully locate the source of the issue. It’s therefore vital that our logs are not only relevant and useful (perhaps a topic for another day), but are also visible and accessible when we need them the most. Hopefully we will be lucky and the logs will include all the necessary details to pinpoint the issue - or maybe they won’t. This may be decided by the level at which the loggers are configured.

By default, most if not all of our loggers will probably be set to INFO level. This is fine when everything is working correctly - including only the key operations/tasks and not creating too much noise as to overload the logging tools - but perhaps not so great when you need to work an issue. Although the developers may have included some basic INFO level log output within the problematic area, this may not include enough data to properly trace through an erroneous transaction. The same developer may have also added some additional DEBUG log lines (hopefully, if not they should do) that give us some additional details into what was being done, but these may not be visible to us when the application is deployed within an environment.

I bet most people reading this can remember that instance in which they were debugging an issue, were able to focus down on a particular segment of code, only to find that the log line that gave them that critical piece of information was set to DEBUG level and so was unavailable to them. We may have 6 different log levels available, but without finer control over their configuration at runtime, we may as well only have half that.

Spring Boot Logging Properties

Log configuration within Spring Boot applications takes a number of different forms these days, but in this post we’ll focus just on the main ones (excluding specific logback.xml modifications etc). A lot of the same configuration changes you would previously have made in those dedicated files can now also be set as simple application properties. For example, to change the global root log level to DEBUG, we can just add the following to application.properties (or YAML equivalent):

logging.level.root=DEBUG
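
The equivalent in an application.yml file would be:

logging:
  level:
    root: DEBUG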

This however has similar problems to the conventional logback.xml approach - to update these configuration settings you would have to rebuild and redeploy your entire application - not something that’s plausible in production. The difference here is that these are all just standard Spring Boot properties, so in much the same way as you would elsewhere, you can set them as environment variables or pass them in as JVM options:

-Dlogging.level.root=DEBUG

This gets us somewhat closer to where we want to be - we can now control the log level without redeploying - but it also introduces another problem. This enables DEBUG log output for the entire application. If you’ve ever done this before, even in a basic Spring application, you will know that the output is extremely noisy:

  • At default INFO level a simple Spring Boot app generates 14 lines of log output at startup
  • At DEBUG level the same application generates over 2440 log lines - just at startup!
DEBUG 6852 --- [           main] o.s.b.f.s.DefaultListableBeanFactory     : Creating shared instance of singleton bean 'org.springframework.boot.autoconfigure.web.servlet.MultipartAutoConfiguration'
DEBUG 6852 --- [           main] o.s.b.f.s.DefaultListableBeanFactory     : Creating shared instance of singleton bean 'spring.servlet.multipart-org.springframework.boot.autoconfigure.web.servlet.MultipartProperties'
2021-01-30 17:36:25.708 DEBUG 6852 --- [           main] o.s.b.f.s.DefaultListableBeanFactory     : Autowiring by type from bean name 'org.springframework.boot.autoconfigure.web.servlet.MultipartAutoConfiguration' via constructor to bean named 'spring.servlet.multipart-org.springframework.boot.autoconfigure.web.servlet.MultipartProperties'
DEBUG 6852 --- [1)-192.168.1.22] sun.rmi.transport.tcp                    : RMI TCP Connection(1)-192.168.1.22: (port 65235) op = 80
DEBUG 6852 --- [1)-192.168.1.22] javax.management.remote.rmi              : [javax.management.remote.rmi.RMIConnectionImpl@3440316d: connectionId=rmi://192.168.1.22  1] closing.
DEBUG 6852 --- [1)-192.168.1.22] javax.management.remote.rmi              : [javax.management.remote.rmi.RMIConnectionImpl@3440316d: connectionId=rmi://192.168.1.22  1] closed.
DEBUG 6852 --- [           main] o.s.b.f.s.DefaultListableBeanFactory     : Autowiring by type from bean name 'errorPageCustomizer' via factory method to bean named 'dispatcherServletRegistration'
DEBUG 6852 --- [           main] o.apache.tomcat.util.IntrospectionUtils  : IntrospectionUtils: setProperty(class org.apache.coyote.http11.Http11NioProtocol port=8080)

If you look at some of the output above you will notice that most of it is just noise and won’t help you resolve any real issues. In fact, it may hinder you when trying to search through all the volume just to find those few useful lines.

In reality setting the root log level to DEBUG is not something you really want to do unless you feel some need to closely inspect the Spring startup sequence. It belongs at the default INFO level where the frameworks and other libraries won’t overwhelm you with data. What you probably actually want is to set the new log level only for certain targeted areas of your application - be it a set of packages or even a particular class - in much the same way you can in logback.xml. Spring Boot also allows you to use the same property syntax for this. For example, to enable DEBUG output only in our service packages or specifically within our WidgetService class, we can instead add the following to application.properties:

logging.level.com.example.demo.service=DEBUG
logging.level.com.example.demo.service.WidgetService=DEBUG

Checking the log output again, you should see most of the noise from before disappear - leaving only our targeted package at the lower log level. Much more useful! You might notice that this really just has the same effect as updating the logback.xml file directly, and you would be correct. The big difference here however is that when combined with the VM argument trick from before, you now have much finer-grained control over your loggers after your application is deployed.

INFO 8768  --- [nio-8080-exec-1] com.example.demo.service.WidgetService   : Finding widgets for bob
DEBUG 8768 --- [nio-8080-exec-1] com.example.demo.service.WidgetService   : Background working happening to find widgets..
DEBUG 8768 --- [nio-8080-exec-1] com.example.demo.service.WidgetService   : Found user bill, returning widget 2

This is another improvement, but still has one big problem - you need to restart your application for the changes to be applied. This might be ok for some cases, but for example if the underlying issue was due to some particular bad state of a local cache, restarting would reset the component, hide the underlying cause and make it much harder to reproduce further. Depending on the issue it may be possible to get your application back into the same state again after rehydrating caches etc, but sometimes “turn it off and on again” just hides the larger underlying problem.

As a last improvement, it would be great if you could dynamically change the log levels for particular targeted areas at runtime - without having to rebuild, redeploy or restart.

Spring Boot Actuator

Actuator, amongst other things, allows you to do just this through one of its administration endpoints. To enable this functionality we first need to add the starter dependency to the build.gradle file:

implementation 'org.springframework.boot:spring-boot-starter-actuator'

By default actuator doesn’t expose any of its admin endpoints, so we also need to explicitly enable the loggers feature (and others if you need them). Add the following into application.properties:

management.endpoints.web.exposure.include=loggers

If we now visit http://localhost:8080/actuator in the browser (or as GET request from elsewhere) you should see just the logger endpoints are now enabled:

{
  "_links": {
    "self": {
      "href": "http://localhost:8080/actuator",
      "templated": false
    },
    "loggers": {
      "href": "http://localhost:8080/actuator/loggers",
      "templated": false
    },
    "loggers-name": {
      "href": "http://localhost:8080/actuator/loggers/{name}",
      "templated": true
    }
  }
}

Actuator hooks into the log system used within your app (be it Logback or Log4j) and allows you to interact with these endpoints as an API to query and modify the underlying log levels.

Viewing Current Log Levels

First of all visit http://localhost:8080/actuator/loggers (via GET request) to see all the loggers configured within your application and their current levels (it will likely be quite a large list):

{
  "levels": [
    "OFF", "ERROR", "WARN", "INFO", "DEBUG", "TRACE"
  ],
  "loggers": {
    "ROOT": {
      "configuredLevel": "INFO",
      "effectiveLevel": "INFO"
    },
    "com.example.demo.DemoApplication": {
      "configuredLevel": null,
      "effectiveLevel": "INFO"
    },
    "com.example.demo.service": {
      "configuredLevel": null,
      "effectiveLevel": "INFO"
    },
    "com.example.demo.service.WidgetService": {
      "configuredLevel": null,
      "effectiveLevel": "INFO"
    }
  }
}

In the above extract of the output we can see all the available levels, the current level for each area of our application and finally whether or not a custom level has been applied. These are all null for now since we have yet to override anything. In itself this is not especially useful, but it does help in identifying the logger names that we can later customize.

If you want to explicitly target a specific logger, you can add the name to the path (as described in the first actuator output). For example http://localhost:8080/actuator/loggers/com.example.demo.service will return:

{
  "configuredLevel": null,
  "effectiveLevel": "INFO"
}

Modifying Log Levels

In the above examples we have been using simple GET requests to query the current log configuration. The /actuator/loggers/{name} endpoint however also lets you send a POST request that allows you to update the configured level for a particular logger. For example, to change our service loggers to DEBUG level, send a POST request to http://localhost:8080/actuator/loggers/com.example.demo.service with the JSON body:

{
    "configuredLevel": "DEBUG"
}

The corresponding cURL command would be (note the Content-Type header needs to be set as the payload is a JSON object):

curl -i -X POST -H 'Content-Type: application/json' \
  -d '{"configuredLevel": "DEBUG"}' \
  http://localhost:8080/actuator/loggers/com.example.demo.service

HTTP/1.1 204

If successful, the API will return a 204 No Content response. Checking the application logs again after some additional calls were made to the service class, you should see the same DEBUG log output as before, whilst all other output remains at the default INFO level:

INFO 10848  --- [nio-8080-exec-2] com.example.demo.service.WidgetService   : Finding widgets for bob
DEBUG 10848 --- [nio-8080-exec-2] com.example.demo.service.WidgetService   : Background working happening to find widgets..
DEBUG 10848 --- [nio-8080-exec-2] com.example.demo.service.WidgetService   : Found user bill, returning widget 2

To confirm the update, we can also try querying actuator again with the same logger name to view the updated configuration - GET http://localhost:8080/actuator/loggers/com.example.demo.service:

{
    "configuredLevel": "DEBUG",
    "effectiveLevel": "DEBUG"
}

Pretty cool! This gives you a lot of flexibility at runtime to better utilize your logs at different levels to debug and resolve issues. If you so wanted to, you can also target the ROOT logger at http://localhost:8080/actuator/loggers/ROOT, but of course be aware of the potential noise.

Takeaways (TL;DR)

  • Logs are a vital tool when debugging issues, but only if you can see the right lines when you need them. These might not be at INFO level.
  • Developers should be using the various log levels TRACE, DEBUG, INFO, WARN, ERROR appropriately to add additional data points to the log output. In general volume should increase as the level decreases, but more detailed data points will be included.
  • The root logger should be kept at INFO level. Turning on DEBUG logs for our entire application will generate too much noise and will overwhelm both us and our log tooling.
  • Use Spring Boot properties to set specific log levels for particular packages/classes. Pass these in as runtime JVM options for greater flexibility. Note that you will have to restart the app for them to take effect.
  • Spring Boot Actuator gives the most fine-grained control - allowing you to both query and update log levels at runtime through its admin endpoints.



30 Useful Java Libraries

A collection of some interesting Java libraries and frameworks (not including the big ones everyone uses such as Spring or JUnit). Hopefully it contains some lesser known libraries which you might not have heard of before and may find useful.

  • ArchUnit - Write unit tests to enforce the architecture of your Java code. E.g. all your service classes follow established conventions for naming and package layout through to finding cyclic dependencies.
  • AssertJ - Forget Hamcrest or those provided by JUnit, AssertJ is the only library you will need for writing assertions. Provides countless rich integrations which are easily discoverable through auto-completion.
  • Awaitility - Testing asynchronous systems can be challenging. Awaitility provides a simple DSL that allows you to define the expectations of async operations within tests - without having to deal with threads, timeouts and other concurrency issues.
  • Caffeine - A high performance, near optimal caching library with a modern lambda-based API. Can be easily integrated into Spring applications through its standard caching abstractions.
  • Eclipse Collections - Perhaps not needed in all cases, but can give much improved GC times and overall performance over the standard Java collections in some high throughput pipelines. Also includes immutable versions for better safety in parallel workloads.
  • Failsafe - A lightweight, zero-dependency library for handling all kinds of application failures. Provides a concise API that allows you to wrap executable logic within a number of resilience policies such as Retry, Timeout, Fallback and CircuitBreaker.
  • FastExcel - Apache POI has numerous features for interacting with Excel files, but the API is cumbersome and resource consumption can be a problem for larger files. FastExcel provides a much simpler API to generate and read big Excel files quickly.
  • Guava - Numerous core Java libraries from Google including additional collection types and utilities for concurrency, I/O, hashing, caching and strings. Perhaps becoming less necessary now, but still a great extension over the Java standard library.
  • Handlebars - Simple, logic-less templating engine when you want something a bit more lightweight than Freemarker or Thymeleaf.
  • HikariCP - The best JDBC connection pool out there - fast, simple, reliable and with “zero-overhead”. The default in newer Spring Boot based applications.
  • Immutables - For those that prefer not to use Lombok, but still like all the benefits of immutable classes. Immutables is an annotation processor that generates simple, fast and consistent value objects (data classes) without all the traditional boilerplate.
  • Faker - Generate all kinds of fake data (names, addresses, emails and many more). Really useful when you need to create some throwaway data as part of tests or quick proof-of-concepts/demos.
  • Jimfs - An in-memory file system for Java from Google. Great for unit testing when you don’t want to/can’t mock the underlying filesystem (usually due to static methods from java.nio.Files etc).
  • Jib - Build optimised Docker images from your Java applications directly from Maven/Gradle, without the need for CLI dependencies or Dockerfiles. Jib separates your app into multiple layers (splitting dependencies from classes) for much quicker rebuilds.
  • jOOR - The Java Reflection API is powerful, but often cumbersome to use. jOOR provides a simple, fluent wrapper that gives much more intuitive access to the standard meta information and class structures.
  • Lombok - A set of annotations that helps remove all the typical boilerplate and verbosity in Java applications. Easily auto-generate data classes with constructors, getters/setters, toString, equals/hashcode, builders, loggers and more.
  • Micrometer - A vendor-neutral application metrics facade (think SLF4J, but for metrics). Support for dimensional metrics and Spring Boot Actuator offers a number of useful out-of-the-box integrations.
  • Mug - A small set of utilities over Java 8 Streams and Optionals. Also includes a functional style Maybe implementation.
  • Picocli - A user-friendly framework for building command-line based apps with the JVM. Supports autocompletion, colours, subcommands and is GraalVM compatible.
  • Resilience4J - A lightweight fault tolerance library for Java 8+ and a successor for Netflix Hystrix. Provides decorators for Circuit Breakers, Rate Limiters, Retries and Bulkheads. Can also be used as a circuit breaker implementation within Spring Cloud Circuit Breaker.
  • Rest Assured - A powerful DSL for writing tests over your RESTful APIs.
  • Retrofit - A lightweight and type-safe HTTP client for the JVM. A common go-to within Android applications to interact with external services.
  • ShedLock - A distributed lock that makes sure your scheduled tasks are executed at most once at the same time. Uses an external store like Redis, a JDBC database or Zookeeper for coordination.
  • StreamEx - Many enhancements over the standard Java Stream API that make common tasks shorter and more convenient. Fully compatible with the built-in Java 8 Stream classes, but provides many additional useful methods.
  • TestContainers - Write proper integration tests that use Docker to spin up throwaway instances of databases, queues, browsers and more. Integrates with JUnit and doesn’t require any extra complex configuration. No longer feel the need to mock out external services!
  • ThreeTen Extra - Additional date-time classes that complement those already in Java 8 (MutableClock, LocalDateRange etc). Curated by the primary author of the java.time API.
  • Vavr - An essential for all the functional programmers out there. Provides all the essential data types and functional control structures alongside a number of extra modules for integrations with other libraries and frameworks.
  • Verbal Expressions - A library that helps to construct difficult regular expressions with a fluent and human-readable API.
  • WireMock - A library for stubbing and mocking HTTP services. Allows you to construct a standalone local web server that can be used within integration tests to verify expected behaviour against external APIs. Also integrates with Spring Cloud Contract.
  • Yavi - A lambda-based, type-safe validation library with no reflection, annotations or dependencies. For those who are not fans of using the Bean Validation frameworks.

For even more Java libraries check out https://github.com/akullpp/awesome-java


Gathering Metrics with Micrometer and Spring Boot Actuator

Why Metrics?

Metrics, alongside tracing and logging, form the concept of observability - one of the key cornerstones of DevOps - so hopefully it should be of no surprise to anyone of its importance. As we build larger and more distributed platforms, maintaining sufficient visibility into what the system is doing, when it’s doing it and how well it’s performing becomes more difficult, but also more vital. We can likely no longer just directly query the application for these measurements anymore. Instead, we require aggregations across all our service instances, as well as the ability to drill-down into particular aspects of a metric to gain most insight.

The usefulness of metrics also goes well past just capturing overall system health and resource usage. We can still observe common things like memory and CPU usage, garbage collection and thread utilization metrics - from the SRE perspective these can still go a long way to monitor and manage a production system - but we should also be looking to go much further than just machine level statistics. Well-defined business level metrics, collected automatically by our applications in real-time, are a prerequisite to both our ability to define and measure against SLA’s/SLO’s, alongside our ability to make well informed, data-driven decisions as part of the wider Agile process.

Micrometer

Libraries and frameworks for gathering metrics are nothing new in the Java world. Things like Dropwizard have been around for many years now, but more recently Micrometer has taken a more prominent position - and for good reason. In a lot of ways it supersedes the alternatives.

Micrometer provides a simple facade over the instrumentation clients for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in. Think SLF4J, but for metrics.

The above description is straight from the Micrometer website and I think the SLF4J analogy sums it up quite well. Just like for logging where we use the core API from SLF4J and then pick and choose whichever concrete logging library we want (Logback, Log4J), the same applies here, but for metrics. The core Micrometer library provides the API’s and interfaces for defining timers, gauges, counters (and more), but out-of-the-box doesn’t do too much more than just store them in memory. After you have defined your metrics, you can then plug-in connectors for a number of external monitoring systems or time-series databases - Azure Monitor, Netflix Atlas, Datadog, Graphite, InfluxDB, Prometheus or even just simple JMX. The main facade stays intact, which then quite rightly decouples the collection of metrics from any/all external consumers.

A nice and modern API is all well and good, but perhaps the larger points to mention here are its support for dimensional metrics (see below) and the first-party support from Spring. From Spring Boot 2.0 (1.x used Dropwizard) each component is integrated with Micrometer and will automatically publish metrics out-of-the-box (again, more on this below).

Hierarchical vs. Dimensional Metrics

Traditionally, metrics have really been not much more than just simple key/value pairs - where the key is usually some dot-separated string namespaced into a particular application area and the value is a simple numerical measurement. This is what’s known as a hierarchical metric, where, as the key gets longer, the measurement gets more specific. This used to be the way that most libraries captured metrics and indeed is how Dropwizard and Spring Boot 1.x function. Hierarchical metrics do however have a large limitation - consider the below keys for capturing metrics for HTTP endpoints:

http.requests
http.requests.getwidgets
http.requests.getwidgets.region
http.requests.getwidgets.region.200
http.requests.getwidgets.region.200.username
http.requests.getwidgets.region.200.username.duration
...

From the top metric we measure the most general aggregate - total number of HTTP requests served - and then get gradually more specific by introducing additional fields. These tend to make the metric more insightful, as we are likely going to need to drill-down to a specific region/datacentre or endpoint name and then find the total number of 200 or 500 responses for example. Hopefully you can see the issue here - as we add more fields, the number of metrics we gather starts to grow and grow. The above example is just for a single endpoint, username and status code; they can quickly start to multiply as your system is used. This is what’s known as a cardinality explosion and something to be careful of in applications using hierarchical metrics. It can increase resource consumption and cause pressure on the external systems trying to consume and report on them.

The case for hierarchical metrics gets worse though as the final measurement is tied to the full key. This is fine if it represents exactly what you’re looking for - in the above example, perhaps reporting on the total number of 200 responses for the getwidgets endpoint in a particular region is helpful - but what if you need a different combination?

  • total number of 200 responses served (not just getwidgets)
  • total number of requests made by a particular username (independent of endpoint or status code)
  • overall duration spent handling getwidgets requests (independent of status code or region)

With this set of metrics the above questions are not trivial to answer. You would either be forced to create even more metrics specific to these cases (making the aforementioned cardinality explosion worse), or be forced to perform complex extra aggregation steps in your reporting tool of choice.

The idea of dimensional metrics aims to solve these issues and overall make your metrics considerably more powerful. Instead of including fields directly in the key, these are instead added as tags to a more general base metric. The equivalent for the above hierarchical metrics becomes:

http.requests
	endpoint: getwidgets
	method: GET
	status: 200
	username: ryan
	region: rr

This kind of structure is supported out-of-the-box by Micrometer, whereby each tag (or dimension) is added as a parameter in each measurement. The connectors also understand how to send these metrics down to your monitoring system of choice, be it Prometheus which supports dimensional metrics, or just JMX (in which case the dimensional metrics will be flattened out).

The real benefit of dimensional metrics comes into play when partnered with a time series database (such as Prometheus). This allows for the ability to drill-down across dimensions at will, without introducing any of the downsides of hierarchical metrics. Going back to the example above, you can add specific predicates on any number of dimensions and get out the expected aggregation directly. Answering a complex query like “give me all the 500 responses from the getwidgets endpoint, across all regions and from users X,Y and Z from the last 24hrs and compare against the same data from yesterday” becomes trivial.
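
As a rough sketch (assuming the metric above is exported to a Prometheus-style backend as http_requests_total, with the user list purely illustrative), such a drill-down might look something like:

sum by (region) (
  rate(http_requests_total{endpoint="getwidgets", status="500", username=~"x|y|z"}[5m])
)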

Spring Integration

As I mentioned before, Micrometer is integrated with Spring Boot Actuator and so requires just the single dependency to get running:

implementation 'org.springframework.boot:spring-boot-starter-actuator'

The Spring Boot Actuator starter dependency does a number of useful things which I’ll cover in future posts, but for now we’ll focus just on the metrics support. By default, Spring configures bindings to begin automatically publishing core metrics across many areas:

  • JVM - memory, buffer pools, thread utilization, classes loaded
  • CPU and File Descriptor usage
  • Logback - number of events logged at each level
  • Spring MVC - HTTP requests/responses and errors
  • HTTP Clients - instrumentation on RestTemplate or WebClient instances
  • Kafka - native metrics on all configured producers/consumers
  • Data Sources - HikariCP bindings for connection pool size and usage
  • Caches - Caffeine/JCache bindings for hit/miss counts, sizes

Hopefully you will agree that this is a fairly comprehensive list already for out-of-the-box behaviour (just dependent on which beans you have created in your app), and this is before you have created any of your own custom metrics.
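
As a brief sketch of what a custom metric might look like (this class is illustrative and not part of the demo app below), Micrometer lets you register your own counters, gauges and timers against the auto-configured MeterRegistry:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Service;

@Service
public class SignupService {

    // a custom dimensional counter, registered against the injected MeterRegistry
    private final Counter signupCounter;

    public SignupService(MeterRegistry registry) {
        this.signupCounter = Counter.builder("signups.total")
                .description("Total number of user signups")
                .tag("source", "web")
                .register(registry);
    }

    public void signup() {
        // ... business logic ...
        signupCounter.increment();
    }
}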

Publishing Metrics

To try these features out we’ll create a quick contrived demo app to test some of the integration areas. First of all the main Spring app config:

@EnableCaching
@Configuration
public class AppConfig {

    @Value("${spring.application.name}")
    private String applicationName;

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
        return registry -> registry.config().commonTags("region", "us-east-1", "application", applicationName);
    }

    @Bean
    public CacheManager cacheManager() {
        CaffeineCacheManager cacheManager = new CaffeineCacheManager();
        cacheManager.setCaffeine(Caffeine.newBuilder().recordStats());
        cacheManager.setCacheNames(Collections.singletonList("widgetCache"));
        return cacheManager;
    }

    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder restTemplateBuilder) {
        return restTemplateBuilder.build();
    }

}

There are a few things going on here:

  • we use a Spring MeterRegistryCustomizer bean to add a common set of tags to all metrics that get published. This is useful to attach instance level details, such as application name and region/datacentre that the service is running in.
  • we create a CacheManager bean using Caffeine as the underlying caching mechanism. The recordStats() portion instructs Caffeine to record metrics that Micrometer can then bind to (also make sure to @EnableCaching).
  • we create a standard RestTemplate bean, making sure to use the RestTemplateBuilder so that the metrics instrumentation gets added.

Next is a simple @Service class to give something for our endpoints to call:

@Service
@RequiredArgsConstructor
public class WidgetService {

    private static final Logger log = LoggerFactory.getLogger(WidgetService.class);
    private final RestTemplate restTemplate;
    
    @Cacheable("widgetCache")
    public String getWidgets(String user) {
        log.info("Finding widgets for {}", user);
        if(user.equals("bill")) return "widget 1";
        if(user.equals("bob")) return "widget 2";
        else {
            log.error("Could not find widgets for {}", user);
            throw new IllegalArgumentException("Unknown user");
        }
    }

    public String saveWidget(String input) {
        ParameterizedTypeReference<Map<String, String>> responseType = new ParameterizedTypeReference<>() { };
        log.info("Saving widget with input: {}", input);
        ResponseEntity<Map<String, String>> responseEntity =
                restTemplate.exchange("https://httpbin.org/uuid?param={input}", HttpMethod.GET, null, responseType, input);
        log.info("Successfully saved widget!");
        return responseEntity.getBody().get("uuid");
    }
}

Nothing too interesting here. There is a simple getWidgets method to return some static data based on our user input (which also uses our new cache manager). There is a path to generate an exception so we can ensure the corresponding metrics are created. We also make use of the RestTemplate to call a test endpoint from httpbin.org to generate a basic UUID (simulating some system latency).

Finally, there are two very simple endpoints that just call our service class (and generate some HTTP request metrics):

@RestController
@RequiredArgsConstructor
public class WidgetEndpoint {

    private final WidgetService widgetService;

    @GetMapping(value = "/widgets", produces = "text/plain")
    public ResponseEntity<String> getWidgets(@RequestParam("user") String user) {
        return ResponseEntity.ok(widgetService.getWidgets(user));
    }

    @PostMapping(value = "/widgets", produces = "text/plain")
    public ResponseEntity<String> saveWidget(@RequestParam("input") String input) {
        return ResponseEntity.ok(widgetService.saveWidget(input));
    }
}

If you build, run the app and hit the endpoints, they will work as expected, but we don’t yet have a way to confirm any metrics actually get created.

Spring Boot Actuator Metrics

The Spring Boot Actuator comes built-in with a number of endpoints we can use to manage our application. One of these allows you to view and query all the metrics generated by your application, but it’s disabled by default. Add the following to application.properties to enable it:

spring.application.name=widgetservice
management.endpoints.web.exposure.include=metrics

If you now visit http://localhost:8080/actuator/metrics in your browser, actuator will give you a list of all the top-level metrics (filtered below for brevity).

{
  "names": [
    "cache.gets",
    "cache.size",
    "http.server.requests",
    "jvm.buffer.memory.used",
    "jvm.classes.loaded",
    "jvm.gc.pause",
    "jvm.memory.max",
    "jvm.memory.used",
    "jvm.threads.live",
    "logback.events",
    "process.cpu.usage",
    "process.uptime",
    "system.cpu.usage",
    "tomcat.sessions.active.current",
    "tomcat.sessions.active.max"
  ]
}

Most of these should look familiar when compared with the list we saw before - they are metrics published automatically by our application without having to perform any real work. The actuator metrics endpoint also lets you view specific metrics: http://localhost:8080/actuator/metrics/jvm.memory.used

{
  "name": "jvm.memory.used",
  "description": "The amount of used memory",
  "baseUnit": "bytes",
  "measurements": [
    {
      "statistic": "VALUE",
      "value": 89722280
    }
  ],
  "availableTags": [
    {
      "tag": "area",
      "values": [
        "heap",
        "nonheap"
      ]
    },
    {
      "tag": "application",
      "values": [
        "widgetservice"
      ]
    },
    {
      "tag": "id",
      "values": [
        "G1 Old Gen",
        "CodeHeap 'non-profiled nmethods'",
        "G1 Survivor Space",
        "Compressed Class Space",
        "Metaspace",
        "G1 Eden Space",
        "CodeHeap 'non-nmethods'"
      ]
    },
    {
      "tag": "region",
      "values": [
        "us-east-1"
      ]
    }
  ]
}

At the top we can see the aggregate measured value for total memory used and then we can also see a number of tags/dimensions associated with it (including the common set we added in the app config). This is where dimensional metrics get interesting, as you can now drill-down specifically into particular areas. For example, to view just heap memory usage: http://localhost:8080/actuator/metrics/jvm.memory.used?tag=area:heap

{
  "name": "jvm.memory.used",
  "description": "The amount of used memory",
  "baseUnit": "bytes",
  "measurements": [
    {
      "statistic": "VALUE",
      "value": 47795248
    }
  ],
  "availableTags": [
    {
      "tag": "id",
      "values": [
        "G1 Old Gen",
        "G1 Survivor Space",
        "G1 Eden Space"
      ]
    }
  ]
}

This gives us a more specific measurement and a different set of tags that we could use to further drill-down into the metric e.g. http://localhost:8080/actuator/metrics/jvm.memory.used?tag=area:heap&tag=id:G1%20Eden%20Space to inspect the total memory used in the heap G1 Eden Space only.
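
Under the hood, the actuator endpoint is simply searching the application's MeterRegistry for meters matching the requested name and tags and summing their values. If you ever need the same drill-down programmatically, here is a rough sketch (assuming the MeterRegistry is injected into one of your beans):

// equivalent of /actuator/metrics/jvm.memory.used?tag=area:heap
double heapUsed = meterRegistry.get("jvm.memory.used") // io.micrometer.core.instrument.MeterRegistry
        .tag("area", "heap")                           // same tag predicate as the URL above
        .gauges()                                      // jvm.memory.used is registered as a set of gauges
        .stream()
        .mapToDouble(Gauge::value)                     // io.micrometer.core.instrument.Gauge
        .sum();                                        // aggregate across the matching memory pools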

Cache Metrics

Now to actually call some of our own code to generate some other metrics. We can use the getWidgets endpoint to make use of our cache: http://localhost:8080/widgets?user=bill. If we call that multiple times, we should be getting the result straight from the cache instead of calling the method itself. We can consult the cache metrics to confirm: http://localhost:8080/actuator/metrics/cache.gets?tag=cache:widgetCache&tag=result:hit

{
  "name": "cache.gets",
  "description": "The number of times cache lookup methods have returned a cached value.",
  "baseUnit": null,
  "measurements": [
    {
      "statistic": "COUNT",
      "value": 3
    }
  ],
  "availableTags": [
    {
      "tag": "name",
      "values": [
        "widgetCache"
      ]
    }
  ]
}

Here we are using the cache.gets metric, with tag predicates on the cache name and result, in order to assess how well our cache is utilized. Similarly, you could also inspect the misses tag to calculate the cache hit/miss ratio. You can also use the cache.size metric to observe trends in how many values are loaded into your caches: http://localhost:8080/actuator/metrics/cache.size?tag=name:widgetCache
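
If you wanted the hit/miss ratio inside the application itself, rather than computing it in your monitoring tooling, here is a rough sketch along the same lines (the helper method is hypothetical and not part of the original service):

// sums the cache.gets count for a given result tag ("hit" or "miss")
private double cacheGets(String result) {
    return Search.in(meterRegistry)                    // io.micrometer.core.instrument.search.Search
            .name("cache.gets")
            .tag("cache", "widgetCache")
            .tag("result", result)
            .meters().stream()
            .flatMap(m -> StreamSupport.stream(m.measure().spliterator(), false))
            .mapToDouble(Measurement::getValue)
            .sum();
}

// hit ratio = hits / (hits + misses)
double hitRatio = cacheGets("hit") / (cacheGets("hit") + cacheGets("miss"));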

HTTP Client Metrics

Next up we can call the saveWidget endpoint to make use of our instrumented RestTemplate to call an external service - a POST to http://localhost:8080/widgets?input=something. If you call the /actuator/metrics endpoint again, you should see a new entry, http.client.requests:

{
  "name": "http.client.requests",
  "description": "Timer of RestTemplate operation",
  "baseUnit": "seconds",
  "measurements": [
    {
      "statistic": "COUNT",
      "value": 1
    },
    {
      "statistic": "TOTAL_TIME",
      "value": 0.6493909
    },
    {
      "statistic": "MAX",
      "value": 0.6493909
    }
  ],
  "availableTags": [
    {
      "tag": "method",
      "values": [
        "GET"
      ]
    },
    {
      "tag": "clientName",
      "values": [
        "httpbin.org"
      ]
    },
    {
      "tag": "uri",
      "values": [
        "/uuid?param={input}"
      ]
    },
    {
      "tag": "outcome",
      "values": [
        "SUCCESS"
      ]
    },
    {
      "tag": "status",
      "values": [
        "200"
      ]
    }
  ]
}

The previous metrics we've looked at were simple counters or gauges, but now we have a timer, which gives us more measurement values. Not only do you get the count, but also the total time taken (allowing you to compute averages) and a max observed value. The tags are also more interesting here - similar to the above, you can easily drill down to answer some interesting questions:

  • how many requests were made to client X in total within a certain time period?
  • how many requests were made to the /uuid endpoint on client X which failed?
  • how long did the application spend waiting for client X to respond to our calls?

HTTP Request Metrics

By now we should have produced enough traffic to generate some good HTTP request metrics. You could also call the getWidgets endpoint with an unknown user to generate an exception: http://localhost:8080/widgets?user=none. The http.server.requests metric is used to capture measurements for our HTTP endpoints:

{
  "name": "http.server.requests",
  "description": null,
  "baseUnit": "seconds",
  "measurements": [
    {
      "statistic": "COUNT",
      "value": 23
    },
    {
      "statistic": "TOTAL_TIME",
      "value": 0.7761524969999999
    },
    {
      "statistic": "MAX",
      "value": 0.0031665
    }
  ],
  "availableTags": [
    {
      "tag": "exception",
      "values": [
        "None",
        "IllegalArgumentException",
        "BadOperationRequestException"
      ]
    },
    {
      "tag": "method",
      "values": [
        "GET",
        "POST"
      ]
    },
    {
      "tag": "uri",
      "values": [
        "/saveWidget",
        "/getWidgets",
        "/actuator/metrics",
        "/**"
      ]
    },
    {
      "tag": "outcome",
      "values": [
        "CLIENT_ERROR",
        "SUCCESS",
        "SERVER_ERROR"
      ]
    },
    {
      "tag": "status",
      "values": [
        "404",
        "200",
        "400",
        "500"
      ]
    }
  ]
}

Similarly to the http.client.requests metric, we have a timer meter and a number of useful tags to inspect:

  • how many requests resulted in 500 status codes within a certain time period?
  • how many times was a POST request made to the /widgets endpoint?
  • how many requests made to the /widgets endpoint resulted in IllegalArgumentException?
  • how long did it take to respond to all GET requests which resulted in 500 status codes?

Logging Metrics

A perhaps lesser known, but I think very useful, metric is logback.events, which keeps track of all lines that were logged and at what level. We can then query, for example, just for the ERROR log lines generated over time: http://localhost:8080/actuator/metrics/logback.events?tag=level:error:

{
  "name": "logback.events",
  "description": "Number of error level events that made it to the logs",
  "baseUnit": "events",
  "measurements": [
    {
      "statistic": "COUNT",
      "value": 4
    }
  ]
}

DataSource Metrics

Finally, we have metrics produced by any JDBC DataSource instances available in our app. These can be created either through the Spring Boot spring.datasource auto-configuration, or by creating your own. HikariCP exposes a number of metrics over its connection pool, such as hikaricp.connections.usage and hikaricp.connections.idle. This can be helpful if you're trying to track down connection pool exhaustion issues in your applications.
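
When you rely on the spring.datasource auto-configuration and a MeterRegistry is present, Spring Boot binds these pool metrics for you. If you build the pool yourself, you can wire in the Micrometer tracker manually - a minimal sketch, using a hypothetical in-memory JDBC URL:

@Bean
public DataSource dataSource(MeterRegistry meterRegistry) {
    HikariDataSource dataSource = new HikariDataSource();
    dataSource.setPoolName("widgetPool");         // becomes the pool tag on the hikaricp.* metrics
    dataSource.setJdbcUrl("jdbc:h2:mem:widgets"); // hypothetical database for illustration
    // com.zaxxer.hikari.metrics.micrometer.MicrometerMetricsTrackerFactory
    dataSource.setMetricsTrackerFactory(new MicrometerMetricsTrackerFactory(meterRegistry));
    return dataSource;
}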

Creating Custom Metrics

All the built-in Spring metrics are great, but you will often also need to create and publish your own custom metrics. Spring creates a Micrometer MeterRegistry instance by default, which you can inject and use to create custom counters, gauges and timers. As a very simple and contrived example, we can create some counters in our widget service:

private final MeterRegistry meterRegistry;

public String getWidgets(String user) {
    log.info("Finding widgets for {}", user);
    if (user.equals("bill")) {
        // increment the counter for this user (the meter is created on first use)
        meterRegistry.counter("custom.widgets", "user", "bill").increment();
        return "widget 1";
    } else if (user.equals("bob")) {
        meterRegistry.counter("custom.widgets", "user", "bob").increment();
        return "widget 2";
    } else {
        log.error("Could not find widgets for {}", user);
        throw new IllegalArgumentException("Unknown user");
    }
}

Here we use our injected MeterRegistry to create a new counter called custom.widgets (this will be the base name of your metric). We also provide a custom tag so we can filter based on username. Note that you generally don't want to create tags that can take a very large number of possible values (high cardinality). In the example above we know there can only ever be two, but unbounded tag values can cause resource usage issues in both your application and the downstream time series database.

Once the metric is published, you should be able to see it show up in actuator just like any other metric: http://localhost:8080/actuator/metrics/custom.widgets

{
  "name": "custom.widgets",
  "description": null,
  "baseUnit": null,
  "measurements": [
    {
      "statistic": "COUNT",
      "value": 2
    }
  ],
  "availableTags": [
    {
      "tag": "user",
      "values": [
        "bob",
        "bill"
      ]
    }
  ]
}

Since it's a counter meter, we just have the single count measurement, alongside the custom user tag that we passed in. We could then drill down as needed into how many widgets a specific user has asked for etc. (note that in this case the count tops out at 2 - one per user - because the @Cacheable annotation means repeat calls are served from the cache and never reach the method body).
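
Counters are only one meter type - the same injected MeterRegistry can also create gauges and timers programmatically. A minimal sketch, with hypothetical metric names:

// gauge: samples the size of a collection whenever the metric is read
// (keep a reference to the returned list - Micrometer only holds it weakly)
List<String> pendingWidgets = meterRegistry.gauge("custom.widgets.pending",
        new ArrayList<String>(), List::size);

// timer: manually time an arbitrary block of code
Timer.Sample sample = Timer.start(meterRegistry);
String result = getWidgets("bill");
sample.stop(meterRegistry.timer("custom.widgets.lookup", "user", "bill"));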

Finally, in Spring Boot we can also create a custom timer around a method call using the @Timed annotation. You also need to create a TimedAspect bean for Micrometer to recognise and instrument the annotated methods:

@Bean
public TimedAspect timedAspect(MeterRegistry registry) {
    return new TimedAspect(registry);
}

@Timed
public String saveWidget(String input) { ... } // existing method body unchanged

We can then view the timing data in the method.timed base metric, which can be filtered down by tags on class and method names: http://localhost:8080/actuator/metrics/method.timed?tag=method:saveWidget

{
  "name": "method.timed",
  "description": null,
  "baseUnit": "seconds",
  "measurements": [
    {
      "statistic": "COUNT",
      "value": 3
    },
    {
      "statistic": "TOTAL_TIME",
      "value": 1.0327454
    },
    {
      "statistic": "MAX",
      "value": 0.600250599
    }
  ],
  "availableTags": [
    {
      "tag": "exception",
      "values": [
        "none"
      ]
    },
    {
      "tag": "class",
      "values": [
        "com.example.demo.WidgetService"
      ]
    }
  ]
}
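
If method.timed is too generic, the @Timed annotation also accepts a custom metric name, extra tags and percentile options - a hypothetical example:

@Timed(value = "widget.save.time", extraTags = {"source", "api"}, percentiles = {0.95, 0.99})
public String saveWidget(String input) { ... }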

This just scratches the surface of the Micrometer API for defining your own metrics. Take a look at the docs page https://micrometer.io/docs/concepts, which goes into more detail about each available meter type.
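
For example, a DistributionSummary behaves like a timer but records arbitrary values (payload sizes, item counts) rather than durations - a quick sketch with a hypothetical metric name:

// io.micrometer.core.instrument.DistributionSummary
DistributionSummary widgetSizes = DistributionSummary.builder("custom.widgets.size")
        .baseUnit("bytes")        // appears as the baseUnit in actuator, like the examples above
        .tag("user", "bill")
        .register(meterRegistry);

widgetSizes.record(1024);         // contributes count, total and max measurements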

Aggregation and Visualization

Capturing and publishing metrics is one thing, but in a multi-instance environment we also need some way of aggregating all the measurements and visualizing state across the whole platform. The built-in Spring Boot Actuator endpoint is useful as a quick admin tool, but generally we need to use the Micrometer connectors alongside external monitoring tools to get the full benefits. There are a lot of options here, but Prometheus tends to be the go-to as a time series database and monitoring tool. For Spring based apps, you just need to add the appropriate Micrometer registry dependency and the rest is handled for you:

implementation 'io.micrometer:micrometer-registry-prometheus'
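
As with the metrics endpoint earlier, the new prometheus actuator endpoint also needs to be added to the web exposure list in application.properties:

management.endpoints.web.exposure.include=metrics,prometheus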

Actuator will then expose a dedicated /actuator/prometheus endpoint which Prometheus can scrape to gather the latest measurements (the Prometheus-side scrape configuration itself is not covered here).

There are generally two ways to get your metrics into a time series database - the push model or the pull model. In the push model, your application itself has to send all your metrics downstream. This might give you slightly more real-time measurements, but it also introduces issues if the monitoring tool becomes unavailable - your app might end up blocking a thread trying to publish historical metrics to a system that is down or otherwise overwhelmed. It also introduces unneeded coupling between your application and the external monitoring tooling. In the pull model, which tools like Prometheus favour, your application exposes a simple and optimized endpoint for its metrics and it is then the responsibility of the external tool to continuously poll for the latest measurements. Crucially, this minimises any negative impact on your application itself and avoids the coupling. It does however require a service discovery mechanism (such as Eureka or Zookeeper) to be configured so that the tooling can find and pull metrics from each of the application instances.

Prometheus provides its own basic visualization tooling, but Grafana is commonly layered on top of the stack to offer richer querying and graphing feature sets, alongside alerting and reporting. Prometheus still stores the raw data, but you can put together dashboards in Grafana, such as the one below for a generic Spring Boot app.

Grafana Spring Boot Dashboard

Takeaways (TL;DR)

  • metrics are an essential part of any application, not only in assessing health and stability, but also to make data-informed decisions on the future direction of your application
  • for JVM based apps, Micrometer is the way to go for metrics collection (especially when using Spring)
  • dimensional metrics are extremely powerful - and resolve longstanding issues with conventional hierarchical metrics
  • Spring Boot 2+ and Spring Actuator have great built-in support for metrics with Micrometer and out-of-the-box integration for a number of key areas
  • default application metrics should only form part of the overall picture and must be partnered with business-level metrics - which are well-defined against established and agreed SLAs
  • an external time series database such as Prometheus should be used to aggregate metrics across all application instances and allow tools such as Grafana to provide dynamic visualisations on top of the underlying data.
