
Prometheus Monitoring Guide Part 1 - Installation and Instrumentation

An open-source systems monitoring and alerting toolkit. Now a standalone open source project, maintained independently of any company. Analyse how your applications and infrastructure are performing based on the metrics they publish. Particularly suitable for large distributed systems with ephemeral instances.

  • metrics are stored in a multi-dimensional data model instead of hierarchical, where each measurement consists of a name and a number of key/value pairs
    • http.server.requests.count(uri="/endpoint", method="GET") 12 instead of http.server.requests.GET.endpoint=12
    • backed by its own custom time-series database built specifically for metrics
    • provides its own query language, PromQL, a read-only and flexible query language for aggregation over time series data (see the example after this list)
  • no reliance on distributed storage; single server nodes are autonomous
  • time series collection happens via a pull model over HTTP
  • targets are discovered via service discovery or static configuration
  • multiple modes of graphing and dashboarding support
  • Alternative to Graphite, InfluxDB
  • Ecosystem provides a number of pre-built exporters that expose metrics ready for Prometheus to scrape
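
Since every measurement carries its labels, PromQL can filter and aggregate over those key/value pairs directly. A rough sketch, assuming the counter from the example above is exposed as http_server_requests_count with method and uri labels (names here are illustrative, not prescribed):

# total requests across all label combinations
sum(http_server_requests_count)

# only GET requests to /endpoint, broken down per instance
sum by (instance) (http_server_requests_count{method="GET", uri="/endpoint"})

# per-second rate of requests over the last 5 minutes
rate(http_server_requests_count[5m])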

Architecture

Prometheus Architecture

  • main Prometheus server consists of a time series database storing each of the captured measurements
    • alongside a scraper which pulls samples from external applications, hosts, platforms
    • and an HTTP server which allows operations to be performed on the tsdb (e.g. querying via PromQL)
  • Prometheus is a single-instance component; all data is stored on local node storage
    • if you need to scale, recommendation is to spin up multiple separate Prometheus instances with different/replicated targets
  • operates in a pull model, whereby Prometheus is set up to periodically scrape the metrics from all target application instances
    • therefore has to know about the location of all active instances via service discovery
    • more easily tell if a target is down, can manually go to a target and inspect its health with a web browser
    • application itself has no knowledge of Prometheus apart from an endpoint exposing the latest metrics snapshot (/metrics)
  • the pull model can be problematic for short-lived/batch operations which may not be alive long enough to be scraped
    • Pushgateway component can be used as a middle-man - short-lived jobs push their metrics to it and Prometheus then scrapes the Pushgateway like any other target (see the example after this list)
  • service discovery integration into Kubernetes, AWS, Azure to understand current landscape of all target nodes
  • after metrics are scraped and stored in tsdb, can be made available to users through web UI/Grafana/API
  • AlertManager component can be used with a set of rules querying the metrics to generate system alerts
    • performs de-duplication of alerts, throttling, silences etc
    • forwards to email, PagerDuty, Slack etc
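
As a rough sketch of the Pushgateway flow mentioned above (host, job and metric names are illustrative assumptions), a batch job pushes its metrics with a plain HTTP request and Prometheus then scrapes the Pushgateway as a normal target:

# push a single sample for job "my_batch_job" to a Pushgateway instance
echo "my_batch_job_duration_seconds 42.5" | \
  curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/my_batch_job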

Installation

  • provided as a single Go binary from https://prometheus.io/download/ so can be executed directly
    • ./prometheus
    • looks for a prometheus.yml config file in the same directory by default (the default config only scrapes Prometheus itself)
  • Prometheus Web UI available at http://localhost:9090
    • allows you to view current targets, configuration and run queries. For more complex visualizations use Grafana
  • or can be executed through Docker
    • docker run -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

Configuration

  • all Prometheus config provided in prometheus.yml file
  • Prometheus can reload its configuration at runtime. If the new configuration is not well-formed, the changes will not be applied
    • reload is triggered by sending a SIGHUP to the Prometheus process (kill -HUP <pid>)
    • or by sending an HTTP POST request to the /-/reload endpoint when the --web.enable-lifecycle flag is enabled (shown below)
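
For example, assuming Prometheus was started with the --web.enable-lifecycle flag, a reload could be triggered with:

curl -X POST http://localhost:9090/-/reload

The scrape behaviour itself is defined in prometheus.yml:
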
global: # applied to all targets unless overridden
    scrape_interval: 15s # how often to scrape each target
    evaluation_interval: 15s # how often to evaluate predefined rules

scrape_configs: # a set of jobs denoting where to scrape metrics from
    - job_name: "prometheus" # a group of targets, also added as a label to each measurement
      metrics_path: "/metrics" # default
      static_configs: # a hardcoded host/port or could be service discovery
          - targets: ["localhost:9090"] # uses metrics path to pull metrics, by default http

Service Discovery

  • static_configs does not scale to more dynamic environments where instances are added/removed frequently
  • Prometheus can integrate with service discovery mechanisms to automatically update its view of running instances
    • when new instances are added Prometheus will begin scraping them; when they are lost from discovery the targets are removed and their time series go stale
    • built-in integrations with Consul, Azure, AWS or file based if custom mechanism required
  • JSON/YAML file can be published by the platform specifying all targets to scrape from. Prometheus uses it to automatically update targets

Prometheus Service Discovery

- job_name: 'myapp'
  file_sd_configs:
    - files:
        - file_sd.json
      refresh_interval: 5m # default

Where the published file file_sd.json contains all the current live targets and any attached labels:

[
    {
        "targets": ["<host>"],
        "labels": {
            "<labelname>": "<labelvalue>"
        }
    }
]

Relabelling

Prometheus needs to know what to scrape, and that's where service discovery and relabel_configs come in. Relabel configs allow you to select which targets you want scraped, and what the target labels will be. So if you want to, say, scrape this type of machine but not that one, use relabel_configs.

metric_relabel_configs by contrast are applied after the scrape has happened, but before the data is ingested by the storage system. So if there are some expensive metrics you want to drop, or labels coming from the scrape itself (e.g. from the /metrics page) that you want to manipulate that’s where metric_relabel_configs applies.

So as a simple rule of thumb: relabel_configs happens before the scrape, metric_relabel_configs happens after the scrape.

relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_app]
      regex: nginx
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      regex: web
      action: keep
    - source_labels: [__meta_ec2_public_ip] # rewrite private ip to public from service discovery
      regex: "(.*)"
      replacement: "${1}:9100"
      target_label: __address__

You can perform the following action operations:

  • keep: Keep a matched target or series, drop all others
  • drop: Drop a matched target or series, keep all others
  • replace: Replace or rename a matched label with a new one defined by the target_label and replacement parameters
  • labelkeep: Match the regex against all label names, drop all labels that don’t match (ignores source_labels and applies to all label names)
  • labeldrop: Match the regex against all label names, drop all labels that match (ignores source_labels and applies to all label names)

metric_relabel_configs can be used to drop unnecessary time-series before ingestion:

- job_name: cadvisor
  metric_relabel_configs:
      - source_labels: [container_label_JenkinsId]
        regex: ".+"
        action: drop
      - source_labels: [__name__]
        regex: "(container_tasks_state|container_memory_failures_total)"
        action: drop

https://grafana.com/docs/grafana-cloud/billing-and-usage/prometheus/usage-reduction/

Instrumentation

  • there are two ways in which application metrics can be exposed for Prometheus
    • use a client library directly in the application to create and expose a Prometheus endpoint (usually /metrics)
    • use an intermediate proxy exporter instance instrumenting the target application and converting to the Prometheus metrics format

Either way, the metrics are served in the Prometheus text-based exposition format - plain lines containing the metric name, optional labels, the value and an optional timestamp:
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

Conventions and Practices

  • metric names should start with a letter and can be followed by any number of letters, numbers and underscores
  • metrics must have unique names, client libraries should report an error if you try to register the same one twice
  • should have a suffix describing the unit, in plural form (e.g _bytes or _total)
  • should represent the same logical thing-being-measured across all label dimensions
  • every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values (see the example below)
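
A quick sketch of these conventions in the exposition format (names and values here are illustrative):

# good: base unit suffix, bounded label values
process_resident_memory_bytes 2.3e+07
http_requests_total{method="GET", status="200"} 1027

# avoid: high-cardinality labels - every distinct user_id creates a new time series
http_requests_total{user_id="a1b2c3"} 1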

Exporters

There are a number of libraries and servers which help in exporting existing metrics from third-party systems as Prometheus metrics. This is useful for cases where it is not feasible to instrument a given system with Prometheus metrics directly (e.g. the Linux kernel), as the source cannot be modified.

  • https://prometheus.io/docs/instrumenting/exporters/
  • an exporter is a separate process dedicated entirely to pulling metrics from a target system and exposing them as Prometheus metrics
    • “proxy service” converting the target interface into one that can be scraped by Prometheus
  • Common exporters (some official):

    • Node Exporter - hardware and OS metrics exposed by Unix kernels: CPU load, memory, I/O, network (see the example scrape config after this list)
    • MySQL Exporter - database metrics, queries run, timings, pool sizes
    • Blackbox Exporter - probing of endpoints over HTTP, DNS, TCP, ICMP
    • Kafka, Kafka Lag, Nginx, Postgres, Jenkins, AWS, Graphite, JMX
  • for cases in which you have access to modify the application code, instrumentation must be added in order to expose Prometheus metrics
    • can use existing application frameworks to expose default common metrics (Spring Actuator)
    • use the client libraries to add custom metrics to be exposed (Go, Java, Python, Ruby)
  • other metrics libraries offer a facade over the definition of metrics and allow a pluggable Prometheus exporter to be added instead (Micrometer)
    • you don't have to use the Prometheus client library directly, giving increased flexibility in the overall monitoring solution
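
As an example of wiring an exporter in, a minimal scrape job for a Node Exporter running on its default port 9100 might look like the following (host is an assumption):

scrape_configs:
    - job_name: "node"
      static_configs:
          - targets: ["localhost:9100"] # node_exporter default port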

Metric Types

The client libraries offer four core metric types. These are currently only differentiated in the client libraries (to enable APIs tailored to the usage of the specific types) and in the wire protocol. The Prometheus server does not yet make use of the type information and flattens all data into untyped time series.

Counter

  • a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart
  • for example, you can use a counter to represent the number of requests served, tasks completed, or errors
  • do not use a counter to expose a value that can decrease. For example, do not use a counter for the number of currently running processes; instead use a gauge
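
A minimal sketch of defining and incrementing a counter with the official Prometheus Java client (simpleclient); the metric and label names are illustrative:

import io.prometheus.client.Counter;

public class RequestMetrics {
    // counters only ever go up; they reset to zero on process restart
    static final Counter REQUESTS_TOTAL = Counter.build()
            .name("http_requests_total")
            .help("Total number of HTTP requests.")
            .labelNames("method", "code")
            .register();

    public static void recordRequest(String method, String code) {
        REQUESTS_TOTAL.labels(method, code).inc();
    }
}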

Gauge

  • a metric that represents a single numerical value that can arbitrarily go up and down
  • gauges are typically used for measured values like temperatures or current memory usage, but also “counts” that can go up and down, like the number of concurrent requests
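
A similar hedged sketch for a gauge, which can move in both directions (again, names are illustrative):

import io.prometheus.client.Gauge;

public class JobMetrics {
    // a gauge tracks a value that can go up and down, e.g. work currently in flight
    static final Gauge JOBS_IN_FLIGHT = Gauge.build()
            .name("jobs_in_flight")
            .help("Number of jobs currently being processed.")
            .register();

    public static void jobStarted()  { JOBS_IN_FLIGHT.inc(); }
    public static void jobFinished() { JOBS_IN_FLIGHT.dec(); }
}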

Histogram

  • samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values
  • a histogram with a base metric name of <basename> exposes multiple time series during a scrape:

    • cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
    • the total sum of all observed values, exposed as <basename>_sum
    • the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)
  • use the histogram_quantile() function to calculate quantiles from histograms or even aggregations of histograms across instances
  • when operating on buckets, remember that the histogram is cumulative
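
A minimal sketch of a histogram timer with the same Java client, producing the _bucket, _sum and _count series described above (bucket boundaries and names are illustrative):

import io.prometheus.client.Histogram;

public class LatencyMetrics {
    // buckets are cumulative upper bounds, in seconds
    static final Histogram REQUEST_DURATION = Histogram.build()
            .name("http_request_duration_seconds")
            .help("A histogram of the request duration.")
            .buckets(0.05, 0.1, 0.2, 0.5, 1)
            .register();

    public static void timeRequest(Runnable handler) {
        Histogram.Timer timer = REQUEST_DURATION.startTimer();
        try {
            handler.run();
        } finally {
            timer.observeDuration(); // records into _bucket, _sum and _count
        }
    }
}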

Summary

  • similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window
  • a summary with a base metric name of <basename> exposes multiple time series during a scrape:
    • streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as <basename>{quantile="<φ>"}
    • the total sum of all observed values, exposed as <basename>_sum
    • the count of events that have been observed, exposed as <basename>_count
  • https://prometheus.io/docs/practices/histograms/
    • if you need to aggregate, choose histograms.
    • otherwise, choose a histogram if you have an idea of the range and distribution of values that will be observed. Choose a summary if you need an accurate quantile, no matter what the range and distribution of the values is.
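
And a hedged sketch of a summary with client-side quantiles (the quantiles and error tolerances chosen here are assumptions):

import io.prometheus.client.Summary;

public class RpcMetrics {
    // quantiles are calculated in the client over a sliding time window
    static final Summary RPC_DURATION = Summary.build()
            .name("rpc_duration_seconds")
            .help("A summary of the RPC duration in seconds.")
            .quantile(0.5, 0.05)   // median, 5% tolerated error
            .quantile(0.99, 0.001) // 99th percentile, 0.1% tolerated error
            .register();

    public static void recordRpc(double seconds) {
        RPC_DURATION.observe(seconds);
    }
}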

How to Dynamically Change Log Levels at Runtime with Spring Boot

As we can all probably agree, logging is a vital part of any application. It provides visibility into what our component is doing at any point in time and is an invaluable tool when trying to debug issues. When faced with a problem in production, likely one of the first things that will be done is to check the logs for any errors and to hopefully locate the source of the issue. It’s therefore vital that our logs are not only relevant and useful (perhaps a topic for another day), but are also visible and accessible when we need them the most. Hopefully we will be lucky and the logs will include all the necessary details to pinpoint the issue - or maybe they won’t. This may be decided by the level at which the loggers are configured.

By default, most if not all of our loggers will probably be set to INFO level. This is fine when everything is working correctly - including only the key operations/tasks and not creating too much noise as to overload the logging tools - but perhaps not so great when you need to work an issue. Although the developers may have included some basic INFO level log output within the problematic area, this may not include enough data to properly trace through an erroneous transaction. The same developer may have also added some additional DEBUG log lines (hopefully, if not they should do) that give us some additional details into what was being done, but these may not be visible to us when the application is deployed within an environment.

I bet most people reading this can remember that instance in which they were debugging an issue, were able to focus down on a particular segment of code, only to find that the log line that gave them that critical piece of information was set to DEBUG level and so was unavailable to them. We may have 6 different log levels available, but without finer control over their configuration at runtime, we may as well only have half that.

Spring Boot Logging Properties

Log configuration within Spring Boot applications takes a number of different forms these days, but in this post we’ll focus just on the main ones (excluding specific logback.xml modifications etc). A lot of the same configuration changes you would previously have made in those dedicated files can now also be set as simple application properties. For example, to change the global root log level to DEBUG, we can just add the following to application.properties (or YAML equivalent):

logging.level.root=DEBUG

This however has similar problems to the conventional logback.xml approach - to update these configuration settings you would have to rebuild and redeploy your entire application - not something that's plausible in production. The difference here is that these are all just standard Spring Boot properties, so in much the same way as you would elsewhere, you can set them as environment variables or pass them in as JVM options:

-Dlogging.level.root=DEBUG

This gets us somewhat closer to where we want to be - we can now control the log level without redeploying - but it also introduces another problem. This enables DEBUG log output for the entire application. If you’ve ever done this before, even in a basic Spring application, you will know that the output is extremely noisy:

  • At default INFO level a simple Spring Boot app generates 14 lines of log output at startup
  • At DEBUG level the same application generates over 2440 log lines - just at startup!
DEBUG 6852 --- [           main] o.s.b.f.s.DefaultListableBeanFactory     : Creating shared instance of singleton bean 'org.springframework.boot.autoconfigure.web.servlet.MultipartAutoConfiguration'
DEBUG 6852 --- [           main] o.s.b.f.s.DefaultListableBeanFactory     : Creating shared instance of singleton bean 'spring.servlet.multipart-org.springframework.boot.autoconfigure.web.servlet.MultipartProperties'
2021-01-30 17:36:25.708 DEBUG 6852 --- [           main] o.s.b.f.s.DefaultListableBeanFactory     : Autowiring by type from bean name 'org.springframework.boot.autoconfigure.web.servlet.MultipartAutoConfiguration' via constructor to bean named 'spring.servlet.multipart-org.springframework.boot.autoconfigure.web.servlet.MultipartProperties'
DEBUG 6852 --- [1)-192.168.1.22] sun.rmi.transport.tcp                    : RMI TCP Connection(1)-192.168.1.22: (port 65235) op = 80
DEBUG 6852 --- [1)-192.168.1.22] javax.management.remote.rmi              : [javax.management.remote.rmi.RMIConnectionImpl@3440316d: connectionId=rmi://192.168.1.22  1] closing.
DEBUG 6852 --- [1)-192.168.1.22] javax.management.remote.rmi              : [javax.management.remote.rmi.RMIConnectionImpl@3440316d: connectionId=rmi://192.168.1.22  1] closed.
DEBUG 6852 --- [           main] o.s.b.f.s.DefaultListableBeanFactory     : Autowiring by type from bean name 'errorPageCustomizer' via factory method to bean named 'dispatcherServletRegistration'
DEBUG 6852 --- [           main] o.apache.tomcat.util.IntrospectionUtils  : IntrospectionUtils: setProperty(class org.apache.coyote.http11.Http11NioProtocol port=8080)

If you look at some of the output above you will notice that most of it is just noise and won't help you resolve any real issues. In fact, it may hinder you when trying to search through all the volume just to find those few useful lines.

In reality setting the root log level to DEBUG is not something you really want to do unless you feel some need to closely inspect the Spring startup sequence. It belongs at the default INFO level where the frameworks and other libraries won't overwhelm you with data. What you actually probably wanted was to set the new log level only for certain targeted areas of your application - be it a set of packages or even a particular class - in much the same way as you can in logback.xml. Spring Boot also allows you to use the same property syntax for this. For example, to enable DEBUG output only in our service packages or specifically within our WidgetService class, we can instead add the following to application.properties:

logging.level.com.example.demo.service=DEBUG
logging.level.com.example.demo.service.WidgetService=DEBUG

Checking the log output again, you should see most of the noise from before disappear - leaving only our targeted package at the lower log level. Much more useful! You might notice that this really just has the same effect as updating the logback.xml file directly, and you would be correct. The big difference here however is that, when combined with the VM argument trick from before, you now have much finer-grained control over your loggers after your application is deployed.

INFO 8768  --- [nio-8080-exec-1] com.example.demo.service.WidgetService   : Finding widgets for bob
DEBUG 8768 --- [nio-8080-exec-1] com.example.demo.service.WidgetService   : Background working happening to find widgets..
DEBUG 8768 --- [nio-8080-exec-1] com.example.demo.service.WidgetService   : Found user bill, returning widget 2

This is another improvement, but still has one big problem - you need to restart your application for the changes to be applied. This might be ok for some cases, but if, for example, the underlying issue was due to some particular bad state of a local cache, restarting would reset the component, hide the underlying cause and make it much harder to reproduce further. Depending on the issue it may be possible to get your application back into the same state again after rehydrating caches etc, but sometimes "turn it off and on again" just hides the larger underlying problem.

As a last improvement, it would be great if you could dynamically change the log levels for particular targeted areas at runtime - without having to rebuild, redeploy or restart.

Spring Boot Actuator

Actuator, amongst other things, allows you to do just this through one of its administration endpoints. To enable this functionality we first need to add the starter dependency to the build.gradle file:

implementation 'org.springframework.boot:spring-boot-starter-actuator'

By default actuator doesn't expose any of its admin endpoints, so we also need to explicitly enable the loggers feature (and others if you need to). Add the following into application.properties:

management.endpoints.web.exposure.include=loggers

If we now visit http://localhost:8080/actuator in the browser (or as a GET request from elsewhere) you should see that just the logger endpoints are now enabled:

{
    "_links": {
        "self": {
            "href": "http://localhost:8080/actuator",
            "templated": false
        },
        "loggers": {
            "href": "http://localhost:8080/actuator/loggers",
            "templated": false
        },
        "loggers-name": {
            "href": "http://localhost:8080/actuator/loggers/{name}",
            "templated": true
        }
    }
}

Actuator hooks into the log system used within your app (be it Logback or Log4j) and allows you to interact with these endpoints as an API to query and modify the underlying log levels.

Viewing Current Log Levels

First of all visit http://localhost:8080/actuator/loggers (via GET request) to see all the loggers configured within your application and their current levels (it will likely be quite a large list):

{
    "levels": ["OFF", "ERROR", "WARN", "INFO", "DEBUG", "TRACE"],
    "loggers": {
        "ROOT": {
            "configuredLevel": "INFO",
            "effectiveLevel": "INFO"
        },
        "com.example.demo.DemoApplication": {
            "configuredLevel": null,
            "effectiveLevel": "INFO"
        },
        "com.example.demo.service": {
            "configuredLevel": null,
            "effectiveLevel": "INFO"
        },
        "com.example.demo.service.WidgetService": {
            "configuredLevel": null,
            "effectiveLevel": "INFO"
        }
    }
}

In the above extract of the output we can see all the available levels, the current level for each area of our application and finally whether or not a custom level has been applied. These are all null for now since we are yet to override anything. This is not too helpful, but does help in identifying the logger names that we can later customize.

If you want to explicitly target a specific logger, you can add the name to the path (as described in the first actuator output). For example http://localhost:8080/actuator/loggers/com.example.demo.service will return:

{
    "configuredLevel": null,
    "effectiveLevel": "INFO"
}

Modifying Log Levels

In the above examples we have been using simple GET requests to query the current log configuration. The /actuator/loggers/{name} endpoint however also lets you send a POST request that allows you to update the configured level for a particular logger. For example, to change our service loggers to DEBUG level, send a POST request to http://localhost:8080/actuator/loggers/com.example.demo.service with the JSON body:

{
    "configuredLevel": "DEBUG"
}

The corresponding cURL command would be (note the Content-Type header needs to be set as the payload is a JSON object):

curl -i -X POST -H 'Content-Type: application/json' \
  -d '{"configuredLevel": "DEBUG"}' \
  http://localhost:8080/actuator/loggers/com.example.demo.service

HTTP/1.1 204

If successful, the API will return a 204 No Content response. Checking the application logs again after some additional calls were made to the service class, you should see the same DEBUG log output as before, whilst all other output remains at the default INFO level:

INFO 10848  --- [nio-8080-exec-2] com.example.demo.service.WidgetService   : Finding widgets for bob
DEBUG 10848 --- [nio-8080-exec-2] com.example.demo.service.WidgetService   : Background working happening to find widgets..
DEBUG 10848 --- [nio-8080-exec-2] com.example.demo.service.WidgetService   : Found user bill, returning widget 2

To confirm the update, we can also try querying actuator again with the same logger name to view the updated configuration - GET http://localhost:8080/actuator/loggers/com.example.demo.service:

{
    "configuredLevel": "DEBUG",
    "effectiveLevel": "DEBUG"
}

Pretty cool! This gives you a lot of flexibility at runtime to better utilize your logs at different levels to debug and resolve issues. If you so wanted to, you can also target the ROOT logger at http://localhost:8080/actuator/loggers/ROOT, but of course be aware of the potential noise.
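
Once you are done debugging, the same endpoint can be used to revert a logger back to its default configuration by passing a null configuredLevel:

curl -i -X POST -H 'Content-Type: application/json' \
  -d '{"configuredLevel": null}' \
  http://localhost:8080/actuator/loggers/com.example.demo.service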

Takeaways (TL;DR)

  • Logs are a vital tool when debugging issues, but only if you can see the right lines when you need them. These might not be at INFO level.
  • Developers should be using the various log levels TRACE, DEBUG, INFO, ERROR accordingly to add additional data points to your log output. In general volume should increase as the level decreases, but more detailed data points will be included.
  • The root logger should be kept at INFO level. Turning on DEBUG logs for our entire application will generate too much noise and will overwhelm both us and our log tooling.
  • Use Spring Boot properties to set specific log levels for particular packages/classes. Pass these in as runtime JVM options for greater flexibility. Note that you will have to restart the app for them to take effect.
  • Spring Boot Actuator gives the most fine-grained control - allowing you to both query and update log levels at runtime through its admin endpoints.



30 Useful Java Libraries

A collection of some interesting Java libraries and frameworks (not including the big ones everyone uses such as Spring or JUnit). Hopefully it contains some lesser known libraries which you might not have heard of before and may find useful.

Name Description
ArchUnit Write unit tests to enforce the architecture of your Java code. E.g. all your service classes follow established conventions for naming and package layout through to finding cyclic dependencies.
AssertJ Forget Hamcrest or those provided by JUnit, AssertJ is the only library you will need for writing assertions. Provides countless rich integrations which are easily discoverable through auto-completion.
Awaitility Testing asynchronous systems can be challenging. Awaitility provides a simple DSL that allows you to define the expectations of async operations within tests - without having to deal with threads, timeouts and other concurrency issues.
Caffeine A high performance, near optimal caching library with a modern lambda-based API. Can be easily integrated into Spring applications through its standard caching abstractions.
Eclipse Collections Perhaps not needed in all cases, but can give much improved GC times and overall performance over the standard Java collections in some high throughput pipelines. Also includes immutable versions for better safety in parallel workloads.
Failsafe A lightweight, zero-dependency library for handling all kinds of application failures. Provides a concise API that allows you to wrap executable logic within a number of resilience policies such as Retry, Timeout, Fallback and CircuitBreaker.
FastExcel Apache POI has numerous features for interacting with Excel files, but the API is cumbersome and resource consumption can be a problem for larger files. FastExcel provides a much simpler API to generate and read big Excel files quickly.
Guava Numerous core Java libraries from Google including additional collection types and utilities for concurrency, I/O, hashing, caching and strings. Perhaps becoming less necessary now, but still a great extension over the Java standard library.
Handlebars Simple, logic-less templating engine when you want something a bit more lightweight than Freemarker or Thymeleaf.
HikariCP The best JDBC connection pool out there - fast, simple, reliable and with “zero-overhead”. The default in newer Spring Boot based applications.
Immutables For those that prefer not to use Lombok, but still like all the benefits of immutable classes. Immutables is an annotation processor that generates simple, fast and consistent value objects (data classes) without all the traditional boilerplate.
Faker Generate all kinds of fake data (names, addresses, emails and many more). Really useful when you need to create some throwaway data as part of tests or quick proof-of-concepts/demos.
Jimfs An in-memory file system for Java from Google. Great for unit testing when you don’t want to/can’t mock the underlying filesystem (usually due to static methods from java.nio.Files etc).
Jib Build optimised Docker images from your Java applications directly from Maven/Gradle, without the need for CLI dependencies or Dockerfiles. Jib separates your app into multiple layers (splitting dependencies from classes) for much quicker rebuilds.
jOOR The Java Reflection API is powerful, but often cumbersome to use. jOOR provides a simple, fluent wrapper that gives much more intuitive access to the standard meta information and class structures.
Lombok A set of annotations that helps remove all the typical boilerplate and verbosity in Java applications. Easily auto-generate data classes with constructors, getters/setters, toString, equals/hashcode, builders, loggers and more.
Micrometer A vendor-neutral application metrics facade (think SLF4J, but for metrics). Support for dimensional metrics and Spring Boot Actuator offers a number of useful out-of-the-box integrations.
Mug A small set of utilities over Java 8 Streams and Optionals. Also includes a functional style Maybe implementation.
Picocli A user-friendly framework for building command-line based apps with the JVM. Supports autocompletion, colours, subcommands and is GraalVM compatible.
Resilience4J A lightweight fault tolerance library for Java 8+ and a successor for Netflix Hystrix. Provides decorators for Circuit Breakers, Rate Limiters, Retries and Bulkheads. Can also be used as a circuit breaker implementation within Spring Cloud Circuit Breaker.
Rest Assured A powerful DSL for writing tests over your RESTful APIs.
Retrofit A lightweight and type-safe HTTP client for the JVM. A common go-to within Android applications to interact with external services.
ShedLock A distributed lock that makes sure your scheduled tasks are executed at most once at the same time. Uses an external store like Redis, a JDBC database or Zookeeper for coordination.
StreamEx Many enhancements over the standard Java Stream API that make common tasks shorter and more convenient. Fully compatible with the built-in Java 8 Stream classes, but provides many additional useful methods.
TestContainers Write proper integration tests that use Docker to spin-up throwaway instances of databases, queues, browsers and more. Integrates with JUnit and doesn’t require any extra complex configuration. No longer feel the need to mock out external services!
ThreeTen Extra Additional date-time classes that complement those already in Java 8 (MutableClock, LocalDateRange etc). Curated by the primary author of the new java.time API.
Vavr An essential for all the functional programmers out there. Provides all the essential data types and functional control structures alongside a number of extra modules for integrations with other libraries and frameworks.
Verbal Expressions A library that helps to construct difficult regular expressions with a fluent and human-readable API.
WireMock A library for stubbing and mocking HTTP services. Allows you to construct a standalone local web server that can be used within integration tests to verify expected behaviour against external APIs. Also integrates with Spring Cloud Contract.
Yavi A lambda-based, type-safe validation library with no reflection, annotations or dependencies. For those who are not fans of using the Bean Validation frameworks.

For even more Java libraries check out https://github.com/akullpp/awesome-java


Gathering Metrics with Micrometer and Spring Boot Actuator

Note: Also check out this follow-up post which covers how to query and create reusable dashboards from your metrics: Aggregating and Visualizing Spring Boot Metrics with Prometheus and Grafana

Why Metrics?

Metrics, alongside tracing and logging, form the concept of observability - one of the key cornerstones of DevOps - so hopefully it should be of no surprise to anyone of its importance. As we build larger and more distributed platforms, maintaining sufficient visibility into what the system is doing, when it's doing it and how well it's performing becomes more difficult, but also more vital. We can likely no longer just directly query the application for these measurements. Instead, we require aggregations across all our service instances, as well as the ability to drill-down into particular aspects of a metric to gain the most insight.

The usefulness of metrics also goes well past just capturing overall system health and resource usage. We can still observe common things like memory and CPU usage, garbage collection and thread utilization metrics - from the SRE perspective these can still go a long way to monitor and manage a production system - but we should also be looking to go much further than just machine level statistics. Well-defined business level metrics, collected automatically by our applications in real-time, are a prerequisite both to our ability to define and measure against SLAs/SLOs, and to our ability to make well informed, data-driven decisions as part of the wider Agile process.

Micrometer

Libraries and frameworks for gathering metrics are nothing new in the Java world. Things like Dropwizard have been around for many years now, but more recently Micrometer has taken a more prominent position - and for good reason. In a lot of ways it supersedes the alternatives.

Micrometer provides a simple facade over the instrumentation clients for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in. Think SLF4J, but for metrics.

The above description is straight from the Micrometer website and I think the SLF4J analogy sums it up quite well. Just like for logging where we use the core API from SLF4J and then pick and choose whichever concrete logging library we want (Logback, Log4J), the same applies here, but for metrics. The core Micrometer library provides the API’s and interfaces for defining timers, gauges, counters (and more), but out-of-the-box doesn’t do too much more than just store them in memory. After you have defined your metrics, you can then plug-in connectors for a number of external monitoring systems or time-series databases - Azure Monitor, Netflix Atlas, Datadog, Graphite, InfluxDB, Prometheus or even just simple JMX. The main facade stays intact, which then quite rightly decouples the collection of metrics from any/all external consumers.

A nice and modern API is all well and good, but perhaps the larger points to mention here are its support for dimensional metrics (see below) and the first-party support from Spring. From Spring Boot 2.0 (1.x used Dropwizard) each component is integrated with Micrometer and will automatically publish metrics out-of-the-box (again, more on this below).

Hierarchical vs. Dimensional Metrics

Traditionally, metrics have really been not much more than just simple key/value pairs - where the key is usually some dot-separated string namespaced into a particular application area and the value being a simple numerical measurement. This is what’s known as a hierarchical metric, where as the key gets longer, the measurement gets more specific. This used to be the way that most libraries captured metrics and indeed is how Dropwizard and Spring Boot 1.x function. Hierarchical metrics do however have a large limitation - consider the below keys for capturing metrics for HTTP endpoints:

http.requests
http.requests.getwidgets
http.requests.getwidgets.region
http.requests.getwidgets.region.200
http.requests.getwidgets.region.200.username
http.requests.getwidgets.region.200.username.duration
...

From the top metric we measure the most general aggregate - total number of HTTP requests served - and then get gradually more specific by introducing additional fields. These tend to make the metric more insightful, as we are likely going to need to drill-down to a specific region/datacentre or endpoint name and then find the total number of 200 or 500 responses for example. Hopefully you can see the issue here - as we add more fields, the number of metrics we gather starts to grow and grow. The above example is just for a single endpoint, username and status code, they can quickly start to multiply as your system is used. This is what’s known as a cardinality explosion and something to be careful of in applications using hierarchical metrics. It can increase resource consumption and cause pressure on the external systems trying to consume and report on them.

The case for hierarchical metrics gets worse though as the final measurement is tied to the full key. This is fine if it represents exactly what you’re looking for - in the above example, perhaps reporting on the total number of 200 responses for the getwidgets endpoint in a particular region is helpful - but what if you need a different combination?

  • total number of 200 responses served (not just getwidgets)
  • total number of requests made by a particular username (independent of endpoint or status code)
  • overall duration spent handling getwidgets requests (independent of status code or region)

With this set of metrics the above questions are not trivial to answer. You would either be forced to create even more metrics specific to these cases (making the aforementioned cardinality explosion worse), or be forced to perform complex extra aggregation steps in your reporting tool of choice.

The idea of dimensional metrics aims to solve these issues and overall make your metrics considerably more powerful. Instead of including fields directly in the key, these are instead added as tags to a more general base metric. The equivalent for the above hierarchical metrics becomes:

http.requests
	endpoint: getwidgets
	method: GET
	status: 200
	username: ryan
	region: rr

This kind of structure is supported out-of-the-box by Micrometer, whereby each tag (or dimension) is added as a parameter in each measurement. The connectors also understand how to send these metrics down to your monitoring system of choice, be it Prometheus which supports dimensional metrics, or just JMX (in which case the dimensional metrics will be flattened out).

The real benefit of dimensional metrics comes into play when partnered with a time series database (such as Prometheus). This allows for the ability to drill-down across dimensions at will, without introducing any of the downsides of hierarchical metrics. Going back to the example above, you can add specific predicates on any number of dimensions and get out the expected aggregation directly. Answering a complex query like “give me all the 500 responses from the getwidgets endpoint, across all regions and from users X,Y and Z from the last 24hrs and compare against the same data from yesterday” becomes trivial.
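
As a rough sketch of what that looks like in practice (assuming the dimensional metric above is stored in Prometheus under the name http_requests_total with those same tags), such a question maps to a fairly direct PromQL query:

# 500 responses from the getwidgets endpoint for users X, Y and Z over the last 24h, per region
sum by (region) (
  increase(http_requests_total{endpoint="getwidgets", status="500", username=~"X|Y|Z"}[24h])
)

# the same data from the previous day, for comparison
sum by (region) (
  increase(http_requests_total{endpoint="getwidgets", status="500", username=~"X|Y|Z"}[24h] offset 1d)
)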

Spring Integration

As I mentioned before, Micrometer is integrated with Spring Boot Actuator and so requires just the single dependency to get running:

implementation 'org.springframework.boot:spring-boot-starter-actuator'

The Spring Boot Actuator starter dependency does a number of useful things which I’ll cover in future posts, but for now we’ll focus just on the metrics support. By default, Spring configures bindings to begin automatically publishing core metrics across many areas:

  • JVM - memory, buffer pools, thread utilization, classes loaded
  • CPU and File Descriptor usage
  • Logback - number of events logged at each level
  • Spring MVC - HTTP requests/responses and errors
  • HTTP Clients - instrumentation on RestTemplate or WebClient instances
  • Kafka - native metrics on all configured producers/consumers
  • Data Sources - HikariCP bindings for connection pool size and usage
  • Caches - Caffeine/JCache bindings for hit/miss counts, sizes

Hopefully you will agree that this is a fairly comprehensive list already for out-of-the-box behaviour (dependent just on which beans you have created in your app), and this is before you have created any of your own custom metrics.

Publishing Metrics

To try these features out we’ll create a quick contrived demo app to test some of the integration areas. First of all the main Spring app config:

@EnableCaching
@Configuration
public class AppConfig {

    @Value("${spring.application.name}")
    private String applicationName;

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
        return registry -> registry.config().commonTags("region", "us-east-1", "application", applicationName);
    }

    @Bean
    public CacheManager cacheManager() {
        CaffeineCacheManager cacheManager = new CaffeineCacheManager();
        cacheManager.setCaffeine(Caffeine.newBuilder().recordStats());
        cacheManager.setCacheNames(Collections.singletonList("widgetCache"));
        return cacheManager;
    }

    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder restTemplateBuilder) {
        return restTemplateBuilder.build();
    }

}

There are a few things going on here:

  • we use a Spring MeterRegistryCustomizer bean to add a common set of tags to all metrics that get published. This is useful to attach instance level details, such as application name and region/datacentre that the service is running in.
  • we create a CacheManager bean using Caffeine as the underlying caching mechanism. The recordStats() portion instructs Caffeine to record metrics that Micrometer can then bind to (also make sure to @EnableCaching).
  • we create a standard RestTemplate bean, making sure to use the RestTemplateBuilder so that the metrics instrumentation gets added.

Next is a simple @Service class to give something for our endpoints to call:

@Service
@RequiredArgsConstructor
public class WidgetService {

    private static final Logger log = LoggerFactory.getLogger(WidgetService.class);
    private final RestTemplate restTemplate;

    @Cacheable("widgetCache")
    public String getWidgets(String user) {
        log.info("Finding widgets for {}", user);
        if(user.equals("bill")) return "widget 1";
        if(user.equals("bob")) return "widget 2";
        else {
            log.error("Could not find widgets for {}", user);
            throw new IllegalArgumentException("Unknown user");
        }
    }

    public String saveWidget(String input) {
        ParameterizedTypeReference<Map<String, String>> responseType = new ParameterizedTypeReference<>() { };
        log.info("Saving widget with input: {}", input);
        ResponseEntity<Map<String, String>> responseEntity =
                restTemplate.exchange("https://httpbin.org/uuid?param={input}", HttpMethod.GET, null, responseType, input);
        log.info("Successfully saved widget!");
        return responseEntity.getBody().get("uuid");
    }
}

Nothing too interesting here. There is a simple getWidgets method to return some static data based on our user input (which also uses our new cache manager). There is a path to generate an exception so we can ensure the corresponding metrics are created. We also make use of the RestTemplate to call a test endpoint from httpbin.org to generate a basic UUID (simulating some system latency).

Finally, there are two very simple endpoints that just call our service class (and generate some HTTP request metrics):

@RestController
@RequiredArgsConstructor
public class WidgetEndpoint {

    private final WidgetService widgetService;

    @GetMapping(value = "/widgets", produces = "text/plain")
    public ResponseEntity<String> getWidgets(@RequestParam("user") String user) {
        return ResponseEntity.ok(widgetService.getWidgets(user));
    }

    @PostMapping(value = "/widgets", produces = "text/plain")
    public ResponseEntity<String> saveWidget(@RequestParam("input") String input) {
        return ResponseEntity.ok(widgetService.saveWidget(input));
    }
}

If you build, run the app and hit the endpoints, they will work as expected, but we don’t yet have a way to confirm any metrics actually get created.

Spring Boot Actuator Metrics

The Spring Boot Actuator comes built-in with a number of endpoints we can use to manage our application. One of these allows you to view and query all the metrics generated by your application, but it’s disabled by default. Add the following to application.properties to enable it:

spring.application.name=widgetservice
management.endpoints.web.exposure.include=metrics

If you now visit http://localhost:8080/actuator/metrics in your browser, actuator will give you a list of all the top-level metrics (filtered below for brevity).

{
    "names": [
        "cache.gets",
        "cache.size",
        "http.server.requests",
        "jvm.buffer.memory.used",
        "jvm.classes.loaded",
        "jvm.gc.pause",
        "jvm.memory.max",
        "jvm.memory.used",
        "jvm.threads.live",
        "logback.events",
        "process.cpu.usage",
        "process.uptime",
        "system.cpu.usage",
        "tomcat.sessions.active.current",
        "tomcat.sessions.active.max"
    ]
}

Most of these should look familiar when compared with the list we saw before - they are metrics published automatically by our application without having to perform any real work. The actuator metrics endpoint also lets you view specific metrics: http://localhost:8080/actuator/metrics/jvm.memory.used

{
    "name": "jvm.memory.used",
    "description": "The amount of used memory",
    "baseUnit": "bytes",
    "measurements": [
        {
            "statistic": "VALUE",
            "value": 89722280
        }
    ],
    "availableTags": [
        {
            "tag": "area",
            "values": ["heap", "nonheap"]
        },
        {
            "tag": "application",
            "values": ["widgetservice"]
        },
        {
            "tag": "id",
            "values": [
                "G1 Old Gen",
                "CodeHeap 'non-profiled nmethods'",
                "G1 Survivor Space",
                "Compressed Class Space",
                "Metaspace",
                "G1 Eden Space",
                "CodeHeap 'non-nmethods'"
            ]
        },
        {
            "tag": "region",
            "values": ["us-east-1"]
        }
    ]
}

At the top we can see the aggregate measured value for total memory used and then we can also see a number of tags/dimensions associated with it (including the common set we added in the app config). This is where dimensional metrics get interesting, as you can now drill-down specifically into particular areas. For example, to view just heap memory usage: http://localhost:8080/actuator/metrics/jvm.memory.used?tag=area:heap

{
    "name": "jvm.memory.used",
    "description": "The amount of used memory",
    "baseUnit": "bytes",
    "measurements": [
        {
            "statistic": "VALUE",
            "value": 47795248
        }
    ],
    "availableTags": [
        {
            "tag": "id",
            "values": ["G1 Old Gen", "G1 Survivor Space", "G1 Eden Space"]
        }
    ]
}

This gives us a more specific measurement and a different set of tags that we could use to further drill-down into the metric e.g. http://localhost:8080/actuator/metrics/jvm.memory.used?tag=area:heap&tag=id:G1%20Eden%20Space to inspect the total memory used in the heap G1 Eden Space only.

Cache Metrics

Now to actually call some of our own code to generate some other metrics. We can use the getWidgets endpoint to make use of our cache: http://localhost:8080/widgets?user=bill. If we call that multiple times, we should be getting the result straight from the cache instead of calling the method itself. We can consult the cache metrics to confirm: http://localhost:8080/actuator/metrics/cache.gets?tag=cache:widgetCache&tag=result:hit

{
    "name": "cache.gets",
    "description": "The number of times cache lookup methods have returned a cached value.",
    "baseUnit": null,
    "measurements": [
        {
            "statistic": "COUNT",
            "value": 3
        }
    ],
    "availableTags": [
        {
            "tag": "name",
            "values": ["widgetCache"]
        }
    ]
}

Here we are using the cache.gets metric, with tag predicates on the cache name and result, in order to assess how well our cache is utilized. Similarly, you could also inspect the misses tag to generate the cache hit/miss ratio. You can also use the cache.size metric to observe trends in how many values are loaded into your caches: http://localhost:8080/actuator/metrics/cache.size?tag=name:widgetCache

HTTP Client Metrics

Next up we can call the saveWidget endpoint (a POST to http://localhost:8080/widgets?input=something) to make use of our instrumented RestTemplate to call an external service. If you call the /actuator/metrics endpoint again, you should see a new entry http.client.requests:

{
    "name": "http.client.requests",
    "description": "Timer of RestTemplate operation",
    "baseUnit": "seconds",
    "measurements": [
        {
            "statistic": "COUNT",
            "value": 1
        },
        {
            "statistic": "TOTAL_TIME",
            "value": 0.6493909
        },
        {
            "statistic": "MAX",
            "value": 0.6493909
        }
    ],
    "availableTags": [
        {
            "tag": "method",
            "values": ["GET"]
        },
        {
            "tag": "clientName",
            "values": ["httpbin.org"]
        },
        {
            "tag": "uri",
            "values": ["/uuid?param={input}"]
        },
        {
            "tag": "outcome",
            "values": ["SUCCESS"]
        },
        {
            "tag": "status",
            "values": ["200"]
        }
    ]
}

The previous metrics we've looked at were simple counters or gauges, but now we have a timer, which gives us more measurement values. Not only do you get the count, but also the total time taken (allowing you to compute averages) and a max observed value. The tags are also more interesting here - similar to the above you can easily drill-down to answer some interesting questions:

  • how many requests were made to client X in total within a certain time period?
  • how many requests were made to the /uuid endpoint on client X which failed?
  • how long did the application spend waiting for client X to respond to our calls?

HTTP Request Metrics

By now we should have produced enough traffic to generate some good HTTP request metrics. You could also call the getWidgets endpoint with an unknown user to generate an exception: http://localhost:8080/widgets?user=none. The http.server.requests metric is used to capture measurements for our HTTP endpoints:

{
    "name": "http.server.requests",
    "description": null,
    "baseUnit": "seconds",
    "measurements": [
        {
            "statistic": "COUNT",
            "value": 23
        },
        {
            "statistic": "TOTAL_TIME",
            "value": 0.7761524969999999
        },
        {
            "statistic": "MAX",
            "value": 0.0031665
        }
    ],
    "availableTags": [
        {
            "tag": "exception",
            "values": ["None", "IllegalArgumentException", "BadOperationRequestException"]
        },
        {
            "tag": "method",
            "values": ["GET", "POST"]
        },
        {
            "tag": "uri",
            "values": ["/saveWidget", "/getWidgets", "/actuator/metrics", "/**"]
        },
        {
            "tag": "outcome",
            "values": ["CLIENT_ERROR", "SUCCESS", "SERVER_ERROR"]
        },
        {
            "tag": "status",
            "values": ["404", "200", "400", "500"]
        }
    ]
}

Similarly to the http.client.requests metric, we have a timer meter and a number of useful tags to inspect:

  • how many requests resulted in 500 status codes within a certain time period?
  • how many times was a POST request made to the /widget resource?
  • how many requests made to the /widget resource resulted in IllegalArgumentException?
  • how long did it take to respond to all GET requests which resulted in 500 status codes?

Logging Metrics

A perhaps lesser known, but I think very useful metric is logback.events which keeps track of all lines that were logged and at what level. We can then query, for example, just for ERROR log lines generated over time: http://localhost:8080/actuator/metrics/logback.events?tag=level:error:

{
    "name": "logback.events",
    "description": "Number of error level events that made it to the logs",
    "baseUnit": "events",
    "measurements": [
        {
            "statistic": "COUNT",
            "value": 4
        }
    ]
}

DataSource Metrics

Finally, we have metrics produced by any JDBC DataSource instances available in our app. This can be done either through the Spring Boot spring.datasource auto-configuration, or by creating your own. HikariCP exposes a number of metrics over its connection pool, such as hikaricp.connections.usage and hikaricp.connections.idle. This can be helpful if you're trying to track down connection pool exhaustion issues in your applications.

Creating Custom Metrics

All the built-in Spring metrics are great, but you will also need to create and publish your own custom metrics. Spring creates a Micrometer MeterRegistry instance by default, that you can then inject and use to create custom counters, gauges and timers. As a very simple and contrived example, we can create some counters in our widget service:

private final MeterRegistry meterRegistry;

public String getWidgets(String user) {
    log.info("Finding widgets for {}", user);
    if(user.equals("bill")) {
        meterRegistry.counter("custom.widgets", "user", "bill").increment();
        return "widget 1";
    }
    if(user.equals("bob")) {
        meterRegistry.counter("custom.widgets", "user", "bob").increment();
        return "widget 2";
    }
    else {
        log.error("Could not find widgets for {}", user);
        throw new IllegalArgumentException("Unknown user");
    }
}

Here we use our injected MeterRegistry to create a new counter called custom.widgets (this will be the base name of your metric). We also provide a custom tag for our metric so we can then filter based on username. Note that you generally don't want to create tags where there can be too many possible values. In the example above, we know there can only ever be two, but with too many distinct values it can cause a cardinality explosion and resource usage issues.

Once the metric is published, you should be able to see it show up in actuator just like any other metric: http://localhost:8080/actuator/metrics/custom.widgets

{
    "name": "custom.widgets",
    "description": null,
    "baseUnit": null,
    "measurements": [
        {
            "statistic": "COUNT",
            "value": 2
        }
    ],
    "availableTags": [
        {
            "tag": "user",
            "values": ["bob", "bill"]
        }
    ]
}

Since it’s a counter meter, we just have the single COUNT measurement, alongside the custom user tag that we passed in. We could then drill down as needed into how many widgets a specific user has asked for (note that in this case the count stays at 2, since the method is annotated with @Cacheable and repeated calls never reach the counter).
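
Counters are only one meter type - gauges and timers can be registered against the same MeterRegistry. A rough sketch of both in the same widget service (the widgetCache field and findWidget method are hypothetical additions, not part of the code shown above):

private final MeterRegistry meterRegistry;
private final Map<String, String> widgetCache = new ConcurrentHashMap<>();

public WidgetService(MeterRegistry meterRegistry) {
    this.meterRegistry = meterRegistry;
    // gauge reporting the current size of the in-memory widget cache
    meterRegistry.gauge("custom.widgets.cache.size", widgetCache, Map::size);
}

public String findWidget(String user) {
    // manually timed lookup, recorded under custom.widgets.lookup with a user tag
    return meterRegistry.timer("custom.widgets.lookup", "user", user)
            .record(() -> widgetCache.getOrDefault(user, "none"));
}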

Finally, in Spring Boot we can also create a custom timer surrounding a method call using the @Timed annotation. You also need to create an aspect bean for Micrometer to recognise and instrument the methods:

@Bean
public TimedAspect timedAspect(MeterRegistry registry) {
    return new TimedAspect(registry);
}

@Timed
public String saveWidget(String input) {
    // existing save logic omitted - the aspect times the whole method invocation
    return input;
}

We can then view the timing data in the method.timed base metric, which can be filtered down by tags on class and method names: http://localhost:8080/actuator/metrics/method.timed?tag=method:saveWidget

{
    "name": "method.timed",
    "description": null,
    "baseUnit": "seconds",
    "measurements": [
        {
            "statistic": "COUNT",
            "value": 3
        },
        {
            "statistic": "TOTAL_TIME",
            "value": 1.0327454
        },
        {
            "statistic": "MAX",
            "value": 0.600250599
        }
    ],
    "availableTags": [
        {
            "tag": "exception",
            "values": ["none"]
        },
        {
            "tag": "class",
            "values": ["com.example.demo.WidgetService"]
        }
    ]
}

This just scratches the surface of the Micrometer API to define your own metrics. Take a look at the docs page https://micrometer.io/docs/concepts which goes into more detail about each available meter type.
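
For instance, a DistributionSummary (not used in the demo app above) records the distribution of an arbitrary value such as payload size - a quick sketch with an illustrative metric name:

// summary tracking the size of saved widget payloads in bytes
DistributionSummary widgetSizes = DistributionSummary.builder("custom.widgets.size")
        .baseUnit("bytes")
        .register(meterRegistry);

widgetSizes.record(42); // e.g. the byte length of an incoming payload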

Aggregation and Visualization

See this follow-up post for a complete look at visualizing Spring Boot metrics: Aggregating and Visualizing Spring Boot Metrics with Prometheus and Grafana

Capturing and publishing metrics is one thing, but in a multi-instance environment we also need some way of aggregating all the measurements and visualizing state across the whole platform. The built-in Spring Boot Actuator endpoint is useful as a quick admin tool, but generally we need to use the Micrometer connectors and other external monitoring tools to get the full benefits. There are a lot of options here, but Prometheus tends to be the go-to as a time series database and monitoring tool. For Spring based apps, you just need to add the appropriate Micrometer dependency and the rest is handled for you:

implementation 'io.micrometer:micrometer-registry-prometheus'

Actuator will then expose a dedicated /actuator/prometheus endpoint which can be called by Prometheus to gather the latest measurements (requires extra config not covered here).

There are generally two ways to get your metrics into a time series database - the push model or the pull model. In the push model, your application itself has to send all your metrics downstream. This might result in slightly more real-time measurements, but it also introduces issues if the monitoring tool becomes unavailable - your app may end up blocking threads trying to publish historical metrics to a system that is down or otherwise overwhelmed. It also introduces unneeded coupling between your application and the external monitoring tooling. In the pull model, which tools like Prometheus favour, your application exposes a simple and optimized endpoint for your metrics and it’s then the responsibility of the external tool to continuously poll for the latest measurements. Crucially, this minimises any negative impact on your application itself and avoids the coupling. It does however require a service discovery mechanism (such as Eureka or Zookeeper) to be configured in order for the tooling to pull metrics from each of the application instances.

Prometheus provides its own basic visualization tooling, but Grafana can instead be used as another layer in the stack to offer rich querying and graphing feature sets, alongside alerting and reporting. Prometheus still stores the raw data, but you can put together dashboards in Grafana, such as the one below for a generic Spring Boot app.

Grafana Spring Boot Dashboard

Takeaways (TL;DR)

  • metrics are an essential part of any application, not only in assessing health and stability, but also to make data-informed decisions on the future direction of your application
  • for JVM based apps, Micrometer is the way to go for metrics collection (especially when using Spring)
  • dimensional metrics are extremely powerful - and resolve longstanding issues with conventional hierarchical metrics
  • Spring Boot 2+ and Spring Actuator have great built-in support for metrics with Micrometer and out-of-the-box integration for a number of key areas
  • default application metrics should only form part of the overall picture and must be partnered with business level metrics - which are well-defined against established and agreed SLAs
  • an external time series database such as Prometheus should be used to aggregate metrics across all application instances and allow tools such as Grafana to provide dynamic visualisations on top of the underlying data.


The Importance of Integration Testing Part 1 - HTTP Endpoints

Bert is a new joiner within the team. As his first task he’s been assigned to create a simple endpoint to expose some existing data onto one of the front-end screens. His manager says that this should give him a gentle introduction to the area whilst also giving him the opportunity to gain familiarity with the codebase and SDLC processes. She mentions that he should keep in regular contact with Brenda, who will be handling the UI changes. Bert agrees - he’s already quite familiar with Spring Boot from his previous position, so creating a simple new endpoint in an existing service should be straightforward, he thinks to himself. It should look good if he can get this finished and signed off within a couple of days.

Bert clones the repo and notices that the service doesn’t yet expose any API endpoints (Spring @RestController) so he quickly creates a new one, using the existing service pointed out to him by his teammate Ben (simplified below).

@RestController
@RequiredArgsConstructor
public class SomeResultEndpoint {

  private final SomeResultService someResultService;

  @GetMapping(value = "/someResult")
  public SomeResult getSomeResult() {
    return someResultService.getSomeResult();
  }
}

During a quick catch-up, Ben informs Bert about the team’s core shared library, which includes various components that should automatically handle all the other key requirements for him - things like authentication, object mapping and error handling. Bert starts up the service locally, pointing to the dev database, hits the endpoint and can see data returned successfully. Content that everything looks to be working ok, Bert moves on to writing tests for the change. He knows from reading the team’s SDLC documentation that the pull request build will fail if it sees any drop in code coverage.

The Bad

Generally the first thing reached for in these situations is the trusty unit test - and that’s exactly what Bert does initially. In Java, tools like JUnit and Mockito (amongst many others) make this kind of change straightforward to unit test: just mock out the service dependency and ensure the controller behaves as expected. Bert comes up with the following simple test case:

class SomeResultEndpointTest {

  private SomeResultEndpoint someResultEndpoint;

  private SomeResultService someResultService;

  @BeforeEach
  void setUp() {
    someResultService = mock(SomeResultService.class);
    someResultEndpoint = new SomeResultEndpoint(someResultService);
  }

  @Test
  void getSomeResult() {
    SomeResult expected = TestUtils.getSomeResultData();
    when(someResultService.getSomeResult()).thenReturn(expected);
    SomeResult actual = someResultEndpoint.getSomeResult();
    assertThat(actual).isEqualTo(expected);
    verify(someResultService, times(1)).getSomeResult();
  }
}

Initially, Bert tried to construct some test SomeResult instances himself, but quickly realised that the data structure was complex and he was unfamiliar with what a real-world scenario would look like. Thankfully, Ben pointed him towards some existing helper methods, created alongside the original service, that looked to create some realistic data and populate most of the fields.

Bert ran the test suite and, as expected, everything passed successfully. But Bert had some doubts - what about the negative scenarios? What about the HTTP status codes? All of this was being handled by either Spring or the core shared library components. Bert created a simple negative test case, but then began to realise that there wasn’t really much more he could add here:

@Test
void getSomeResult_propagates_exception() {
  when(someResultService.getSomeResult()).thenThrow(new RuntimeException("error"));
  assertThrows(RuntimeException.class, () -> someResultEndpoint.getSomeResult());
}

Bert commits his change, pushes it and creates a pull request for the team to review. The build passes successfully and code coverage is 100% on Bert’s changes - great! The pull request gets approved, merged and deployed into the dev/qa environments. Bert pings Brenda (who is doing the UI changes) to let her know the endpoint is ready to integrate with.

The Ugly

The next day Bert gets a message from Brenda - “the service isn’t working properly”, she explains. “Every time I call it in QA I get a 500 error returned”. Bert quickly pulls up the logs and notices many exceptions being thrown from the endpoint - all of which seem to be related to Jackson when the response SomeResult is converted to JSON.

com.fasterxml.jackson.databind.JsonMappingException: Infinite recursion (StackOverflowError) (through reference chain: java.util.concurrent.ConcurrentHashMap[" "]->com.java.sample.OraganisationStructures$$EnhancerBySpringCGLIB$$99c7d84b["$$beanFactory"]->org.springframework.beans.factory.support.DefaultListableBeanFactory["forwardRef"])
    at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:706)
    at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:155)
    at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:704)
    at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:690)
    at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:155)

A quick search of the error indicated a circular reference issue when trying to serialize the SomeResult instance. Sure enough, the data structure contained self references. Ben explained that it was actually a linked-list type structure, useful for the existing processing but perhaps not ideal for an endpoint representation. "Why didn’t this issue come up when you were testing?", asked Ben. Some investigation later, Bert found that the dev database had old data, preventing such instances from ever being created when he ran it locally. The test utility methods did create this scenario, but the tests themselves had failed to pick up the issue. "I guess we have never tried converting that model to JSON before", Ben suggested. Bert quickly resolved the issue, pushed his changes and informed Brenda about the fix - noting that he didn’t have to change any of the tests as part of it.

The next day Bert gets another message from Brenda - "I’m now seeing some strange behaviour when unauthorized users call the endpoint. I expect the response to conform to our standard error model, but I just get a wall of text and a 500 error instead of the 401 status code". Again Bert checks the logs and sees new AuthorizationException stack traces coming from the shared library component which performs the authorization checks. This looks like expected behaviour, Bert ponders, so why doesn’t the response get mapped correctly? Ben points him towards the AuthExceptionMapper class in the shared library, which converts the exception to the common error model. After some debugging, Bert finds that the mapper had not been correctly configured in the service. Ben explains, "that mapper is still quite new - I guess it never got added since that service hasn’t exposed any endpoints before". Again Bert quickly fixes the issue, pushes his changes and informs Brenda - again noting that he did not have to change any of his test cases as part of the fix.

The Good

After these fixes, the new endpoint works as expected and Brenda is able to integrate successfully, but Bert is quite rightly less than satisfied. Not only did an otherwise straightforward change take much longer than it should have, but even now he cannot be confident that it works as expected in all scenarios, let alone six months down the line. Bert brings up the issue in the sprint retrospective, highlighting a number of areas that are not covered by the current test suites - even though the code coverage metrics might suggest they are:

  • JSON object marshalling - both in request and response bodies
  • URL formatting and HTTP request methods
  • usage of query and/or path parameters
  • any exception scenario requiring use of separate exception mappers or handlers (response format and HTTP status codes)
  • any additional functionality provided in controller advice, filters, converters etc.

The team agrees - clearly there is a need for automated integration tests for these endpoints in addition to the conventional unit tests. Thankfully, Spring comes with a number of built-in solutions, one being MockMvc, which the team ends up using. A typical problem for integration tests like this is the need to spin up a full Spring container - for most apps that might mean queue listeners being created, calls out to other services during startup and so on - not something that is ideal for a short-lived test suite. Bert suggests configuring the app in such a way that a "test" instance can be started (without all the external dependencies etc.) to make creation of a proper integration test suite much easier. But in the meantime, MockMvc combined with the @WebMvcTest annotation has a nice way around it:

Using this annotation will disable full auto-configuration and instead apply only configuration relevant to MVC tests (i.e. @Controller, @ControllerAdvice, @JsonComponent, Converter/GenericConverter, Filter, WebMvcConfigurer and HandlerMethodArgumentResolver beans but not @Component, @Service or @Repository beans).

Bert explains that this in effect gives you a cut-down Spring container which creates just the components required to support MVC/REST functionality. There is no need for the rest of your beans to be created - those can be mocked out as usual. Bert comes up with the following additional test cases for the original change:

@WebMvcTest(SomeResultEndpoint.class)
class SomeResultEndpointIntTest {

  @Autowired
  private MockMvc mockMvc;

  @MockBean
  private SomeResultService someResultService;

  @MockBean
  private AuthorizationService authorizationService;

  @Test
  void getSomeResult_succeeds() throws Exception {
    when(authorizationService.isAuthorized(anyString())).thenReturn(true);
    SomeResult expected = TestUtils.getSomeResultData();
    when(someResultService.getSomeResult()).thenReturn(expected);

    this.mockMvc.perform(get("/someResult"))
      .andExpect(status().isOk())
      .andExpect(content().string(equalTo(marshalObject(expected))));
  }

  @Test
  void getSomeResult_notfound() throws Exception {
    when(authorizationService.isAuthorized(anyString())).thenReturn(true);
    when(someResultService.getSomeResult()).thenReturn(null);

    mockMvc.perform(get("/someResult"))
      .andExpect(status().isNotFound());
  }

  @Test
  void getSomeResult_unauthorized() throws Exception {
    when(authorizationService.isAuthorized(anyString())).thenReturn(false);
    SomeResult expected = TestUtils.getSomeResultData();
    when(someResultService.getSomeResult()).thenReturn(expected);

    mockMvc.perform(get("/someResult"))
      .andExpect(status().isUnauthorized());
  }
}

The above are three very simple test cases, but crucially provide coverage in a number of key areas:

  • we have the correct URL and are able to respond to GET requests over HTTP
  • the response can be successfully serialized to the required response format and matches the expected output (JSON in this case, but it could be anything)
  • exceptions are handled as expected, the HTTP response status codes are correct

Bert highlights that Spring has a number of other methods of handling the above - @SpringBootTest if you want to really start up the full application (combine with @ActiveProfiles), alongside utils like TestRestTemplate - but the team agrees that even just the above is a vast improvement.
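
A rough sketch of that fuller approach might look something like the following (the class name and "test" profile are illustrative; it reuses the mocks and TestUtils helper from the examples above):

// illustrative full-container test - boots the whole app on a random port
@ActiveProfiles("test")
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
class SomeResultEndpointFullTest {

  @Autowired
  private TestRestTemplate restTemplate;

  @MockBean
  private SomeResultService someResultService;

  @MockBean
  private AuthorizationService authorizationService;

  @Test
  void getSomeResult_succeeds_over_real_http() {
    when(authorizationService.isAuthorized(anyString())).thenReturn(true);
    when(someResultService.getSomeResult()).thenReturn(TestUtils.getSomeResultData());

    ResponseEntity<SomeResult> response = restTemplate.getForEntity("/someResult", SomeResult.class);

    assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
    assertThat(response.getBody()).isNotNull();
  }
}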

Takeaways (TL;DR)

The example above is somewhat contrived, but these scenarios are not unrealistic at all. There can easily be large areas of your application that your tests don’t actually cover - most likely parts that are deeply reliant on ‘magic’ from your framework of choice and/or that require (at least part of) your application to be running in order to be tested. How does your code integrate with the framework or other libraries? Is your configuration correct? You need something more than unit tests for this kind of thing, and you want to know about such issues as early as possible.

  • Unit tests are great, but not enough on their own
  • Code coverage metrics will lie to you
  • Making any change/fix that doesn’t require you to also modify a test is a red flag
  • We rely more and more on the ‘magic from the framework’, but how do we test it? How do we know we’ve configured it correctly?
  • Beyond the core business logic (unit tests), every HTTP endpoint should have automated tests covering URL descriptors, methods, object marshalling, exception handling etc.
  • Spring MockMvc is a neat way of producing integration tests for your endpoints, but it is not the only solution available