
Metrics, Metrics, Metrics

I have been a software engineer for a long time. In every project I have worked on, statistics have been important. I won’t get into the difficulty of actually interpreting the statistics; instead, this post is about collection.

The very first metrics implementation I did was before time-series databases were popular, and we simply wrote the metrics out to a flat file. Our dimensionality was small, but our cardinality was high. For example, our metrics all looked like this:

<timestamp>:<metric_name>:<user_id|"anon">:<value>\n

Usually the value was 1, indicating simply that this was a counter to increment. Given that you always need a time and a name for the series, we only had one other dimension, the user_id. However, this is super high cardinality, as there is one value for every user.
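
A minimal sketch of what emitting one of these lines might have looked like (the helper name and file path here are hypothetical):

    import time

    METRICS_FILE = "/var/log/app/metrics.log"  # hypothetical location

    def record_metric(name, user_id=None, value=1):
        """Append one <timestamp>:<metric_name>:<user_id|"anon">:<value> line."""
        line = f"{int(time.time())}:{name}:{user_id or 'anon'}:{value}\n"
        with open(METRICS_FILE, "a") as f:
            f.write(line)

    record_metric("page_view", user_id="user42")  # counter increment for a user
    record_metric("page_view")                    # anonymous hit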

This was basic but actually served us very well. It produced a lot of data, but that was fine: we could ask questions we hadn’t thought of up front by running back through past data, and we could do a lot of incremental processing by time-bucketing. We could also group users and produce cohort-level stats.
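
The time-bucketing was a small script over that file; a rough sketch, assuming the line format above:

    from collections import Counter

    BUCKET_SECONDS = 3600  # one-hour buckets

    def hourly_counts(path):
        """Sum values per (hour, metric_name) from the flat metrics file."""
        counts = Counter()
        with open(path) as f:
            for line in f:
                ts, name, user_id, value = line.rstrip("\n").split(":")
                bucket = int(ts) // BUCKET_SECONDS * BUCKET_SECONDS
                counts[(bucket, name)] += int(value)
        return counts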

But these were just application metrics and not really monitoring data: very much focused on user behaviour rather than the behaviour of the application itself.

More recently I have become interested in metrics around application performance rather than user behaviour.

This model was not flexible enough for that, and solutions like Graphite seemed to have it covered, so Graphite was my first foray into this field. And it failed completely. My fault really: I just ended up with such high cardinality that it couldn’t cope. So I looked for a solution that would allow me to slice my data as I wanted.

We were using an ELK stack for logging, so I hijacked that, and it worked surprisingly well. I simply emitted a log line with type = metric and used Kibana to create dashboards and stats. We didn’t have quite the same offline re-processing of all the data, although it was technically possible, but we did have enough performance to aggregate useful application metrics.
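
The trick is really just structured logging; a sketch of the idea, with illustrative field names rather than the exact schema we used:

    import json
    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("metrics")

    def emit_metric(name, value, **labels):
        """Emit a JSON log line that Logstash can index like any other log."""
        logger.info(json.dumps({
            "type": "metric",          # lets Kibana filter metrics from normal logs
            "@timestamp": time.time(),
            "name": name,
            "value": value,
            **labels,
        }))

    emit_metric("api_latency_ms", 42.7, endpoint="/users", user_id="user42")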

Then time-series databases got popular, and InfluxDB, Prometheus and OpenTSDB all got a lot of press. I toyed with all three, decided I liked the design of Prometheus the most, and started to export some metrics.

Now, Prometheus uses a pull model for retrieving metrics from hosts. This is the opposite of most systems, where the clients push their data to the metrics server; instead, Prometheus pulls metrics out of the clients.

This is a fantastic design for a number of reasons, the most important to me being developer friendliness. While developing, I don’t need a metrics server; the client doesn’t care whether metrics are collected, it simply enables them to be collected. This means I can hit an HTTP endpoint and my metrics are there for the checking. And with no extra configuration, a new deployment is found by the configured Prometheus server via its service discovery rules, and the metrics are collected from it.
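
With the official Python client, for instance, instrumenting a process is little more than the following (a sketch; the port and metric name are arbitrary):

    import random
    import time

    from prometheus_client import Counter, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled")

    if __name__ == "__main__":
        start_http_server(8000)    # exposes /metrics for Prometheus to scrape
        while True:                # stand-in for real application work
            REQUESTS.inc()
            time.sleep(random.random())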

However, it has a fatal drawback, which I believe is a direct consequence of this design decision: it cannot accurately store point-event measurements. By this I mean any metric where each reading is unrelated to any other and the readings may be dense or sparse in time. For example, timing API call latency – each call is orthogonal to any other, and the density of readings is simply however often the API is called.

It is common to want to store such timing information and then calculate histograms and percentiles over periods. Prometheus offers two ways to do this, both flawed.

  • the Histogram type: requires that you choose your buckets up front, and any greater precision is lost. E.g. <10ms, 10ms–100ms, >100ms: now you can’t tell what proportion of events took longer than 80ms, nor can you accurately say what the 95th percentile value is (see the sketch after this list).
  • the Summary type: precomputes percentiles, but again, you need to choose the quantiles up front. Even worse than the Histogram type, you cannot aggregate the values across instances, as they are already percentiles.
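
To make the first limitation concrete, here is roughly what the Histogram type looks like in the Python client; the bucket boundaries mirror the example above and are fixed at declaration time:

    import time

    from prometheus_client import Histogram

    # Buckets chosen up front: everything between 10ms and 100ms collapses into
    # one bucket, so "what proportion took longer than 80ms?" is unanswerable.
    API_LATENCY = Histogram(
        "api_latency_seconds",
        "API call latency",
        buckets=(0.01, 0.1, float("inf")),
    )

    def handle_request():          # hypothetical request handler
        time.sleep(0.05)

    with API_LATENCY.time():       # observation lands in a fixed bucket
        handle_request()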

Both of these types are limited because Prometheus cannot collect discrete values; it can only expose metrics for collection on the schedule of the server.

What we really need is to emit and record a timing event for every single measurement. Then we can retrospectively aggregate across multiple systems. This is something we could do with the basic flat-file system we had at first.
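
With raw per-event records, exact percentiles can be computed after the fact, over any window and across any set of hosts; a toy illustration:

    import math

    def percentile(samples, p):
        """Exact p-th percentile (nearest-rank method) over raw timing samples."""
        ordered = sorted(samples)
        k = math.ceil(p / 100 * len(ordered))
        return ordered[k - 1]

    # Raw latencies (ms) merged retrospectively from several hosts.
    latencies = [12.0, 85.3, 9.7, 140.2, 33.1, 78.8, 101.5]
    print(percentile(latencies, 95))   # no bucket boundaries involved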

The next promising solution appears to be TimescaleDB, which is basically a Postgres extension that makes storing time-series data in Postgres efficient. Because of this, we can also store high-cardinality fields and query them richly – in theory at least…
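
A rough sketch of how that might look from Python (the table and labels are made up; create_hypertable is TimescaleDB’s own function and percentile_cont is plain Postgres):

    import psycopg2

    conn = psycopg2.connect("dbname=metrics")   # hypothetical connection string
    cur = conn.cursor()

    # One row per event keeps full precision and full cardinality.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS api_timings (
            time        TIMESTAMPTZ NOT NULL,
            endpoint    TEXT,
            user_id     TEXT,
            duration_ms DOUBLE PRECISION
        );
    """)
    cur.execute("SELECT create_hypertable('api_timings', 'time', if_not_exists => TRUE);")
    conn.commit()

    # Exact 95th percentile per endpoint over the last hour, computed at query time.
    cur.execute("""
        SELECT endpoint,
               percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms)
        FROM api_timings
        WHERE time > now() - interval '1 hour'
        GROUP BY endpoint;
    """)
    print(cur.fetchall())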