Nassim Taleb is famous for a few things - as the author of the 2007 blockbuster "The Black Swan," he examined the impact of rare, high-impact events and how they tend to be explained away in retrospect with pithy or simplistic narratives. He's also renowned for his reluctance to give interviews, so it's a pleasure to hear him expound on his theories at length in this recording with James Altucher.
https://nassimtaleb.org/2014/09/podcast-nassim-taleb-james-altucher-show/#.Ws3gcNPwai4
Melting old ladies
Of possible interest to monitoring is his simple explanation of what averages can hide, particularly where humans are concerned. Take the example of an old woman who likes the temperature to be a comfortable 70 degrees Fahrenheit. Does an average of 70 degrees sound reasonable? Let's examine a possible set of data: if it's 0 degrees half the time and 140 degrees the other half, we've achieved an average of 70 degrees, but our little old lady has unfortunately perished, either frozen solid or burned to a crisp. Whoops!
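To make the arithmetic concrete, here's a toy sketch in Python; the 0 and 140 degree readings are just the made-up values from the example above:

```python
# Toy illustration: the average looks comfortable while the extremes are lethal.
temperatures_f = [0, 140]  # hypothetical half-and-half readings from the example

average = sum(temperatures_f) / len(temperatures_f)
print(average)                                   # 70.0 - fine on paper
print(min(temperatures_f), max(temperatures_f))  # 0 140 - not fine for the old lady
```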
The bell curve is a Gaussian curve representing normally distributed data. If your data fits a normal distribution you can do useful things like anomaly detection*, because the distribution tells you how frequently you should expect to see certain values. If we imagine that this somehow represents temperature variation in a different version of our old lady example, the 0-degree and 140-degree events would fall on either side in the "very rare" category, and our old lady would probably be pretty happy within a standard deviation of the mean - the fat bit in the middle of the bell curve, roughly covering a range of ~50 to 100 degrees. (She might get a little uncomfortable, but is unlikely to melt like the Nazi in Raiders of the Lost Ark.)
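As a rough sketch of what that buys you, the snippet below flags anything more than three standard deviations from the mean as an anomaly. The mean of 75 and standard deviation of 10 are invented values for the old-lady scenario, not anything from the podcast:

```python
import random
import statistics

# Sketch of naive anomaly detection on (assumed) normally distributed data:
# flag anything more than three standard deviations from the mean.
# The mean of 75F and stddev of 10F are invented for this illustration.
random.seed(42)
readings = [random.gauss(75, 10) for _ in range(1000)]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

def is_anomaly(value, threshold=3):
    """True if `value` lies more than `threshold` standard deviations from the mean."""
    return abs(value - mean) > threshold * stdev

print(is_anomaly(78))   # False - comfortably inside the fat bit of the curve
print(is_anomaly(140))  # True  - the melting event
print(is_anomaly(0))    # True  - the freezing event
```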
Bell curves, averages and systems
So, for a lot of systems applications this turns out to be a poor model to follow. First of all, the data may not be normally distributed at all, but follow something closer to a long-tailed distribution, and the average of such a data set may not represent anything particularly useful. If we take something like the time it takes to process a request on your web server, your average may fall within your SLA criteria while large numbers of outliers pull it one way or another.
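A quick, hypothetical illustration: the sketch below generates lognormally distributed response times (a common stand-in for long-tailed latency data; the 6.5/0.6 parameters are arbitrary) and shows the mean being dragged away from what the typical request actually experienced:

```python
import random
import statistics

# Sketch: long-tailed (here lognormal) response times where the mean hides
# a slow tail. The 6.5/0.6 parameters are arbitrary, not real measurements.
random.seed(7)
latencies_ms = [random.lognormvariate(6.5, 0.6) for _ in range(10_000)]

print(round(statistics.mean(latencies_ms)))    # pulled upward by the tail
print(round(statistics.median(latencies_ms)))  # what the typical request saw
print(round(max(latencies_ms)))                # the outliers doing the pulling
```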
Percentiles
In situations like this, percentile data is useful. Let's take our web request timing example again. Viewing the data at the 50th percentile gives you a view of the median user experience: if the 50th percentile (median) response time is 750ms, that means 50% of your transactions are as fast as or faster than 750ms - which sounds alright, depending on what your site is doing. The 90th percentile view of the same data may be around 1250ms, which means that 90% of your requests complete at or under that speed, with the remaining 10% completing more slowly. You may have a 98th+ percentile request that is 1750ms or slower, and this might represent a lesser-used path such as a reporting function that takes a lot of time.
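For what it's worth, here's one way to read those percentiles off a batch of timings using Python's standard library; the latency values are placeholders rather than real measurements:

```python
import statistics

# Sketch: reading the 50th / 90th / 98th percentiles off a batch of request
# timings. The latency values here are placeholders, not real measurements.
latencies_ms = [420, 510, 640, 700, 750, 760, 900, 1100, 1250, 1800]

cut_points = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
p50, p90, p98 = cut_points[49], cut_points[89], cut_points[97]

print(f"p50={p50:.0f}ms  p90={p90:.0f}ms  p98={p98:.0f}ms")
```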
Using a percentile view of your data, you can see what the typical experience is for your users. A degradation in the 50th percentile from 750ms to 1000ms means that half of your requests just got roughly a third slower (a 250ms jump on a 750ms baseline), and you probably need to start looking into it.
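A minimal sketch of what such a check might look like, assuming a hypothetical 750ms baseline and a made-up tolerance (real monitoring tools give you this kind of threshold alerting, but the logic is roughly this):

```python
# Sketch of a naive check on median latency; the 750ms baseline and the 1.25x
# tolerance are made-up numbers, not recommendations.
BASELINE_P50_MS = 750
TOLERANCE = 1.25

def p50_degraded(current_p50_ms):
    """True if the current median latency has drifted past the tolerated baseline."""
    return current_p50_ms > BASELINE_P50_MS * TOLERANCE

print(p50_degraded(760))   # False - within tolerance
print(p50_degraded(1000))  # True  - time to start looking into it
```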
* = OK, there are methods for anomaly detection on non-Gaussian data too, but a normal distribution makes it a lot easier.