1 Star 0 Fork 0

风的灵动 / micrometer

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
克隆/下载
NOTES.md 2.73 KB
一键复制 编辑 原始数据 按行查看 历史
Jon Schneider 提交于 2017-06-29 09:12 . Add notes from Netflix meeting

Netflix 6/21/17

Alerting

  • Support triggering on one query, sending alert notifications on another. One is to discover a problem exists, the other to provide actionable insight quickly.
  • Alert queries are often instance bound, e.g. :group-by instance to generate a signal per instance. Healthy instances will tend to cluster together in a mass of color, and the bad instance(s) will be visibly divergent.
  • Look at :rolling-count.

ACA

  • Most canaries run 1-4 hours. Extreme cases run 96 hours. Shorter windows are recommended, as bad canaries are receiving production traffic -- should be failed as soon as possible.
  • ACA at Netflix is approaching limits on how many queries it is allowed to perform against Atlas in a given window. To improve, could combine queries to baseline and canary in one and split the results in ACA. Also, may be able to use LWC streaming?
  • Good canary metrics include box level metrics, error counting metrics with sufficient volume to present a statistically viable comparison.
  • Because a normal distribution cannot be assumed, uses Mann-Whitney U Test.
  • Use canary analysis to inform alerts -- if a canary is critically failing, even if it only lasts 4 hours, shouldn't somebody have been alerted on failures before?
  • Good ACA criteria = U.S.E. = Utilization, Saturation, Error (Brendan Gregg)
  • Some teams basing ACAs off of high-dimension data stored in Druid.

Atlas

  • Look at :dist-max when concerned about the absolute max in a given window. The max statistic shown on a graph is the maximum value that the plot yields, but this may be the max average value if the plot is :avg, for example. :dist-max plots the maximum sample at each step. So, max of :dist-max is the maximum sample seen along the plot's x-axis.
  • Constant-time lookup function on buckets is important.
  • Bucket functions lead to a mergeable quantile approximation.
  • There may be a static 276 bucket histogram that leads to good error bounds on quantile approximation for majority of use cases.
  • Standard deviation calculation often exhibits high error bounds because of cliffs:
    • Left-side cliff on payload size that represents minimum header size
    • Right-side cliff on latency that represents HTTP timeout
    • For a latency timer across all endpoints in an app, distribution can be wildly non-normal because of different levels of computation and I/O across those endpoints.
  • Say no to t-digests.
  • Counters not decrementable
  • Look at :cq, :list, :each for an easy way to tack on additional criteria from a dashboard-building app without understanding the existing structure of the query.
  • :dist-avg does the totalTime/count division math for you.
  • r3.2xlg with 60GB RAM capable of managing 2M time series over 6 hours.
Java
1
https://gitee.com/king019/micrometer.git
git@gitee.com:king019/micrometer.git
king019
micrometer
micrometer
master

搜索帮助