Organizations

@bosun-monitor
muffix/README.md

Hi. 👋

I'm a Software Engineer with Datadog, based in Munich, Germany. 🥨 Ex-Skyscanner.

Find me on these networks

Stack Overflow LinkedIn Twitter Meetup

Buy me a coffee

If you found any of my work useful, you might want to consider buying me a coffee.

Buy Me A Coffee

Blog posts

I occasionally write engineering blog posts. Sometimes my colleagues blog about our work, too.

Roll up to speed up: Improving OpenTSDB query performance

How we improved query performance for Skyscanner's OpenTSDB cluster, and enabled queries that were previously impossible to serve, by reducing the resolution of historic data.

👉 Roll up to speed up: Improving OpenTSDB query performance
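
To give a rough idea of what "reducing the resolution" means, here is a minimal sketch that rolls raw data points up into hourly aggregates. It is an illustration only, not the code of the Spark job: the function name `roll_up`, the one-hour interval and the choice of aggregates (sum, count, min, max) are assumptions made for this example.

```python
from collections import defaultdict


def roll_up(points, interval_seconds=3600):
    """Group (timestamp, value) pairs into fixed intervals and aggregate each one."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align the timestamp to the start of its interval.
        bucket_start = timestamp - timestamp % interval_seconds
        buckets[bucket_start].append(value)
    return {
        start: {
            "sum": sum(values),
            "count": len(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical raw data: (unix timestamp in seconds, value).
raw = [(0, 1.0), (60, 3.0), (3599, 2.0), (3600, 10.0), (7100, 20.0)]
hourly = roll_up(raw)
```

A query over a long time range then reads one aggregate per hour instead of every raw data point, which is where the speed-up comes from.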

The problem that wasn’t there — and the Bosun alerts that were

By Annette Wilson

Annette blogged about phantom alerts that our alerting solution Bosun would fire every so often, paging on-call engineers but turning out to be false every time. The alert condition that had triggered the alert would recover on the next evaluation, only a split second later. Subsequent investigation, including resubmitting the exact same query, wouldn't show any sign of a problem, let alone the alert condition being met.

It had been annoying us for two years, but it happened infrequently enough that any investigation efforts were regularly abandoned without meaningful results until years later. It was mysterious and interesting enough to still blog about, though. Also, we really wanted to sleep comfortably again without the threat of being woken up by a false alert. The blog post describes the problem and, in an addendum, how I finally found the root cause.

TL;DR - Expand here to show the root cause if you don't like exciting stories

Our initial suspicion of a bug in Bosun turned out to be incorrect. When our time series database OpenTSDB serves a query, it uses 8 scanners to fetch the required data from HBase asynchronously and then merges their results before returning the response to the client.

The scanners write their results to a map. The data structure used to generate the keys for these results, however, wasn't thread-safe and, in a rare race condition, could return the same key for two scanners, which meant that one overwrote the other's results. Bosun then received incomplete data and the alert went into an unknown state, paging the on-call engineer.

The unspectacular fix can be seen in OpenTSDB/opentsdb#1754.
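
The effect of the duplicate key can be sketched in a few lines. This is not OpenTSDB's code and not the actual fix: all names (`run_scanners`, `LockedKeyAllocator`) are hypothetical, the race is simulated deterministically by a key source that hands out the same key twice, and the lock-based allocator is just one common way to make key generation thread-safe.

```python
import threading


def merge_scanner_results(results_by_key):
    """Merge the per-scanner result lists into one sorted list of data points."""
    merged = []
    for points in results_by_key.values():
        merged.extend(points)
    return sorted(merged)


def run_scanners(scanner_outputs, next_key):
    """Store each scanner's output under the key that next_key() hands out."""
    results_by_key = {}
    for points in scanner_outputs:
        # If two scanners receive the same key, the later write
        # silently replaces the earlier scanner's results.
        results_by_key[next_key()] = points
    return merge_scanner_results(results_by_key)


class LockedKeyAllocator:
    """Hands out unique keys; the lock makes read-and-increment atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = 0

    def __call__(self):
        with self._lock:
            key = self._next
            self._next += 1
            return key


# Hypothetical output of 8 scanners, one data point each.
scanner_outputs = [[i] for i in range(8)]

# Simulated race: the key source returns key 3 twice.
racy_keys = iter([0, 1, 2, 3, 3, 5, 6, 7])
incomplete = run_scanners(scanner_outputs, lambda: next(racy_keys))

# With unique keys, every scanner's data survives the merge.
complete = run_scanners(scanner_outputs, LockedKeyAllocator())
```

In the simulated race, the data point from the fourth scanner is missing from `incomplete`, while `complete` contains all eight. From the client's point of view the query simply returned less data than it should have, with no error to hint at the cause.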

👉 The problem that wasn’t there — and the Bosun alerts that were

Pinned

  1. Skyscanner/OpenTSDB-rollup (Public)

    Spark job generating rollup data points from a snapshot of an OpenTSDB raw data table

    Java 3 3

  2. pre-commit-hooks (Public)

    A collection of pre-commit hooks that I find useful

    Shell 1 2

  3. byImpf (Public archive)

    Script to check for and book vaccination appointments in Bavarian vaccination centres

    Python 2

  4. alfred_timestamps (Public)

    A workflow to convert timestamps

    Rust 1

  5. pyfred-cli (Public)

    Build Python workflows for Alfred with ease

    Python 2

  6. couleurbummel (Public)

    Source code for the app Couleurbummel

    TypeScript 18 1