This repository has been archived by the owner on Jan 12, 2022. It is now read-only.

Releases: opstrace/opstrace

Opstrace v2021.11.17

17 Nov 20:25

Commits compared to the last release (v2021.09.17) are listed here.

Breaking changes

Opstrace now requires you to set up your own (free tier) Auth0 integration. We have also decided to no longer support the magic *.opstrace.io DNS names through our DNS reconfiguration service.

Hence, the following Opstrace install-time configuration parameters are now required:

  • custom_dns_name
  • custom_auth0_client_id
  • custom_auth0_domain

The custom Auth0 integration is rather easy to set up. Accompanying this release, we have prepared a detailed guide for you: Opstrace with a custom Auth0 integration.

For documentation about the custom_dns_name parameter we'd like to direct you to the configuration reference.
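As a rough illustration of the new requirement, the following Python sketch checks a parsed install-time configuration for the three parameters named above. The helper name, the dictionary shape, and the placeholder values are assumptions made for this example; this is not part of the Opstrace CLI.

```python
# Hypothetical pre-flight check; not part of the Opstrace CLI.
REQUIRED_PARAMS = (
    "custom_dns_name",
    "custom_auth0_client_id",
    "custom_auth0_domain",
)


def missing_required_params(config: dict) -> list:
    """Return the required install-time parameters that are absent or empty."""
    return [name for name in REQUIRED_PARAMS if not config.get(name)]


# Placeholder values for illustration only.
example_config = {
    "custom_dns_name": "opstrace.example.com",
    "custom_auth0_client_id": "",
}

# Reports the two parameters that still need a value.
print(missing_required_params(example_config))
```

Running such a check before invoking create would surface a missing parameter early rather than partway through an installation.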

Component versions bumps

Other changes

  • We have started to rewrite 5xx HTTP errors emitted by Cortex into non-retryable errors to allow larger-scale systems to recover from downtime. This is a significant change that demands discussion. Also see #1410 and #1409.
  • We updated a number of library dependencies across all components to address a variety of CVE scanner warnings.
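The idea behind the 5xx rewriting can be sketched as follows. Prometheus remote_write clients retry on 5xx responses but treat most 4xx responses as final, so mapping a 5xx to a 4xx tells the sender to drop the batch instead of re-sending it. The release notes do not state which status code Opstrace uses; the value 400 and the function name below are assumptions for illustration only.

```python
# Assumption: 400 stands in for whatever non-retryable code is actually used.
NON_RETRYABLE_STATUS = 400


def rewrite_status(upstream_status: int) -> int:
    """Map any 5xx status from the upstream to a non-retryable status.

    All other status codes pass through unchanged.
    """
    if 500 <= upstream_status <= 599:
        return NON_RETRYABLE_STATUS
    return upstream_status


for status in (200, 429, 500, 503):
    print(status, "->", rewrite_status(status))
```

The trade-off is that data sent during an outage is dropped rather than queued for retry, which is presumably why the change is described as demanding discussion.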

Opstrace v2021.09.17

17 Sep 13:27

The full set of commits compared to the last release (v2021.08.13) is listed here.

What's new

  • We added an Integration for monitoring an external CockroachDB instance (#1321).
  • We started instrumenting the Opstrace controller software with Prometheus metrics (#1322) and added a corresponding dashboard (#1356).

Component versions bumps

Fixed and improved

Core:

  • We fixed a regression introduced in the last release in which the Loki WebSocket endpoint for live-tailing logs was no longer available (#1329). We also added a corresponding regression test.
  • We fixed a Loki query response latency regression on AWS by changing the EC2 instance option HttpPutResponseHopLimit from 1 (the default) to 2. This is now done as part of the initial create, and also as part of upgrade. Note that the change applied during an upgrade does not persist when an EC2 instance is lost unexpectedly. At the heart of the issue was the AWS SDK for Go (aws-sdk-go) taking around one minute to obtain security credentials from the EC2 instance metadata service. It spent most of that time hopelessly retrying against a firewalling technique introduced in version 2 of the EC2 Instance Metadata Service. The full story is exciting and can be read in #1382. The debugging effort was motivated by an unstable test.
  • The custom Auth0 integration is now (hopefully) functional, via introduction of a new custom_auth0_domain install-time parameter (#1380, #1175).
  • A generic Kubernetes StatefulSet readiness check was improved for addressing an upgrade issue (#1296, #1294).
  • We tweaked the gRPC config used for Loki and Cortex components to reduce the likelihood of an ENHANCE_YOUR_CALM, debug data: too_many_pings error (#1362).
  • A number of system-internal alerts were tweaked (#1311, #1366, #1374, and others).
  • We started changing the approach for issuing per-tenant TLS certificates to allow for easier state changes after the initial creation (#1371, #923).
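For readers who want to inspect or reapply the HttpPutResponseHopLimit setting by hand, for example after an EC2 instance was replaced, here is a minimal Python sketch. It assumes a boto3 EC2 client and uses its modify_instance_metadata_options call. The function name and the instance ID in the usage comment are placeholders, and this is not how Opstrace itself applies the setting.

```python
def set_metadata_hop_limit(ec2_client, instance_id: str, hop_limit: int = 2):
    """Set the instance metadata PUT response hop limit for one EC2 instance.

    `ec2_client` is expected to behave like boto3.client("ec2").
    """
    # The EC2 API accepts hop limits from 1 to 64.
    if not 1 <= hop_limit <= 64:
        raise ValueError("hop limit must be between 1 and 64")
    return ec2_client.modify_instance_metadata_options(
        InstanceId=instance_id,
        HttpPutResponseHopLimit=hop_limit,
    )


# Usage (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   set_metadata_hop_limit(boto3.client("ec2"), "i-0123456789abcdef0")
```

The client is passed in rather than created inside the function so that the call can be exercised against a stub without touching AWS.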

CLI:

  • opstrace create
    • GCP: the set of service connections is now logged before and after creation for enhanced debuggability (#1287).
  • opstrace destroy
    • GCP: the region to destroy in can now be specified via --region (#1291). Note that multi-region support for GCP is not yet properly tested.
    • GCP: global address teardown has been consolidated (#1320, #976).

UI:

  • Error handling improvements landed for:
    • login: better auto-healing of short transient issues around the flow-concluding HTTP POST request via the introduction of purpose-optimized retrying parameters (#1280).
    • authentication state inspection: the interaction with /_/auth/status is now more robust via a change in tooling and retrying parameters (#1282).
    • installing and uninstalling an Integration (#1295, #1209).
    • managing users and tenants (#1325, #1333, #1364, #1365, #1375, and others).
    • the YAML document download feature (#1324).
    • dark mode configuration (#1384).
  • The Cortex ingest URL on the Getting Started page was fixed (#1397).

Developer experience and QA

This section does not aim for completeness. Yet, we'd like to point out some significant changes around developer experience and testing.

  • We observed, debugged, and fixed a bunch of non-trivial CI instabilities, and also addressed build job duration regressions (#1286, #1285).
  • test-remote: headless browser interaction now collects browser console contents for enhanced debuggability (#1282).
  • We moved to using Golang 1.17 (#1305). We also transitioned to using TypeScript 4.4.x and bumped a number of dev tools (#1303).
  • Looker, our Loki / Cortex testing and benchmarking tool, saw significant development (#1310, #1315, #1353, #1361, #1363, #1370, #1377, #1389).

Opstrace v2021.08.13

13 Aug 15:33

The full set of commits compared to the last release (v2021.07.23) is listed here.

What's new

  • We integrated the opstrace/cortex-operator to manage the lifecycle of Cortex components in Opstrace. For now this is not expected to result in any user-facing behavioral change. However, it paves the way towards architectural consolidation and robustness.
  • Opstrace now comes with the Loki query frontend (#1140). From the Loki documentation: "One of the most important functions of the query frontend is the ability to split larger queries into smaller ones, execute them in parallel, and stitch the results back together."
  • The CLI now has an info command to inspect the version information of the various components an Opstrace instance consists of. Thanks to Eric Stroczynski for the contribution (#1047).
  • The Kubernetes Log Integration now supports container logs in both the CRI/containerd format and the dockerd format. A radio button was added to the UI for you to make a choice (#1141).
  • The UI now shows a dashboard for pod metrics from the Kubernetes Metrics Integration (#1077).
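The split, parallelize, and stitch idea quoted from the Loki documentation above can be sketched in a few lines of Python. This is a conceptual illustration only and not Loki's implementation; the function names, the fixed worker count, and the list-based result format are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start: int, end: int, step: int) -> list:
    """Split [start, end) into consecutive sub-ranges of at most `step` units."""
    if step <= 0:
        raise ValueError("step must be positive")
    ranges = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        ranges.append((cursor, upper))
        cursor = upper
    return ranges


def run_split_query(query_fn, start: int, end: int, step: int) -> list:
    """Run query_fn on each sub-range in parallel and concatenate the results.

    pool.map yields results in input order, so the merged output keeps
    the original time ordering.
    """
    sub_ranges = split_range(start, end, step)
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_results = list(pool.map(lambda r: query_fn(*r), sub_ranges))
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged


def fake_query(lo: int, hi: int) -> list:
    """Stand-in for a backend query: returns one entry per time unit."""
    return list(range(lo, hi))


print(split_range(0, 10, 4))
print(run_split_query(fake_query, 0, 10, 4))
```

Smaller sub-queries bound the work done per backend request, at the cost of some coordination overhead in the frontend.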

Component versions bumps

Security fixes

  • We fixed a vulnerability in the tenant API authenticator which allowed for cross-tenant data writing. Exploiting this required holding valid authentication proof for one of the tenants (#1144).
  • We fixed a vulnerability in the UI login where user information was consumed from a non-trustworthy part of the login HTTP request, instead of consuming it from a cryptographically signed artifact.

Fixed and improved

CLI:

  • opstrace create
    • Error handling around GCP service connection creation (between the Opstrace instance VPC and a Cloud SQL instance) was improved. This is to gain insight into how exactly the service connection creation may fail, and for better retrying (#1197).
    • For AWS, we added support for the ap-northeast-3 region and removed support for the cn-* regions because of instabilities and API discrepancies. Please chime in on #1202 if you have opinions on this topic.
  • opstrace destroy
    • GCP: we addressed a problem in which DNS managed zones were not properly deleted (#1198).
  • opstrace upgrade
    • The command now exits early when the current and the new versions match (#1225).

UI:

  • Login:
    • The "Access Denied" page is now only shown when there is an actual lack of privilege. Previously, this page was erroneously shown for authentication issues and transient issues of various kinds as well as for internal server errors (#1272).
    • A new "Login Error" view was added for all login errors that are not related to a lack of privilege (#1272).
    • Client-side React state management around login was consolidated. This is to address a set of problems leading to the display of just a white screen upon or during login, a symptom frequently observed by users (#1115).
    • The server-side login routine robustness was enhanced by making the JSON Web Key Set fetcher more resilient to transient issues (#1224).
  • Error handling improvements landed for:
    • installing and uninstalling an Integration (#1199, #1138, #1088).
    • performing certain HTTP requests (#1188).
    • displaying the hash ring table (#1149).
  • We added ingest URLs to the Getting Started page.
  • The UI now shows build information.
  • A number of WebSocket setup errors shown in the browser console were fixed.

Core:

  • A rare error condition was fixed in which the controller might become evicted (#1152).
  • The controller log output was enhanced for better debuggability of stuck deployments (#1099).
  • We fixed a condition in which the controller log output could become huge (#1232).

Documentation:

  • We added a new section to the "Configuring Alertmanager" user guide about using the new unified alerting UI for configuring and managing alerts.

Notable changes

  • New Opstrace installations on GCP now use GKE version 1.19 (#1171). Note that Opstrace uses GKE's STABLE release channel and GKE's default of having auto-upgrades enabled for the API server as well as for node pools. If you upgrade from an older Opstrace release to this one, you are probably still using GKE 1.18 under the hood. We have prepared a special upgrade path that keeps the Opstrace system log collection compatible with both GKE 1.18 and 1.19, which is non-trivial because the container log format changed between the two versions (see #1264).

Developer experience and QA

This section does not aim for completeness. Yet, we'd like to point out some significant changes around developer experience and testing.

  • CI now enforces prettier --check to pass for a large fraction of our TypeScript code base.
  • The README for test-remote was overhauled.
  • The README for UI development was improved.
  • VSCode workspace settings were consolidated and better documented.
  • UI TypeScript typings underwent a cleanup. Thanks @MoSattler.
  • We added a bunch of automated tests for the UI, for example around folder deletion.

Opstrace v2021.07.23

23 Jul 13:22

The full set of commits compared to the last release (v2021.07.02) is listed here.

What's new

  • Added support for custom DNS infrastructure. Use this when your goal is to reach the Opstrace instance under a custom DNS name, using DNS infrastructure managed entirely by you. To enable this, set the new install-time configuration parameter custom_dns_name (see reference docs).
  • Added support for custom Auth0 integration. Use this when you want to log in to the web UI of your Opstrace instance via a custom Auth0 application. That allows, for example, for single sign-on against a custom identity provider. To enable this, set the new install-time configuration parameter custom_auth0_client_id (see reference docs).
  • The tenant API authentication token can now be communicated via the Basic HTTP authentication scheme: set the token as the password and pick an arbitrary user name. This allows for using client software that does not support the Bearer authentication scheme (such as certain plugins from the FluentD/Fluent Bit ecosystem).
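To illustrate the Basic scheme described in the last item, the following Python sketch builds the Authorization header with the tenant API token as the password and an arbitrary user name. The function name, the default user name, and the token value are placeholders for this example.

```python
import base64


def basic_auth_header(token: str, username: str = "anyuser") -> dict:
    """Build a Basic Authorization header carrying the token as the password.

    The user name is arbitrary but must not contain a colon, because the
    Basic scheme uses the first colon to separate user name and password.
    """
    if ":" in username:
        raise ValueError("username must not contain ':'")
    credentials = f"{username}:{token}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


# Placeholder token for illustration only.
headers = basic_auth_header("example-tenant-api-token")
print(headers)
```

Most HTTP clients and log shippers build this header themselves when given a user name and password, so in practice it is usually enough to configure the token as the password.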

Fixed and improved

Integrations:

  • Kubernetes metrics integration: for better insight into container resource utilization, the following three kubelet scrape targets have been added: /metrics/cadvisor, /metrics/probes, /metrics/resource (#1070).

Core:

  • Fixed a rare failure scenario where frequent crash-looping of Cortex/Loki pods could result in cross-pollination between the Cortex/Loki memberlist rings, resulting in ingest API downtime (#1053).
  • max_inflight_push_requests is now set for Cortex distributors and ingesters to decrease the likelihood for them to run out of memory when exposed to highly concurrent push load.
  • The tenant creation API now rejects invalid tenant names. Previously, a newly added tenant with an invalid name would silently result in a dysfunctional state (#733, #710).
  • The number of Cortex ingesters is now chosen more sanely for Opstrace instances with a large machine count (#1046).

CLI:

  • The create command no longer overwrites the Opstrace controller config. This is a change towards making it safer to invoke create more than once (#20, #1050).
  • The global status operation timeout was increased from one minute to five minutes to make it more resilient towards transient errors such as DNS name resolution hiccups (#981).

UI:

  • Fixed a scenario where no tenant was selected (#1031).
  • Improved error handling in the ring health section (#970).

Developer experience and QA

This section does not aim for completeness. Yet, we'd like to point out some significant changes around developer experience and testing.

  • Upgrade tests are now executed with every pull request.
  • The 'custom DNS' capability is tested with every pull request (in the GCP-based CI run).
  • The NodeJS runtime version across all components has been changed from 14.x to 16.x. This affects CLI, controller, the UI's web application, and test runners.
  • The TypeScript compiler tsc was bumped from 4.1 to 4.3 across the board.
  • The 'build info' system was changed to improve tsc output cacheability. For local development, run make set-build-info-constants and then set the environment variable OPSTRACE_BUILDINFO_PATH to point to the buildinfo.json file at the root of the repository.

Opstrace v2021.07.02

02 Jul 16:43

Commits compared to the last release (v2021.06.25) are listed here.

Opstrace v2021.06.25

25 Jun 14:35

Commits compared to the last release (v2021.06.11) are listed here.

Opstrace v2021.06.11

11 Jun 14:11

Commits compared to the last release (v2021.06.04) are listed here.

Opstrace v2021.06.04

04 Jun 13:42

Commits compared to the last release (v2021.05.28) are listed here.

Opstrace v2021.05.28

28 May 13:16

First Opstrace release via GitHub