Releases: opstrace/opstrace
Opstrace v2021.11.17
Commits compared to the last release (`v2021.09.17`) are listed here.
Breaking changes
Opstrace now requires you to set up your own (free tier) Auth0 integration. We have also decided to not support the magic `*.opstrace.io` DNS names through our DNS reconfiguration service anymore.
Hence, the following Opstrace install-time configuration parameters are now required:
custom_dns_name
custom_auth0_client_id
custom_auth0_domain
The custom Auth0 integration is rather easy to set up. Accompanying this release, we have prepared a detailed guide for you: Opstrace with a custom Auth0 integration.
For documentation about the `custom_dns_name` parameter we'd like to direct you to the configuration reference.
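Because these three parameters are now mandatory, it can help to check a configuration before starting an installation. The following is a minimal sketch, assuming the install-time configuration has already been loaded into a Python dictionary; the helper `missing_required_params` is hypothetical and not part of the Opstrace CLI.

```python
# Hypothetical pre-flight check; this helper is NOT part of the Opstrace CLI.
# The three parameter names are the ones listed as required in this release.
REQUIRED_PARAMS = (
    "custom_dns_name",
    "custom_auth0_client_id",
    "custom_auth0_domain",
)


def missing_required_params(config: dict) -> list:
    """Return the required parameter names that are absent or empty in config."""
    return [name for name in REQUIRED_PARAMS if not config.get(name)]
```

An empty return value means all three parameters are present; any other result names the parameters that still need to be set.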
Component version bumps
- Grafana was updated from v8.1.3 to v8.1.6.
Other changes
- We have started to rewrite 5xx HTTP errors emitted by Cortex into non-retryable errors to allow larger-scale systems to recover from downtime. This is a significant change that demands discussion. Also see #1410 and #1409.
- We updated a number of library dependencies across all components to address a variety of CVE scanner warnings.
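To illustrate the idea behind the 5xx rewrite mentioned above: clients such as Prometheus remote write typically retry requests that fail with a 5xx status, and give up on most 4xx statuses. The sketch below shows the general technique of mapping any 5xx status to a non-retryable one. It is an assumption-laden illustration only: the function name is hypothetical, and the replacement status (400 here) is a placeholder, since these notes do not state which status Opstrace actually emits or where in the request path the rewrite happens.

```python
def rewrite_status(upstream_status: int, non_retryable_status: int = 400) -> int:
    """Map any 5xx upstream status to a status that clients will not retry.

    The default of 400 is a placeholder assumption, not the value used by Opstrace.
    """
    if 500 <= upstream_status <= 599:
        return non_retryable_status
    # All other statuses are passed through unchanged.
    return upstream_status
```

The trade-off is that clients drop the affected data instead of queueing it, which reduces the retry load on a system that is trying to recover. See #1410 and #1409 for the actual discussion.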
Opstrace v2021.09.17
The full set of commits compared to the last release (`v2021.08.13`) is listed here.
What's new
- We added an Integration for monitoring an external CockroachDB instance (#1321).
- We started instrumenting the Opstrace controller software with Prometheus metrics (#1322) and added a corresponding dashboard (#1356).
Component version bumps
- Loki received a version bump from b3d7740 to a4b8974.
- Cortex was updated from e658571 to 74055d8.
- Grafana was updated from v8.1.1 to v8.1.3.
Fixed and improved
Core:
- We fixed a regression introduced in the last release that made the Loki WebSocket endpoint for live-tailing logs unavailable (#1329). We also added a corresponding regression test.
- We fixed a Loki query response latency regression on AWS by setting the EC2 instance option `HttpPutResponseHopLimit` from `1` (default) to `2`. This is now done as part of the initial `create`, but also as part of `upgrade`. Note that the change applied during an upgrade does not persist when an EC2 instance is lost unexpectedly. At the heart of the issue was the `aws-go-sdk` taking around one minute for obtaining security credentials from the EC2 instance metadata service. It spent most of that time hopelessly retrying against a firewalling technique introduced in version 2 of the EC2 Instance Metadata Service. The full story is exciting and can be read in #1382. The debugging effort was motivated by an unstable test.
- The custom Auth0 integration is now (hopefully) functional, via introduction of a new `custom_auth0_domain` install-time parameter (#1380, #1175).
- A generic Kubernetes `StatefulSet` readiness check was improved for addressing an upgrade issue (#1296, #1294).
- We tweaked the GRPC config used for Loki and Cortex components to reduce the likelihood for a `ENHANCE_YOUR_CALM, debug data: too_many_pings` error (#1362).
- A number of system-internal alerts were tweaked (#1311, #1366, #1374, and others).
- We started changing the approach for issuing per-tenant TLS certificates to allow for easier state changes after the initial creation (#1371, #923).
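For readers who want to apply the `HttpPutResponseHopLimit` change from the Loki latency fix above to an individual EC2 instance by hand, the following is a minimal sketch. It assumes a boto3 EC2 client is passed in and uses its `modify_instance_metadata_options` call; the wrapper function name and the instance ID in the usage comment are hypothetical, and this is not how the Opstrace CLI itself (which is not written in Python) performs the change.

```python
def set_metadata_hop_limit(ec2_client, instance_id: str, hop_limit: int = 2):
    """Set the instance metadata PUT response hop limit for one EC2 instance.

    ec2_client is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    return ec2_client.modify_instance_metadata_options(
        InstanceId=instance_id,
        HttpPutResponseHopLimit=hop_limit,
    )


# Usage (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   set_metadata_hop_limit(boto3.client("ec2"), "i-0123456789abcdef0")
```

As noted in the list above, a change applied to a running instance this way does not carry over to a replacement instance.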
CLI:
opstrace create
- GCP: the set of service connections is now logged before and after creation for enhanced debuggability (#1287).
opstrace destroy
UI:
- Error handling improvements landed for:
- login: better auto-healing of short transient issues around the flow-concluding HTTP POST request via the introduction of purpose-optimized retrying parameters (#1280).
- authentication state inspection: the interaction with `/_/auth/status` is now more robust via a change in tooling and retrying parameters (#1282).
- installing and uninstalling an Integration (#1295, #1209).
- managing users and tenants (#1325, #1333, #1364, #1365, #1375, and others).
- the YAML document download feature (#1324).
- dark mode configuration (#1384).
- The Cortex ingest URL on the Getting Started page was fixed (#1397).
Developer experience and QA
This section does not aim for completeness. Yet, we'd like to point out some significant changes around developer experience and testing.
- We observed, debugged, and fixed a bunch of non-trivial CI instabilities, and also addressed build job duration regressions (#1286, #1285).
- test-remote: headless browser interaction now collects browser console contents for enhanced debuggability (#1282).
- We moved to using Golang 1.17 (#1305). We also transitioned to using TypeScript 4.4.x and bumped a number of dev tools (#1303).
- Looker, our Loki / Cortex testing and benchmarking tool, saw significant development (#1310, #1315, #1353, #1361, #1363, #1370, #1377, #1389).
Opstrace v2021.08.13
The full set of commits compared to the last release (`v2021.07.23`) is listed here.
What's new
- We integrated the opstrace/cortex-operator to manage the lifecycle of Cortex components in Opstrace. For now this is not expected to result in any user-facing behavioral change. However, it paves the pathway towards architectural consolidation and robustness.
- Opstrace now comes with the Loki query frontend (#1140). From the Loki documentation: "One of the most important functions of the query frontend is the ability to split larger queries into smaller ones, execute them in parallel, and stitch the results back together."
- The CLI now has an `info` command to inspect the version information of the various components an Opstrace instance is comprised of. Thanks to Eric Stroczynski for the contribution (#1047).
- The Kubernetes Log Integration now supports container logs in both, the `CRI/containerd` format and in the `dockerd` format. A radio button was added to the UI for you to make a choice (#1141).
- The UI now shows a dashboard for pod metrics from the Kubernetes Metrics Integration (#1077).
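The split, parallelize, and stitch behavior quoted from the Loki documentation above can be illustrated with a small, generic sketch. This is not Loki's implementation: the function names are hypothetical, splitting is done by time range only, and `run_query` stands in for whatever executes a sub-query and returns a list of results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_time_range(start: int, end: int, step: int) -> list:
    """Split the half-open range [start, end) into consecutive chunks of at most step."""
    if step <= 0:
        raise ValueError("step must be positive")
    intervals = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        intervals.append((cursor, upper))
        cursor = upper
    return intervals


def run_split_query(run_query, start: int, end: int, step: int) -> list:
    """Run one sub-query per chunk in parallel and concatenate the results in order."""
    intervals = split_time_range(start, end, step)
    with ThreadPoolExecutor() as pool:
        # pool.map yields results in input order, so stitching is a plain concatenation.
        partial_results = list(pool.map(lambda iv: run_query(*iv), intervals))
    return [item for part in partial_results for item in part]
```

For example, `split_time_range(0, 10, 4)` yields `[(0, 4), (4, 8), (8, 10)]`, so one large query becomes three smaller ones.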
Component version bumps
- Loki received a significant version bump from 1ed19f7 to b3d7740 (the `v2.3.0` release plus a few commits).
- Cortex was updated from fde0a62 to e658571 (the `v1.10.0` release plus a Thanos version bump).
- Grafana was updated from v8.0.6 to v8.1.1.
Security fixes
- We fixed a vulnerability in the tenant API authenticator which allowed for cross-tenant data writing. Exploiting this required holding valid authentication proof for one of the tenants (#1144).
- We fixed a vulnerability in the UI login where user information was consumed from a non-trustworthy part of the login HTTP request, instead of consuming it from a cryptographically signed artifact.
Fixed and improved
CLI:
opstrace create
- Error handling around GCP service connection creation (between the Opstrace instance VPC and a Cloud SQL instance) was improved. This is to gain insight into how exactly the service connection creation may fail, and for better retrying (#1197).
- For AWS, we added support for the `ap-northeast-3` region and removed support for the `cn-*` regions because of instabilities and API discrepancies. Please chime in on #1202 if you have opinions on this topic.
opstrace destroy
- GCP: we addressed a problem where DNS managed zones were not properly deleted (#1198).
opstrace upgrade
- The command now exits early when the current and the new versions match (#1225).
UI:
- Login:
- The "Access Denied" page is now only shown when there is a factual lack of privilege. Previously, this page was erroneously shown for authentication issues and transient issues of various kinds as well as for internal server errors (#1272).
- A new "Login Error" view was added for all login errors that are not related to a lack of privilege (#1272).
- Client-side React state management around login was consolidated. This is to address a set of problems leading to the display of just a white screen upon or during login, a symptom frequently observed by users (#1115).
- The server-side login routine robustness was enhanced by making the JSON Web Key Set fetcher more resilient to transient issues (#1224).
- Error handling improvements landed for:
- We added ingest URLs to the Getting Started page.
- The UI now shows build information.
- A number of WebSocket setup errors shown in the browser console were fixed.
Core:
- A rare error condition was fixed in which the controller could be evicted (#1152).
- The controller log output was enhanced for better debuggability of stuck deployments (#1099).
- Fixed a condition in which the controller log output could become huge (#1232).
Documentation:
- We added a new section to the "Configuring Alertmanager" user guide about using the new unified alerting UI for configuring and managing alerts.
Notable changes
- New Opstrace installations on GCP now use GKE version 1.19 (#1171). Note that Opstrace uses GKE's `STABLE` release channel and also GKE's default of having auto-upgrades for the API server as well as node pools enabled. If you upgrade from an older Opstrace release to this Opstrace release then you are probably still using GKE 1.18 under the hood -- we have prepared a special upgrade path which keeps the Opstrace system log collection compatible with both, GKE 1.18 and 1.19 (which is non-trivial, because the container log format changed between both versions), see #1264.
Developer experience and QA
This section does not aim for completeness. Yet, we'd like to point out some significant changes around developer experience and testing.
- CI now enforces `prettier --check` to pass for a large fraction of our TypeScript code base.
- The README for `test-remote` was overhauled.
- The README for UI development was improved.
- VSCode workspace settings were consolidated and better documented.
- UI TypeScript typings underwent a cleanup. Thanks @MoSattler.
- We added a bunch of automated tests for the UI, for example around folder deletion.
Opstrace v2021.07.23
The full set of commits compared to the last release (`v2021.07.02`) is listed here.
What's new
- Added support for custom DNS infrastructure. Use this when your goal is to reach the Opstrace instance under a custom DNS name, using DNS infrastructure managed entirely by you. To enable this, set the new install-time configuration parameter `custom_dns_name` (see reference docs).
- Added support for custom Auth0 integration. Use this when you want to log in to the web UI of your Opstrace instance via a custom Auth0 application. That for example allows for single sign-on against a custom identity provider. To enable this, set the new install-time configuration parameter `custom_auth0_client_id` (see reference docs).
- The tenant API authentication token can now be communicated via the `Basic` HTTP authentication scheme: set the token as the password and pick an arbitrary user name. This allows for using client software that does not support the `Bearer` authentication scheme (such as certain plugins from the FluentD/Fluent Bit ecosystem).
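The `Basic` scheme described above can be sketched as follows: the tenant API token goes into the password slot and the user name is arbitrary. This is a minimal illustration using only the Python standard library; the function name and the default user name are hypothetical, and the token value must be supplied by you.

```python
import base64


def basic_auth_header(token: str, username: str = "opstrace") -> dict:
    """Build an HTTP Basic Authorization header carrying the token as the password.

    The user name is arbitrary, but must not contain a colon, because the
    Basic scheme uses the first colon to separate user name and password.
    """
    if ":" in username:
        raise ValueError("username must not contain ':'")
    credentials = f"{username}:{token}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + encoded}
```

The returned dictionary can be passed as request headers to an HTTP client of your choice when talking to a tenant API endpoint.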
Fixed and improved
Integrations:
- Kubernetes metrics integration: for better insight into container resource utilization, the following three kubelet scrape targets have been added: `/metrics/cadvisor`, `/metrics/probes`, `/metrics/resource` (#1070).
Core:
- Fixed a rare failure scenario where frequent crash-looping of Cortex/Loki pods could result in cross-pollination between the Cortex/Loki memberlist rings, resulting in ingest API downtime (#1053).
- `max_inflight_push_requests` is now set for Cortex distributors and ingesters to decrease the likelihood for them to run out of memory when exposed to highly concurrent push load.
- The tenant creation API now rejects invalid tenant names. Previously, a newly added tenant with an invalid name would silently result in a dysfunctional state (#733, #710).
- The number of Cortex ingesters is now chosen more sanely for Opstrace instances with a large machine count (#1046).
CLI:
- The `create` command does not overwrite the Opstrace controller config anymore. This is a change towards making it safer to invoke `create` more than once (#20, #1050).
- The global `status` operation timeout was increased from one minute to five minutes to make it more resilient towards transient errors such as DNS name resolution hiccups (#981).
UI:
- Fixed a scenario where no tenant was selected (#1031).
- Improved error handling in the ring health section (#970).
Developer experience and QA
This section does not aim for completeness. Yet, we'd like to point out some significant changes around developer experience and testing.
- Upgrade tests are now executed with every pull request.
- The 'custom DNS' capability is tested with every pull request (in the GCP-based CI run).
- The NodeJS runtime version across all components has been changed from 14.x to 16.x. This affects CLI, controller, the UI's web application, and test runners.
- The TypeScript compiler `tsc` was bumped from 4.1 to 4.3 across the board.
- The 'build info' system was changed for improving `tsc` output cachability. For local development, run `make set-build-info-constants` and then set the environment variable `OPSTRACE_BUILDINFO_PATH` to point to the `buildinfo.json` file at the root of the repository.
Opstrace v2021.07.02
Commits compared to the last release (`v2021.06.25`) are listed here.
Opstrace v2021.06.25
Commits compared to the last release (`v2021.06.11`) are listed here.
Opstrace v2021.06.11
Commits compared to the last release (`v2021.06.04`) are listed here.
Opstrace v2021.06.04
Commits compared to the last release (`v2021.05.28`) are listed here.
Opstrace v2021.05.28
First Opstrace release via GitHub.