Add docs on storage engine WAL failover #18511

Merged
merged 15 commits into from May 16, 2024

@rmloveland rmloveland commented May 1, 2024

Fixes:

Summary of changes:

  • Add a new section to cockroach start describing the WAL failover feature, how to enable/disable, and the related logging config changes that are needed if you enable the feature

  • Add a new section to 'Monitoring and Alerting' docs describing the store status endpoint at _status/stores

  • Update logging docs to add some anchor links so we can refer to specific config settings from the WAL failover docs

  • Update v24.1 alpha release notes to link to the WAL failover docs

@rmloveland rmloveland marked this pull request as draft May 1, 2024 15:41

netlify bot commented May 1, 2024

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit 094dd7f
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/66464b647ffaa000081bbc06

netlify bot commented May 1, 2024

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit 094dd7f
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-api-docs/deploys/66464b643600430008f637a4

netlify bot commented May 1, 2024

Netlify Preview

Name Link
🔨 Latest commit 094dd7f
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-docs/deploys/66464b648dbce800093953e1
😎 Deploy Preview https://deploy-preview-18511--cockroachdb-docs.netlify.app

@rmloveland rmloveland force-pushed the 20240501-DOC-9709-wal-failover branch 2 times, most recently from acee6e1 to 0ac8ab3 Compare May 7, 2024 15:59
@rmloveland rmloveland marked this pull request as ready for review May 7, 2024 15:59
@rmloveland
Contributor Author

Hi folks, I added each of you to review for the following reasons/areas, but please feel free to comment on anything you see that is missing/incorrect/etc as well:

Member

@abarganier abarganier left a comment

Just one question for @jbowens re: whether we should advertise the async buffering option for file sinks in our documentation.

cc @kevin-v-ngo as well.

src/current/v24.1/configure-logs.md Outdated Show resolved Hide resolved
src/current/v24.1/cockroach-start.md Show resolved Hide resolved

@jbowens jbowens left a comment

:lgtm:!

src/current/v24.1/cockroach-start.md Outdated Show resolved Hide resolved
src/current/v24.1/configure-logs.md Outdated Show resolved Hide resolved
Contributor

@mwang1026 mwang1026 left a comment

A few minor edit suggestions, notably a few more places where I think we should mention that the feature is in Preview.

I think we should also open a separate PR covering how to monitor for failover (the metrics to watch, how to inspect them, etc.). Thoughts?

src/current/v24.1/cockroach-start.md Outdated Show resolved Hide resolved
src/current/_includes/releases/v24.1/v24.1.0-alpha.4.md Outdated Show resolved Hide resolved
src/current/v24.1/cockroach-start.md Outdated Show resolved Hide resolved
src/current/v24.1/cockroach-start.md Outdated Show resolved Hide resolved
src/current/v24.1/cockroach-start.md Outdated Show resolved Hide resolved
@rmloveland
Contributor Author

> A few minor edit suggestions -- notably a few more places I think we should mention that feature is in PREVIEW

Updated in the places you mentioned. Do we also want to add this to the list of Features in Preview for v24.1? My assumption is yes, but I figured I'd ask while you're here.

> I think we should also open a separate PR for how to monitor for failover -- the metrics to watch, how to inspect them, etc. -- thoughts?

That makes sense, I can make a followup PR. @jbowens, I found the following metrics on the custom chart debug page of a 24.1 RC cluster. Which ones do you think make sense to monitor, and what values should one alert on? From a quick look, I'd guess switch.count could be a starting point, followed by the durations, but I'm just guessing :-)

  • storage.wal.failover.primary.duration
  • storage.wal.failover.secondary.duration
  • storage.wal.failover.switch.count
  • storage.wal.failover.write_and_sync.latency-avg
  • storage.wal.failover.write_and_sync.latency-count
  • storage.wal.failover.write_and_sync.latency-max
  • storage.wal.failover.write_and_sync.latency-sum
  • storage.wal.failover.write_and_sync.latency-p50
  • storage.wal.failover.write_and_sync.latency-p75
  • storage.wal.failover.write_and_sync.latency-p90
  • storage.wal.failover.write_and_sync.latency-p99
  • storage.wal.failover.write_and_sync.latency-p99.9
  • storage.wal.failover.write_and_sync.latency-p99.99
  • storage.wal.failover.write_and_sync.latency-p99.999

@jbowens
jbowens commented May 13, 2024

> Which ones do you think make sense to monitor, and what values should one alert on? From a quick look, I'd guess switch.count could be a starting point, followed by the durations, but I'm just guessing :-)

Yeah, I think it makes sense to document those first three metrics:

storage.wal.failover.primary.duration
storage.wal.failover.secondary.duration
storage.wal.failover.switch.count

The storage.wal.failover.secondary.duration is probably the most interesting. Customers will generally expect this to be zero unless there's a failover. Then they might care about how long it remains non-zero, because that gives an indication of the health of the primary.
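
To make the monitoring idea above concrete, here is a minimal sketch of how one might summarize samples of storage.wal.failover.secondary.duration. It is an illustration only, not something from the docs or the CockroachDB codebase: the `(timestamp, value)` sample format, the `SecondaryDurationSummary` type, and the `summarize_secondary_duration` helper are all hypothetical, and it assumes the metric has already been scraped into a plain Python list by some other tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# One sample of the metric: (unix_timestamp_seconds, metric_value).
# This shape is an assumption made for illustration.
Sample = Tuple[float, float]


@dataclass
class SecondaryDurationSummary:
    ever_nonzero: bool                 # True if any sample was above zero
    first_nonzero_at: Optional[float]  # timestamp of the first non-zero sample
    still_increasing: bool             # True if the latest sample exceeds the previous one


def summarize_secondary_duration(samples: List[Sample]) -> SecondaryDurationSummary:
    """Summarize samples of storage.wal.failover.secondary.duration (hypothetical helper)."""
    ordered = sorted(samples)  # tuples sort by timestamp first
    # The value is expected to be zero unless a failover has happened.
    first_nonzero_at = next((t for t, v in ordered if v > 0), None)
    # If the value is still growing, writes may still be going to the secondary.
    still_increasing = len(ordered) >= 2 and ordered[-1][1] > ordered[-2][1]
    return SecondaryDurationSummary(
        ever_nonzero=first_nonzero_at is not None,
        first_nonzero_at=first_nonzero_at,
        still_increasing=still_increasing,
    )


if __name__ == "__main__":
    print(summarize_secondary_duration([(0, 0), (10, 0), (20, 5), (30, 9)]))
```

An alert could then fire when `ever_nonzero` is true and escalate while `still_increasing` stays true. Suitable thresholds are left open here, since the thread does not specify any.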

@rmloveland
Contributor Author

> Yeah, I think it makes sense to document those first three metrics:
>
> storage.wal.failover.primary.duration
> storage.wal.failover.secondary.duration
> storage.wal.failover.switch.count
>
> The storage.wal.failover.secondary.duration is probably the most interesting. Customers will generally expect this to be zero unless there's a failover. Then they might care about how long it remains non-zero, because that gives an indication of the health of the primary.

Thanks @jbowens - I've filed https://cockroachlabs.atlassian.net/browse/DOC-10268 and will do that as a followup after we get this PR in

@mwang1026 are you good with this given the recent updates based on your feedback? The remaining open question is whether you also want a blurb in https://www.cockroachlabs.com/docs/v24.1/cockroachdb-feature-availability.html#features-in-preview or would rather this feature not show up there. I believe our practice is to list it there too, but maybe you don't want it there, idk

@rmloveland
Contributor Author

@mwang1026 I went ahead and added WAL failover to the list of preview features since AFAICT we do that for everything else in Preview

Let me know if you're good with the other changes and I'll send this along for docs team review so I can get it merged ASAP

Thanks!

Contributor

@mwang1026 mwang1026 left a comment

LGTM. Fine to add to list of Preview features. I couldn't unsee that there are some features on that list that are GA in 24.1 but let's let those docs roll in :)

@rmloveland
Contributor Author

@florence-crl this is RFAL from a docs POV now

in terms of sequencing, this should go in first, then #18548

Contributor

@florence-crl florence-crl left a comment

Thanks for writing this up! Let me know if you have questions on my suggestions.

src/current/_includes/releases/v24.1/v24.1.0-alpha.4.md Outdated Show resolved Hide resolved
src/current/v24.1/cockroach-start.md Outdated Show resolved Hide resolved
src/current/v24.1/cockroach-start.md Outdated Show resolved Hide resolved
src/current/v24.1/cockroach-start.md Outdated Show resolved Hide resolved
src/current/v24.1/cockroach-start.md Show resolved Hide resolved
src/current/v24.1/cockroach-start.md Outdated Show resolved Hide resolved
src/current/v24.1/cockroach-start.md Show resolved Hide resolved
src/current/v24.1/cockroach-start.md Outdated Show resolved Hide resolved
src/current/v24.1/cockroachdb-feature-availability.md Outdated Show resolved Hide resolved
src/current/v24.1/monitoring-and-alerting.md Show resolved Hide resolved
Contributor

@florence-crl florence-crl left a comment

header not aligned properly

src/current/v24.1/monitoring-and-alerting.md Outdated Show resolved Hide resolved
@rmloveland
Contributor Author

@florence-crl thanks for the helpful review. I've incorporated everything from your first pass AFAICT - PTAL!

Contributor

@florence-crl florence-crl left a comment

LGTM!

Fixes:

- DOC-9709
- DOC-9916
- DOC-9925
- DOC-10149

Summary of changes:

- Add a new section to `cockroach start` describing the WAL failover
  feature, how to enable/disable, and the related logging config changes
  that are needed if you enable the feature

- Add a new section to 'Monitoring and Alerting' docs describing the
  store status endpoint at `_status/stores`

- Update logging docs to add some anchor links so we can refer to
  specific config settings from the WAL failover docs

- Update v24.1 alpha release notes to link to the WAL failover docs
rmloveland and others added 13 commits May 16, 2024 14:02
Co-authored-by: Florence Morris <florence@cockroachlabs.com>
@rmloveland rmloveland force-pushed the 20240501-DOC-9709-wal-failover branch from c466735 to 094dd7f Compare May 16, 2024 18:07
@rmloveland rmloveland enabled auto-merge (squash) May 16, 2024 18:07
@rmloveland rmloveland merged commit 8d3764d into main May 16, 2024
7 checks passed
@rmloveland rmloveland deleted the 20240501-DOC-9709-wal-failover branch May 16, 2024 18:12

5 participants