vmagent could send remote-write requests in Snappy compression after the VM protocol handshake #3929

jiekun · 2024-05-13T11:47:21Z

What type of bug is this?

Unexpected error

What subsystems are affected?

Write Protocols

Minimal reproduce step

As commented in #3641:

vmagent could send remote-write requests in Snappy compression after the VM protocol handshake.

This occurs when:

1. vmagent sends data in Snappy.
2. vmagent buffers data on disk when the remote-write target is down for upgrading or other reasons.
3. vmagent restarts after the remote-write target is up and performs the handshake.
4. vmagent sends data in zstd.

In the fourth step, vmagent actually needs to send the buffered (on-disk) data (in Snappy) before sending new data.
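The sequence above can be sketched as a small simulation. This is only an illustration of the ordering problem, not vmagent's actual code: the `Agent` type, its methods, and `simulate` are hypothetical names, and the queue here records the codec per block purely so the mismatch can be observed.

```rust
use std::collections::VecDeque;

/// Compression codecs involved in the scenario.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Codec {
    Snappy,
    Zstd,
}

/// Hypothetical model of the agent. The real queue is assumed NOT to store
/// the codec of each block; it is tracked here only to expose the mismatch.
struct Agent {
    negotiated: Codec,
    queue: VecDeque<Codec>,
}

impl Agent {
    /// Buffers one block compressed with the currently negotiated codec.
    fn enqueue(&mut self) {
        self.queue.push_back(self.negotiated);
    }

    /// Models a restart followed by a handshake that selects a new codec.
    /// Blocks already persisted in the queue are left untouched.
    fn restart_with_handshake(&mut self, codec: Codec) {
        self.negotiated = codec;
    }

    /// Sends all pending blocks in FIFO order and returns
    /// (codec announced to the server, codec actually used) pairs.
    fn flush(&mut self) -> Vec<(Codec, Codec)> {
        let announced = self.negotiated;
        self.queue.drain(..).map(|actual| (announced, actual)).collect()
    }
}

/// Replays steps 1 to 4 from the list above.
fn simulate() -> Vec<(Codec, Codec)> {
    let mut agent = Agent {
        negotiated: Codec::Snappy,
        queue: VecDeque::new(),
    };
    agent.enqueue(); // steps 1-2: a Snappy block is buffered while the target is down
    agent.restart_with_handshake(Codec::Zstd); // step 3: restart and handshake
    agent.enqueue(); // step 4: new data is compressed with zstd
    agent.flush() // the old Snappy block leaves the queue first
}
```

Calling `simulate()` yields the buffered block first, announced as zstd but actually compressed with Snappy, followed by a correctly labeled zstd block.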

I mentioned this issue in https://jiekun.dev/posts/vmagent-data-structures (Thought Question 2). VictoriaMetrics handles such situations with fallback logic: it tries Snappy when zstd decoding fails.

I'm not familiar with Rust, so I haven't read the code and I'm not sure if this is also implemented in GreptimeDB. This could potentially cause vmagent to receive an error response and retry. I don't know if this is a necessary feature for GreptimeDB since it's hard to reproduce in production :)

Edit: It seems the related source code is:

greptimedb/src/servers/src/http/prom_store.rs

Line 250 in 6ab3aeb

let buf = Bytes::from(if is_zstd {

It could be fixed like this (generated by ChatGPT):

// Try the codec indicated by the request first, then fall back to the other one.
let buf = if is_zstd {
    match zstd_decompress(&body[..]) {
        Ok(result) => Bytes::from(result),
        // Wrap the fallback result too, so both arms produce `Bytes`.
        Err(_) => Bytes::from(
            snappy_decompress(&body[..]).map_err(|_| "Both decompression methods failed")?,
        ),
    }
} else {
    match snappy_decompress(&body[..]) {
        Ok(result) => Bytes::from(result),
        Err(_) => Bytes::from(
            zstd_decompress(&body[..]).map_err(|_| "Both decompression methods failed")?,
        ),
    }
};
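For readers who want to try the control flow in isolation, here is a minimal, self-contained sketch of the same idea. It is not the GreptimeDB implementation: `decompress_with_fallback`, `toy_snappy`, and `toy_zstd` are hypothetical names, and the toy decoders (a one-byte tag followed by raw bytes) only stand in for real Snappy and zstd decoders.

```rust
/// Codec announced by the request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Codec {
    Snappy,
    Zstd,
}

/// Tries the announced codec first and the other codec second.
/// Returns the decoded bytes together with the codec that succeeded.
fn decompress_with_fallback<E>(
    body: &[u8],
    announced: Codec,
    snappy: impl Fn(&[u8]) -> Result<Vec<u8>, E>,
    zstd: impl Fn(&[u8]) -> Result<Vec<u8>, E>,
) -> Result<(Vec<u8>, Codec), String> {
    let order = match announced {
        Codec::Zstd => [Codec::Zstd, Codec::Snappy],
        Codec::Snappy => [Codec::Snappy, Codec::Zstd],
    };
    for codec in order {
        let attempt = match codec {
            Codec::Snappy => snappy(body),
            Codec::Zstd => zstd(body),
        };
        if let Ok(decoded) = attempt {
            return Ok((decoded, codec));
        }
    }
    // Both attempts failed: report an error once instead of asking for a retry.
    Err("both decompression methods failed".to_string())
}

/// Toy stand-in for a Snappy decoder: accepts payloads tagged with b'S'.
fn toy_snappy(body: &[u8]) -> Result<Vec<u8>, ()> {
    match body.split_first() {
        Some((&b'S', rest)) => Ok(rest.to_vec()),
        _ => Err(()),
    }
}

/// Toy stand-in for a zstd decoder: accepts payloads tagged with b'Z'.
fn toy_zstd(body: &[u8]) -> Result<Vec<u8>, ()> {
    match body.split_first() {
        Some((&b'Z', rest)) => Ok(rest.to_vec()),
        _ => Err(()),
    }
}
```

With these stand-ins, a body tagged as Snappy but announced as zstd is still decoded through the fallback path, while a body that neither decoder accepts produces a single error.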

What operating system did you use?

Ubuntu 22.04 x64

What version of GreptimeDB did you use?

v0.7.2

Notes

As @zyy17 said, this is not a bug in GreptimeDB. It is caused by vmagent sending compressed data that does not match the protocol flag in the HTTP header. I believe we could have a short discussion here about whether GreptimeDB should have fallback logic for this edge case.

killme2008 · 2024-05-13T12:13:10Z

@jiekun Thanks for sharing. Looks like it's a corner case of vmagent.

Your attached patch appears to work. It would be great if you could create a pull request to implement this feature.

jiekun · 2024-05-13T12:23:38Z

Cool.

I would like to discuss this with the VM team first to see if there is any plan to address it in vmagent. I suspect it is not going to happen, since it requires significant changes to the compression and on-disk persistence procedures.

If it won't happen in a short period of time, I will try to raise a PR here for a temporary fix.

zyy17 · 2024-05-14T02:30:10Z

Some related issue and PR in vmagent:

sunng87 · 2024-05-17T22:53:41Z

I see. Because the protocol is stateless, vmagent has no idea when it should reissue a handshake after the peer (the database) restarts. I think this may be a design issue for VM, which should return an error code such as content mismatch to tell vmagent to reissue the handshake.

jiekun · 2024-05-17T23:45:39Z

> return an error code like content mismatch to indicate vmagent to reissue handshake.

I think this won't work. The root cause is that vmagent knows nothing about the pending blocks in its queue. They could be in either Snappy or zstd, and vmagent never checks before sending them out.

Let's assume vminsert or GreptimeDB can return an error code on a protocol mismatch. vmagent would then know about the situation but could not do anything about the data in the queue, which mixes Snappy blocks (added before the vmagent restart, while Snappy was in use) and zstd blocks (added after the restart).

The ultimate solution for this case is to mark each pending block (request) with a flag indicating the algorithm it used. That requires modifying many components and data structures, so it is unlikely to happen.
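To make the idea of a per-block flag concrete, here is a minimal sketch. It does not describe vmagent's real on-disk format: `PendingBlock`, the one-byte tag values, and the `[tag][payload]` layout are all assumptions chosen for illustration.

```rust
/// Compression codec recorded for each pending block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Codec {
    Snappy,
    Zstd,
}

impl Codec {
    /// Hypothetical one-byte tag values.
    fn tag(self) -> u8 {
        match self {
            Codec::Snappy => 0,
            Codec::Zstd => 1,
        }
    }

    fn from_tag(tag: u8) -> Option<Codec> {
        match tag {
            0 => Some(Codec::Snappy),
            1 => Some(Codec::Zstd),
            _ => None,
        }
    }
}

/// A queued request that remembers which codec compressed its payload.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PendingBlock {
    codec: Codec,
    payload: Vec<u8>,
}

impl PendingBlock {
    /// Hypothetical on-disk layout: [1-byte codec tag][payload bytes].
    fn encode(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(1 + self.payload.len());
        out.push(self.codec.tag());
        out.extend_from_slice(&self.payload);
        out
    }

    /// Reads a block back; returns None for empty input or an unknown tag.
    fn decode(bytes: &[u8]) -> Option<PendingBlock> {
        let (&tag, payload) = bytes.split_first()?;
        Some(PendingBlock {
            codec: Codec::from_tag(tag)?,
            payload: payload.to_vec(),
        })
    }
}
```

With a tag like this, a sender could pick the request header per block after reading it back from disk instead of assuming the currently negotiated codec. The cost is the one mentioned above: every reader and writer of the queue would have to understand the new layout.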

sunng87 · 2024-05-20T20:46:45Z

I think it makes sense for the server/database to fall back to zstd. If we ask the client to retry, there is a chance of running into an infinite loop when the file is corrupted.

By the way, @jiekun, are you running into this issue when using vmagent with GreptimeDB in your setup? Or did you just discover it because you are familiar with vmagent?

jiekun · 2024-05-20T23:14:30Z

> you just discovered this because you are familiar with vmagent

I realized it when I saw the protocol support pull request.

jiekun · 2024-05-21T09:05:10Z

I've talked to the VM maintainer and confirmed that there are currently no plans to fix this within vmagent.

My day job has been extremely busy in the past few weeks (and also in the upcoming weeks), so it has been challenging for me to write high-quality code (especially considering it's my first time working with Rust) and conduct tests.

If anyone plans to solve it for GreptimeDB, please feel free to proceed :)

sunng87 · 2024-05-21T22:01:10Z

Oh, sure. Thank you for the report! I will work on this.

sunng87 mentioned this issue May 21, 2024

feat: add fallback logic for vmagent sending wrong content type #4009

Merged

sunng87 self-assigned this May 21, 2024

zyy17 closed this as completed in #4009 May 23, 2024
