Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vmagent could send remote-write requests in Snappy compression after the VM protocol handshake #3929

Closed
jiekun opened this issue May 13, 2024 · 9 comments · Fixed by #4009
Closed
Assignees

Comments

@jiekun
Copy link

jiekun commented May 13, 2024

What type of bug is this?

Unexpected error

What subsystems are affected?

Write Protocols

Minimal reproduce step

As commented in #3641:

vmagent could send remote-write requests in Snappy compression after the VM protocol handshake.

This occurs when:

  1. vmagent sends data in Snappy.
  2. vmagent buffers data on disk when the remote-write target is down for upgrading or other reasons.
  3. vmagent restarts after the remote-write target is up and performs the handshake.
  4. vmagent sends data in zstd.

In the fourth step, vmagent actually needs to send the buffered (on-disk) data (in Snappy) before sending new data.

I mentioned this issue in https://jiekun.dev/posts/vmagent-data-structures (思考题2). VictoriaMetrics will handle such situations with fallback logic. It will try Snappy when zstd encounters an error.

I'm not familiar with Rust, so I haven't read the code and I'm not sure if this is also implemented in GreptimeDB. This could potentially cause vmagent to receive an error response and retry. I don't know if this is a necessary feature for GreptimeDB since it's hard to reproduce in production :)


Edit: It seems the related source codes is:

let buf = Bytes::from(if is_zstd {

It could be fixed like this (generated by ChatGPT):

let buf = match is_zstd {
    true => {
        match zstd_decompress(&body[..]) {
            Ok(result) => Bytes::from(result),
            Err(_) => snappy_decompress(&body[..]).map_err(|_| "Both decompression methods failed")?,
        }
    }
    false => {
        match snappy_decompress(&body[..]) {
            Ok(result) => Bytes::from(result),
            Err(_) => zstd_decompress(&body[..]).map_err(|_| "Both decompression methods failed")?,
        }
    }
};

What operating system did you use?

Ubuntu 22.04 x64

What version of GreptimeDB did you use?

v0.7.2

Notes

As @zyy17 said this is not a bug of GreptimeDB. It's actually caused by vmagent who failed to send compressed data that match the protocol flag in HTTP header. I believe we could have a simple discussion here to see if GreptimeDB should have fallback logic for those edge case.

@killme2008
Copy link
Contributor

@jiekun Thanks for sharing. Looks like it's a corner case of vmagent.

Your attached patch appears to work. It would be great if you could create a pull request to implement this feature.

@jiekun
Copy link
Author

jiekun commented May 13, 2024

Cool.

I would like to discuss with the VM team first to see if there is any plan to optimize it in the vmagent. I doubt this is not going to happen since it requires significant changes in compression and on-disk persistent procedures.

If it won't happen in a short period of time, I will try to raise a PR here for a temporary fix.

@zyy17
Copy link
Collaborator

zyy17 commented May 14, 2024

@sunng87
Copy link
Member

sunng87 commented May 17, 2024

I see. Being a stateless protocol, vmagent has no idea when it should reissue a handshake when peer (the database) restarts. I think this can be a design issue for vm, which should return an error code like content mismatch to indicate vmagent to reissue handshake.

@jiekun
Copy link
Author

jiekun commented May 17, 2024

return an error code like content mismatch to indicate vmagent to reissue handshake.

I think this won't work. The root cause is that vmagent won't know anything about the pending blocks in queue. It could be in Snappy and zstd. And vmagent will never check it before sending them out.

Let's assume vminsert or GreptimeDB can return an error code when protocol mismatch. vmagent could know this situation but cannot do anything to the data in queue, which mix data in Snappy (added to the queue before vmagent restart and Snappy is used) and zstd(added to the queue after vmagent restart).

The ultimate solution for this case is to mark each pending blocks(requests) with flag, to indicate the algo it used. It require modification of many things and data structure, and is less likely to happen.

@sunng87
Copy link
Member

sunng87 commented May 20, 2024

I think it makes sense for the server/database to fallback to zstd. If we ask the client to retry, there is a chance to run into dead loop when the file is corrupted.

By the way, @jiekun , are you running into this issue when using vmagent with greptimedb on your setup? or you just discovered this because you are familiar with vmagent?

@jiekun
Copy link
Author

jiekun commented May 20, 2024

you just discovered this because you are familiar with vmagent

I realized it when I saw the protocol support pull request.

@jiekun
Copy link
Author

jiekun commented May 21, 2024

I've talked to the maintainer of the VM and confirmed that there are no plans to fix it within vmagent at the moment.

My day job has been extremely busy in the past few weeks (and also in the upcoming weeks), so it has been challenging for me to write high-quality code (especially considering it's my first time working with Rust) and conduct tests.

If anyone plans to solve it for GreptimeDB, please feel free to proceed :)

@sunng87
Copy link
Member

sunng87 commented May 21, 2024

oh sure. Thank you for the report! I will be working on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants