
add snapshot step in etcdutl migrate #17911

Open
siyuanfoundation opened this issue Apr 29, 2024 · 12 comments
Labels
area/etcdutl priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. type/feature

Comments

Contributor

siyuanfoundation commented Apr 29, 2024

What would you like to be added?

Currently during downgrade, if there is a WAL entry with fields from the higher version, the offline migration tool etcdutl migrate aborts because of the version check. This blocks the offline downgrade migration even if the higher-version entries have already been committed to the backend.

In contrast, in the online migration, after downgrade is enabled, ClusterVersionSet downgrades the cluster version and forces a snapshot, so that old WAL entries are compacted and do not block the migration.

We need to make the offline migration consistent with the online migration flow by adding a snapshot step before calling schema.Migrate.

Why is this needed?

To make offline downgrade consistent with online downgrade.

#13168 Milestone 3
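The proposal above amounts to: compact WAL entries that are already committed to the backend (index at or below the consistent index) before running the version check, mirroring what the online downgrade achieves via a forced snapshot. A minimal, self-contained Go sketch of that idea; walEntry, compactCommitted, and the version strings are illustrative stand-ins, not etcd's real types or APIs:

```go
package main

import "fmt"

// walEntry is a toy model of a WAL entry: its raft index and the lowest
// etcd version that understands its fields.
type walEntry struct {
	index      uint64
	minVersion string
}

// compactCommitted drops entries whose index is <= consistentIndex: their
// effects are already persisted in the backend, so the downgrade version
// check no longer needs to inspect them.
func compactCommitted(entries []walEntry, consistentIndex uint64) []walEntry {
	kept := make([]walEntry, 0, len(entries))
	for _, e := range entries {
		if e.index > consistentIndex {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	entries := []walEntry{
		{index: 34, minVersion: "3.6"}, // committed 3.6-only entry
		{index: 35, minVersion: "3.5"}, // uncommitted, 3.5-compatible
	}
	// With consistent index 34, the 3.6-only entry is compacted away and
	// the version check would only see the 3.5-compatible remainder.
	remaining := compactCommitted(entries, 34)
	fmt.Println(len(remaining), remaining[0].minVersion)
}
```

With such a step in place, the scenario in the issue body (a committed higher-version entry blocking `etcdutl migrate`) would no longer trip the version check.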

@siyuanfoundation
Contributor Author

cc @serathius @henrybear327

jmhbnz added area/etcdctl area/etcdutl priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed area/etcdctl labels May 2, 2024
Contributor

ah8ad3 commented May 6, 2024

If you think this task is suitable for me, I would like to give it a try.

@siyuanfoundation
Contributor Author

@ah8ad3 yes, it would be great if you could give it a try. You can ping me on Slack if you have any questions. Thank you!

Contributor

ah8ad3 commented May 11, 2024

Sorry, do we have any implementation for taking a snapshot without needing a server? I couldn't find any; everything I found needs an Endpoint to fetch the snapshot from a server.
If I'm right, we either need to create an offline snapshotter or use snapshot.SaveWithVersion.
Am I right? @siyuanfoundation @serathius

@siyuanfoundation
Contributor Author

snapshot.SaveWithVersion fetches the snapshot from a remote etcd server, and while the data is connected to an etcd server, new entries could be added, which is not desirable. So it would need to be an offline snapshotter, or just something to clean up the old WAL entries. @ah8ad3 can you dive into the details of EtcdServer.snapshot to see if that is possible?

Contributor

ah8ad3 commented May 13, 2024

Thanks for the answer. I'll try to understand the code and see what we can do. @siyuanfoundation

Contributor

ah8ad3 commented May 13, 2024

Also, can you please explain what is meant by

something to clean up the old WAL entries

Clean the version tags from the entries, or delete them?

Contributor

ah8ad3 commented May 13, 2024

I think I got the answer to my last question.
I dove into the code and found that we have a raft example directory that seems to do what we want here, but I need confirmation that I am on the right path. @siyuanfoundation @serathius

func (rc *raftNode) maybeTriggerSnapshot(applyDoneC <-chan struct{}) {
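For reference, the pattern in that example is: once enough entries have been applied since the last snapshot, take a snapshot and compact the log up to the applied index. A self-contained toy sketch of that logic; raftNode, snapCount, and the log representation here are simplified stand-ins, not the real raftexample types:

```go
package main

import "fmt"

// raftNode is a toy model: it tracks the applied index, the index of the
// last snapshot, and a "log" of retained entry indices.
type raftNode struct {
	appliedIndex  uint64
	snapshotIndex uint64
	snapCount     uint64
	log           []uint64
}

// maybeTriggerSnapshot snapshots and compacts once more than snapCount
// entries have been applied since the last snapshot.
func (rc *raftNode) maybeTriggerSnapshot() {
	if rc.appliedIndex-rc.snapshotIndex <= rc.snapCount {
		return
	}
	// "Snapshot" the applied state, then compact entries up to appliedIndex.
	kept := make([]uint64, 0, len(rc.log))
	for _, idx := range rc.log {
		if idx > rc.appliedIndex {
			kept = append(kept, idx)
		}
	}
	rc.log = kept
	rc.snapshotIndex = rc.appliedIndex
}

func main() {
	rc := &raftNode{appliedIndex: 12, snapshotIndex: 2, snapCount: 5,
		log: []uint64{10, 11, 12, 13}}
	rc.maybeTriggerSnapshot()
	fmt.Println(rc.snapshotIndex, len(rc.log))
}
```

Note the real raftexample additionally retains a few trailing entries so slow followers can catch up; that detail is dropped in this sketch.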

Contributor

ah8ad3 commented May 15, 2024

I ended up reading more about raft to understand the design; here is what stopped me from thinking about implementing an offline snapshotter.
[screenshot: raft snapshot design, 2024-05-15]

Based on this diagram, we can't decide offline which member to take the snapshot from; we should only get it from the online leader. So I think we should consider the other suggestion:

or just something to clean up the old WAL entries

which I'm looking into to see how to do.

Update:
I wasn't thinking about the migrate command when writing this; I was thinking about a whole-cluster snapshot. If we take the data-dir and only check that member, we don't care about raft anymore.

@siyuanfoundation
Contributor Author

I think you can do ReadConsistentIndex directly from the backend, and Compact (i.e. remove) any entries before the consistent index. I don't think we need to keep catch-up entries in the migrate process; is that correct? @serathius

Contributor

ah8ad3 commented May 19, 2024

Thank you @siyuanfoundation. It seems ReadConsistentIndex will get the applied index and will help, but I have a bit of trouble understanding and implementing the downgrade, which I ask about below. If you think this needs more consideration, we can bring it up at the community meeting or ask on Slack.

I tested the current behavior of etcdutl migrate and etcdctl downgrade. It seems:

  1. etcdctl downgrade 3.5 changes the version in the db.
    After running etcd-dump-db iterate-bucket default.etcd/member/snap/db cluster --decode=true:
key="downgrade", value="{\"target-version\":\"3.5.0\",\"enabled\":true}"
key="clusterVersion", value="3.5.0"

and forces a snapshot, which removes all entries from the WAL file. Output of etcd-dump-logs default.etcd:

Snapshot:
term=8 index=34 nodes=[8e9e05c52164694d] confstate={"voters":[10276657743932975437],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=8e9e05c52164694d clusterID=cdf818194e3a8c32 term=8 commitIndex=34 vote=8e9e05c52164694d
WAL entries: 0
term	     index	type	data

Entry types (Normal,ConfigChange) count is : 0
  2. Running etcdutl migrate --data-dir default.etcd --target-version 3.5 --force succeeds.
    After running etcd-dump-db iterate-bucket default.etcd/member/snap/db cluster --decode=true:
key="downgrade", value="{\"target-version\":\"\",\"enabled\":false}"
key="clusterVersion", value="3.6.0"

Note that clusterVersion does not change (it stays 3.6.0),

and the WAL, from etcd-dump-logs default.etcd:

Snapshot:
term=8 index=34 nodes=[8e9e05c52164694d] confstate={"voters":[10276657743932975437],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=8e9e05c52164694d clusterID=cdf818194e3a8c32 term=9 commitIndex=38 vote=8e9e05c52164694d
WAL entries: 4
lastIndex=38
term	     index	type	data
   9	        35	norm	
   9	        36	norm	header:<ID:7587878799003708674 > cluster_member_attr_set:<member_ID:10276657743932975437 member_attributes:<name:"default" client_urls:"http://localhost:2379" > > 
   9	        37	norm	header:<ID:7587878799003708677 > downgrade_info_set:<> 
   9	        38	norm	header:<ID:7587878799003708678 > cluster_version_set:<ver:"3.6.0" > 

Entry types (Normal,ConfigChange) count is : 4

It seems to me that etcdctl downgrade did a generally better job of downgrading than etcdutl migrate. We need to remove the cluster_version_set:<ver:"3.6.0" > entry, maybe with compaction. But are you sure that just compacting will be fine?

In addition, I suggest we remove the extra tools and fold all the utilities for db and WAL files (etcd-dump-logs, etcd-dump-db) into etcdutl. Having this many tools is a bit hard to use and confusing.

@siyuanfoundation
Contributor Author

Thank you @ah8ad3 for digging into this.
etcdctl downgrade works perfectly fine for the online downgrade; you need to start the etcdctl downgrade process before shutting down any node.
etcdutl migrate is meant to migrate the files offline: say you have the data files for 3.6 and you want to use them to start a 3.5 cluster. This issue is about making etcdutl migrate converge as much as possible with etcdctl downgrade.

When you run etcdutl migrate, it aborts if there are any WAL entries that are not compatible with the old version, as described in this issue.
When you run etcdutl migrate --force, it just changes the storage version and pretends the migration is complete. I think you also uncovered that "clusterVersion" should be changed too.

I think you make a really good point that there are too many tools. We can raise that in the community meeting.
