Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transfer rocksdb-rs Repo to tikv org #166

Open
Little-Wallace opened this issue Mar 3, 2022 · 11 comments · May be fixed by #167
Open

Transfer rocksdb-rs Repo to tikv org #166

Little-Wallace opened this issue Mar 3, 2022 · 11 comments · May be fixed by #167

Comments

@Little-Wallace
Copy link
Contributor

Little-Wallace commented Mar 3, 2022

Background

To evolve towards the next generation database engine kernel, we reimplemented RocksDB using rust. This is our project address: https://github.com/rust-lib-project/rocksdb-rs. Of course, we will not implement all the functions of RocksDB. Our purpose is to create a more general KV data engine for TiKV services, not a transactional data engine for MyRocks services. In this project, we will eliminate most of the features that TiKV in RocksDB does not use, including transactions and two-phase commit code, in order to simplify our code as much as possible and make the project easier to maintain.
For the cloud ecosystem, asynchronous IO is essential for high-latency cloud disks. Fortunately, rust provides an asynchronous framework that is easier to use than C++, allowing us to easily write asynchronous IO code. Therefore, most of the interfaces of this project are provided as asynchronous method. Thanks to all interfaces supporting asynchrony, we can merge the compaction thread pool with other background thread pools in tikv to reduce thread switching and control CPU resources more precisely.
At present, rocksdb-rs still maintains similar interface methods and functions as rocksdb, but it adopts a low-coupling architecture, which can be easily refactored into a more friendly interface for TiKV in the future.

Roadmap

At present, the project has implemented the basic functions of rocksdb, including read and write queries and background merging. But it still lacks a lot of tests, and some key functions (you can see more details in https://github.com/rust-lib-project/rocksdb-rs/issues), I hope he can get help from more developers under the tikv organization and play a bigger role.

@BusyJay
Copy link
Member

BusyJay commented Mar 3, 2022

What's the relations between tikv/agatedb and rocksdb-rs? Can you describe more on the motivation of building a new rocksdb implementation instead of agatedb?

@Little-Wallace
Copy link
Contributor Author

Little-Wallace commented Mar 3, 2022

OK. I thought about creating my project based on agatedb before. But I found some critical issues preventing me from doing this.

  • agatedb does not support snapshot isolation. It means that we must do a lot of job to change TiKV.
  • agatedb does not implement the table format like rocksdb. It means that if we upgrade TiKV from rocksdb to agatedb we must overwrite all the data and it will cost a lot of time (maybe several hours) before TiKV could serve requests. And this also means a very high upgrade risk, because in the event of a data file corruption, users cannot easily fall back to an older version (from agatedb to rocksdb).
  • agatedb does not support asynchronous IO or asynchronous interface. It opens all files by mmap and does not support process manage block-cache just like rocksdb.

These issues reminds me that what I need is far away from agatedb. so I have to implement a new one. But I copy the code of memtable(skiplist) from agatedb and I think it is the only thing works for me...

@BusyJay
Copy link
Member

BusyJay commented Mar 3, 2022

agatedb does not support snapshot isolation.

agatedb is supposed to inherit all features badger supports, so it may support it at the moment, but it's supposed to support it in the end.

agatedb does not implement the table format like rocksdb

Indeed. This is the trade off between support compatibility and new features. But after we can divide LSM tree for regions, it may ease the pain for upgrading and downgrading. For example, using different engines for different regions.

agatedb does not support asynchronous IO or asynchronous interface.

I think this is similar to the first statement. Actually as explained the README of agatedb, the desire of writing a new engine in Rust is to explorer asynchronous support and unify thread menagements. It's part of its goal.

However, as one of the author of agatedb and TiKV maintainer, I personally happy to add this project as another experimental project in TiKV org to explorer further possibility as long as it's actively working on.

/cc @skyzh @zz-jason What's your opinions?

/cc @tikv/maintainers

@Little-Wallace
Copy link
Contributor Author

Little-Wallace commented Mar 3, 2022

Indeed. This is the trade off between support compatibility and new features. But after we can divide LSM tree for regions, it may ease the pain for upgrading and downgrading. For example, using different engines for different regions

I'm not against new data formats. I just think it's a better choice to support compatibility with the old format at first and we can upgrade the format step by step. As you mentioned, we can store most of regions in a old format and store some of them in a new format. For cloud environment, mmap is not supported by most shared storage and it means that you have to store data in memory. Obviously, we can not store all data in memory.

As of now, all interfaces of this project are synchronous, which means that if I want to use asynchronous interface, I need to rewrite almost all the code. It seems that no one cares the goal of agatedb....

I don't think the badger format is essential for cloud storage. But we can also implement the badger format in the rocksdb-rs. I think the most important thing is to support asynchronous IO so that the high latency of cloud disk (or S3 storage) won't block read thread.

@skyzh
Copy link
Member

skyzh commented Mar 3, 2022

rocksdb-rs is really a great project. From my perspective, it has nearly the same functionalities as agatedb, except some design goals. For example, agatedb has key-value separation and user-specified epoch built-in, while RocksDB and rocksdb-rs doesn't.

Firstly, I'd like to answer some questions regarding the agatedb project.

agatedb does not support snapshot isolation. It means that we must do a lot of job to change TiKV.

It indeed supports SSI. Every txn will create a snapshot on the current LSM tree, and agatedb should do conflict detection when writing (though this part is not implemented yet).

agatedb does not implement the table format like rocksdb.

That's true, and changing the SST format should be very simple. agatedb stores value pointer as a normal value in LSM, so any SST format should be okay for agatedb.

agatedb does not support asynchronous IO or asynchronous interface. It opens all files by mmap and does not support process manage block-cache just like rocksdb.

This is also true. mmap doesn't sound like a good way to run a storage engine.

I'm also happy if we could welcome a new experimental storage engine project to the TiKV organization, but I have some concerns regarding this transfer...

  • How will the development process go on in TiKV org? Looks like we need more developers working on rocksdb-rs, so as to get this project actively maintained and being used.
  • Does the code on main branch get adequate review? As you have already known, the complete agatedb source code is still on the develop branch, without being merged into master. When we was building yet another storage engine, we've found a lot of bugs in agatedb like [dev] dropPrefixes will cause dead loop agatedb#115 [dev] pending_write in txn cannot be read agatedb#114, which was not spotted by me at the time I was developing it. I think we should ensure there's no significant bug and get adequate review for rocksdb-rs.
  • How does rocksdb-rs perform compared with titan or rocksdb? There was a very simple benchmark tool for agatedb called agate_bench, and an internal doc about its performance comparison. I think developers might also be interested in benchmarking and using rocksdb-rs, and we should provide quick start for them to use rocksdb-rs in their libraries or doing benchmarks.
  • The name rocksdb-rs seems confusing at first sight. Some others might think of it as a Rust binding of RocksDB. What about having a new name for that before transferring, like tirocks or rustrocks?

@Little-Wallace
Copy link
Contributor Author

It indeed supports SSI. Every txn will create a snapshot on the current LSM tree, and agatedb should do conflict detection when writing (though this part is not implemented yet).

I do not think it is a good idea to support SSI in a kv engine. Because the transaction model may not be compatibility with the distributed transaction model of TiKV. So I suggest to implement a no transaction KV engine and then we can improvement it with developer of transaction developers.

How will the development process go on in TiKV org? Looks like we need more developers working on rocksdb-rs, so as to get this project actively maintained and being used.

Good point. I have finished the basic iterator and seek and write function. I need someone help me finished blockcache and compression and some other function. You'll find that this project has quite a few features done.

How does rocksdb-rs perform compared with titan or rocksdb? There was a very simple benchmark tool for agatedb called agate_bench, and an internal doc about its performance comparison. I think developers might also be interested in benchmarking and using rocksdb-rs, and we should provide quick start for them to use rocksdb-rs in their libraries or doing benchmarks.

I have benchmarked the write performance by single thread and I found that it can only reach 65% of RocksDB. Most of the write time comes from skiplist and the skiplist of rocksdb-rs is copied from agatedb....

The name rocksdb-rs seems confusing at first sight. Some others might think of it as a Rust binding of RocksDB. What about having a new name for that before transferring, like tirocks or rustrocks?

TiKV will name another project tirocks and maybe we can rethink another name.

@Little-Wallace
Copy link
Contributor Author

Does the code on main branch get adequate review? As you have already known, the complete agatedb source code is still on the develop branch, without being merged into master. When we was building yet another storage engine, we've found a lot of bugs in agatedb like tikv/agatedb#115 tikv/agatedb#114, which was not spotted by me at the time I was developing it. I think we should ensure there's no significant bug and get adequate review for rocksdb-rs.

This is why I don't want to introduce a complex transaction model in the storage engine, unless it is designed from the distributed transaction of TiDB as a whole. In the first step, we only need to design an engine with the same interface usage as rocksdb, and then we can support more functions according to TiDB's distributed transactions

@Little-Wallace
Copy link
Contributor Author

To make sure there's no significant bug , I will add more unit test for this project in the future development.

@skyzh
Copy link
Member

skyzh commented Mar 8, 2022

I have no objection (nor clear accept) for this proposal. Any ideas from other members?

@BusyJay
Copy link
Member

BusyJay commented Mar 8, 2022

I would like to also hear from @sunxiaoguang @zhangjinpeng1987

To make sure there's no significant bug , I will add more unit test for this project in the future development.

In addition, if it's going to be a serious project, changes needs to be reviewed and get at least two approvals before landing in its own master.

@Little-Wallace Little-Wallace linked a pull request Mar 16, 2022 that will close this issue
@Little-Wallace
Copy link
Contributor Author

@zhangjinpeng1987 @sunxiaoguang Any suggestion?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants