Performance

After some discussion with Jae on the usability, it seems performance is a big concern. If every write takes around 1ms, that puts a serious upper limit on the speed of the consensus engine (especially since with the check/tx dichotomy, we need at least two writes (to cache, only one to disk) and likely two or more queries to handle any transaction).

As Jae notes: for CheckTx, a copy of IAVLTree doesn't need to be saved. During CheckTx it'll load inner nodes into the cache. The cache is shared w/ the AppendTx state IAVLTree, so during AppendTx we should save some time. There would only be 1 set of writes. Also, there's quite a bit of free time in between blocks as provided by Tendermint, during which CheckTx can run priming the cache, so hopefully this helps as well.

Jae: That said, I'm not sure exactly what the tx throughput would be during normal running times. I'm hoping that we can have e.g. 3 second blocks w/ say over a hundred txs per sec per block w/ 1 million items. That will get us through for some time, but that time is limited.

Ethan: I agree, and think this works now with goleveldb backing on most host machines. For public chains, maybe it is desired to push 1000 tx every 3 sec to a block, with a db size of 1 billion items. 10x the throughput with 1000x the data. That could be a long-term goal, and would scale to the cosmos and beyond.

Plan

For any goal, we need some clear steps.

Cleanup code, and write some more benchmark cases to capture "realistic" usage
Run tests on various hardware to see the best performing backing stores
Do profiling on the best performance to see if there are any easy performance gains
(Possibly) Write another implementation of merkle.Tree to improve all the memory overhead, consider CPU cache, etc....
(Possibly) Write another backend datastore to persist the tree in a more efficient way

The rest of this document is the planned or completed actions for the above-listed steps.

Cleanup

Done in branch cleanup_deps:

Fixed up dependeny management (tmlibs/db etc in glide/vendor)
Updated Makefile (test, bench, get_deps)
Fixed broken code - looper.go and one benchmark didn't run

Benchmarks should be parameterized on:

storage implementation
initial data size
length of keys
length of data
block size (frequency of copy/hash...) Thus, we would see the same benchmark run against memdb with 100K items, goleveldb with 100K, leveldb with 100K, memdb with 10K, goleveldb with 10K...

Scenarios to run after db is set up.

Pure query time (known/hits, vs. random/misses)
Write timing (known/updates, vs. random/inserts)
Delete timing (existing keys only)
TMSP Usage:
- For each block size:
  - 2x copy "last commit" -> check and real
  - repeat for each tx:
    - (50% update + 50% insert?)
    - query + insert/update in check
    - query + insert/update in real
  - get hash
  - save real
  - real -> "last commit"

Benchmarks

After writing the benchmarks, we can run them under various environments and store the results under benchmarks directory. Some useful environments to test:

Dev machines
Digital ocean small/large machine
Various AWS setups

Please run the benchmark on more machines and add the result. Just type: make record in the directory and wait a (long) while (with little other load on the machine).

This will require also a quick setup script to install go and run tests in these environments. Maybe some scripts even. Also, this will produce a lot of files and we may have to graph them to see something useful...

But for starting, my laptop, and one digital ocean and one aws server should be sufficient. At least to find the winner, before profiling.

Profiling

Once we figure out which current implementation looks fastest, let's profile it to make it even faster. It is great to optimize the memdb code to really speed up the hashing and tree-building logic. And then focus on the backend implementation to optimize the disk storage, which will be the next major pain point.

Some guides:

Profiling benchmarks locally
On optimizing memory
Profiling running programs
Dave Chenny's profiler pkg

Some ideas for speedups:

Speedup SHA256 100x on ARM
Faster SHA256 golang implementation
Data structure alignment
Slice alignment
Tool to analyze your structs

Tree Re-implementation

If we want to copy lots of objects, it becomes better to think of using memcpy on large (eg. 4-16KB) buffers than copying individual structs. We also could allocate arrays of structs and align them to remove a lot of memory management and gc overhead. That means going down to some C-level coding...

Some links for thought:

Array representation of a binary tree
Memcpy buffer size timing
Calling memcpy from go
Unsafe docs
...and how to use it
Or maybe just plain copy...

Backend implementation

Storing each link in the tree in leveldb treats each node as an isolated item. Since we know some usage patterns (when a parent is hit, very likely one child will be hit), we could try to organize the memory and disk location of the nodes ourselves to make it more efficient. Or course, this could be a long, slippery slope.

Inspired by the Array representation link above, we could consider other layouts for the nodes. For example, rather than store them alone, or the entire tree in one big array, the nodes could be placed in groups of 15 based on the parent (parent and 3 generations of children). Then we have 4 levels before jumping to another location. Maybe we just store this larger chunk as one leveldb location, or really try to do the mmap ourselves...

In any case, assuming around 100 bytes for one non-leaf node (3 sha hashes, plus prefix, plus other data), 15 nodes would be a little less than 2K, maybe even go one more level to 31 nodes and 3-4KB, where we could take best advantage of the memory/disk page size.

Some links for thought:

Memory mapped files

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERFORMANCE.md

PERFORMANCE.md

Performance

Plan

Cleanup

Benchmarks

Profiling

Tree Re-implementation

Backend implementation

Files

PERFORMANCE.md

Latest commit

History

PERFORMANCE.md

File metadata and controls

Performance

Plan

Cleanup

Benchmarks

Profiling

Tree Re-implementation

Backend implementation