
debug write() with dump() #579

Closed
go-while opened this issue Oct 18, 2023 · 25 comments

@go-while

go-while commented Oct 18, 2023

Hello World!

I'm debugging writes to bboltDB, but the dump() function that exists in node.go is commented out in write(), and even if enabled it won't help much: it has bugs and does not work...

Maybe someone finds this snippet helpful.

Should I create a pull request?

An updated version lives here: https://github.com/go-while/nntp-history/blob/main/_DEBUGS_bbolt_node.go

// edit, days later
I dumped my debugs into this tree:
https://github.com/go-while/bbolt/tree/debugs

/*
 *  The dump() function can be used in bbolt's node.go to debug writes:
 *  add n.dump() at the end of the write() function in node.go.
 *  (It needs "log" and "time" in node.go's imports.)
 */
// mod@date: 2023-10-18_12:00
var print_header bool = false
var printSpilled bool = true
var printUnbalanced bool = true
var print_isLeaf bool = false
var print_noLeaf bool = true
var print_item bool = false
var print_dumped bool = false
var debugsleep0 = 0 * time.Millisecond // sleeps after printing: header
var debugsleep1 = 1000 * time.Millisecond // sleeps after printing: if spilled or unbalanced
var debugsleep2 = 0 * time.Millisecond // sleeps if is Leaf
var debugsleep3 = 1000 * time.Millisecond // sleeps if no Leaf

// dump writes the contents of the node to STDERR for debugging purposes.
func (n *node) dump() {
	// prints node header.
	var typ = "branch"
	if n.isLeaf {
		typ = "leaf"
	}
	now := time.Now().UnixNano()

	if print_header {
		log.Printf("NEW id=%d\n [NODE %d {type=%s inodes=%d}]\n n='%#v'", now, n.pgid, typ, len(n.inodes), n)
		time.Sleep(debugsleep0)
	}

	if printSpilled && n.spilled {
		log.Printf("!!! id=%d spil=true [NODE %d {type=%s inodes=%d}]", now, n.pgid, typ, len(n.inodes))
		time.Sleep(debugsleep1)
	} else if printUnbalanced && n.unbalanced {
		log.Printf("!!! id=%d unba=true [NODE %d {type=%s inodes=%d}]", now, n.pgid, typ, len(n.inodes))
		time.Sleep(debugsleep1)
	}

	for _, item := range n.inodes {
		if n.isLeaf {
			if print_isLeaf {
				if print_item {
					log.Printf("+++ id=%d write: isLeaf=1 key='%08x' item='%#v' pgid=%d key='%s' v='%s'", now, item.Key, item, n.pgid, string(item.Key()), string(item.Value()))
				} else {
					log.Printf("+++ id=%d write: isLeaf=1 pgid=%d buk=%s key='%s' v='%s'", now, n.pgid, "buk?", string(item.Key()), string(item.Value()))
				}
				time.Sleep(debugsleep2)
			}
		} else {
			if print_noLeaf {
				if print_item {
					log.Printf("+++ id=%d write: isLeaf=0 key='%08x' item='%#v' pgid=%d key='%s' v='%s'", now, item.Key, item, n.pgid, string(item.Key()), string(item.Value()))
				} else {
					log.Printf("+++ id=%d write: isLeaf=0 pgid=%d buk=%s key='%s' v='%s'", now, n.pgid, "buk?", string(item.Key()), string(item.Value()))
				}
				time.Sleep(debugsleep3)
			}

			//log.Printf("+B %08x -> pgid=%d", trunc(item.key, 4), item.pgid)
		}
	}
	if print_dumped {
		log.Printf("END id=%d dumped", now)
	}
} // end func dump
@go-while
Author

go-while commented Oct 18, 2023

I need some help.

I'm working with 16 dbs at a time and I'm still trying to find out what my problem is.
The test app linked in my first post goes into heavy writes after some million inserts.
In zfs iostat -v 1 I can see it flushing normally, a few MB/s with the usual zfs flush every 5s,
up to some point where it enters what I call a full-write state:
the NVMes are exhausted, writing at their limits, but the dbsize is not growing a single byte?
And it does not stop writing at full power... load goes up to 20+ and the performance of the machine degrades badly.

  • we store offsets from a history.log in the databases.
  • the history.log contains hashes of messages plus metadata, one per line.
  • the 1st char of the hash selects the bboltDB: [0-9a-f].
  • chars 2,3,4 can be used as root buckets (length chosen once at DB creation time).
  • the first N chars of the remainder of the hash are used to create sub buckets; the length can vary (1-5, chosen at creation).
  • the remainder of the hash (after cutting off the sub-bucket key) is then cut to a const "KeyLen" (4-...) and used as the key in the root.sub bucket to store offsets ([]int64 stored as a "hex,hex,..," string).
    So the keys in a bucket are quite short, about 4-12 chars in my tests, with a wide distribution over all these sub buckets (see the sketch after this list).
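
A rough sketch of that splitting, with the lengths as illustrative constants (the real app chooses them at DB creation time):

const (
	rootBucketLen = 3 // chars 2,3,4 of the hash
	subBucketLen  = 2 // first N chars of the remainder (1-5)
	keyLen        = 4 // const "KeyLen" (4-...)
)

// shard maps a hash like "abcdef012345..." onto db / root bucket / sub bucket / key.
func shard(hash string) (db, root, sub, key string) {
	db = hash[:1]                    // [0-9a-f] selects one of the 16 bboltDBs
	root = hash[1 : 1+rootBucketLen] // root bucket
	rest := hash[1+rootBucketLen:]
	sub = rest[:subBucketLen]                      // sub bucket
	key = rest[subBucketLen : subBucketLen+keyLen] // key in root.sub
	return
}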

Can anyone shed some light on the const MinKeysPerPage in 'internal/common/page.go'?
It defaults to 2.
I've done some tests with a much higher value (MinKeysPerPage set to 2048) and this heavy-write pattern does not happen...
I don't understand why it is 2.
Or maybe my problem is somewhere else, still not sure.

Does anyone know how MinKeysPerPage and PageSize affect the write behavior?

I have far too many numbers to play with now.

If you want to dive into it, here is a test app with lots of options:
https://github.com/go-while/nntp-history/tree/main/test

That said, I'm impressed with the write performance: when it writes normally it hits 100K+ tx/s with a 4-core limit and very little write bandwidth used to update the databases.

I'm using the latest GitHub commits now, not a release; it happens with v1.3.7 too.

Thanks in advance for any reply!

@tjungblu
Contributor

Lots of things to unpack; it sounds very similar to #401 to a degree.

Can you create a CPU profile with pprof for us, please? Maybe that sheds a bit more light on your load spikes.
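
Something like this wrapped around the test app's main is enough to capture one (runWorkload stands in for the actual insert/search loop; inspect the result with go tool pprof cpu.prof):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile() // flushes the profile on exit

	runWorkload() // placeholder for the bbolt workload
}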

@go-while
Author

go-while commented Oct 18, 2023

I'm creating one now.

with these values:
16 dbs (always const)
256 root buckets and 256 sub buckets
4 KB pagesize
untouched bbolt from GitHub
RootBucketFillPercent = 0.3
SubBucketFillPercent = 0.3
not sure about BucketFillPercentage, still trying to understand how it all fits together.
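
For reference, the knob bbolt itself exposes here is the per-bucket FillPercent field; a minimal sketch on an opened *bbolt.DB with made-up bucket names (note that FillPercent is not persisted, so it has to be set on the bucket handle in every transaction):

err := db.Update(func(tx *bbolt.Tx) error {
	root, err := tx.CreateBucketIfNotExists([]byte("ab")) // hypothetical root bucket
	if err != nil {
		return err
	}
	root.FillPercent = 0.3 // split pages earlier, leaving more headroom per page

	sub, err := root.CreateBucketIfNotExists([]byte("cd")) // hypothetical sub bucket
	if err != nil {
		return err
	}
	sub.FillPercent = 0.3
	return sub.Put([]byte("key1"), []byte("offset1"))
})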

inserting 50M objects, midway through now.
At totalTX=48376845 load goes up (Load average: 14.12 9.98 6.48),
zfs is writing the NVMes to death,
and speed drops fast.
...

load ?? ... 2023/10/18 17:31:41 BoltSpeed: 83784.84 tx/s ( did=837846 in 10.0 sec ) totalTX=2818893 otx=16
load ?? ... 2023/10/18 17:36:51 BoltSpeed: 77253.43 tx/s ( did=772431 in 10.0 sec ) totalTX=26857757 otx=17
load ?? ... 2023/10/18 17:39:41 BoltSpeed: 73491.27 tx/s ( did=734915 in 10.0 sec ) totalTX=38529775 otx=16
load ?? ... 2023/10/18 17:42:01 BoltSpeed: 68355.55 tx/s ( did=683556 in 10.0 sec ) totalTX=47210734 otx=17
load ?? ... 2023/10/18 17:43:41 BoltSpeed: 54298.25 tx/s ( did=542779 in 10.0 sec ) totalTX=52968505 otx=17
load ?? ... 2023/10/18 17:44:41 BoltSpeed: 30643.21 tx/s ( did=306420 in 10.0 sec ) totalTX=55776824 otx=17
did not watch load before...
load 21 ... 2023/10/18 17:47:11 BoltSpeed: 24353.59 tx/s ( did=243536 in 10.0 sec ) totalTX=61230020 otx=16
load 24 ... 2023/10/18 17:48:41 BoltSpeed: 19442.77 tx/s ( did=194427 in 10.0 sec ) totalTX=63462095 otx=17
load 28 ... 2023/10/18 17:49:41 BoltSpeed: 17559.13 tx/s ( did=175598 in 10.0 sec ) totalTX=64653220 otx=16
load 26 ... 2023/10/18 17:50:51 BoltSpeed: 27033.34 tx/s ( did=291902 in 10.8 sec ) totalTX=66440818 otx=16
load 28 ... 2023/10/18 17:53:51 BoltSpeed: 26905.38 tx/s ( did=269060 in 10.0 sec ) totalTX=70524504 otx=16
load 29 ... 2023/10/18 17:57:31 BoltSpeed: 14351.21 tx/s ( did=143435 in 10.0 sec ) totalTX=75259892 otx=16
load 30 ... 2023/10/18 18:00:31 BoltSpeed: 13611.78 tx/s ( did=136188 in 10.0 sec ) totalTX=78494672 otx=16
load 29 ... 2023/10/18 18:02:31 BoltSpeed: 19562.36 tx/s ( did=192527 in 9.8 sec ) totalTX=80563669 otx=16
load 28 ... 2023/10/18 18:04:51 BoltSpeed: 21886.75 tx/s ( did=220217 in 10.1 sec ) totalTX=83815074 otx=16
load 28 ... 2023/10/18 18:07:31 BoltSpeed: 14359.01 tx/s ( did=143734 in 10.0 sec ) totalTX=86675446 otx=16
load 29 ... 2023/10/18 18:09:31 BoltSpeed: 21194.12 tx/s ( did=212162 in 10.0 sec ) totalTX=89247666 otx=16
load 29 ... 2023/10/18 18:13:11 BoltSpeed: 11818.67 tx/s ( did=118094 in 10.0 sec ) totalTX=93515395 otx=16

now the process is getting weird: CPU usage drops, load rises.
And I see warnings about half-full queues which didn't show up until now in this run (I've seen them before when things go to hell).

load 29 ...2023/10/18 18:14:31 WARN boltBucketPutBatch [c|f2] Q=157/Qcap=256 wCBBS=256 forcing flush!
load 29 ... 2023/10/18 18:16:21 BoltSpeed: 24200.96 tx/s ( did=235273 in 9.7 sec ) totalTX=96923776 otx=16

zfs won't stop writing

will wait for it to complete (100M tx, searches/inserts 50/50) and post the CPU profile

//edit
yes, I've seen #401; I have a ref to it in my code ... BBOLTDB.go:299

maybe it's just my fault, hitting bbolt too hard xD

@go-while
Author

cpu prof

log file

thank you!

@go-while
Author

what GPT says about the PEEK log from the profile:

Certainly, here's a summary of the functions that consumed the most CPU time, along with their line endings:

  1. github.com/go-while/nntp-history.(*L3CACHE).L3Cache_Thread.func2 (line ending: /tank0/efs/GO/src/github.com/go-while/nntp-history/test/src/github.com/go-while/nntp-history/L3CACHE.go:143)

  2. github.com/go-while/nntp-history.(*L1CACHE).L1Cache_Thread.func2 (line ending: /tank0/efs/GO/src/github.com/go-while/nntp-history/test/src/github.com/go-while/nntp-history/L1CACHE.go:210)

  3. github.com/go-while/nntp-history.(*L2CACHE).L2Cache_Thread.func2 (line ending: /tank0/efs/GO/src/github.com/go-while/nntp-history/test/src/github.com/go-while/nntp-history/L2CACHE.go:152)

  4. go.etcd.io/bbolt.(*Tx).write (line ending: /tank0/efs/GO/src/github.com/go-while/nntp-history/test/src/go.etcd.io/bbolt/tx.go:494)

  5. runtime.add (line ending: /tank0/efs/GO/.bin/go1.21.2/src/runtime/stubs.go:18)

These functions are the top consumers of CPU time in your Go application, and optimizing them should lead to significant performance improvements.

@go-while
Author

go-while/nntp-history/L3CACHE.go:143

				//getexpired:
				for key, item := range ptr.cache {

go-while/nntp-history/L1CACHE.go:210

				//getexpired:
				for hash, item := range ptr.cache {

go-while/nntp-history/L2CACHE.go:152

				//getexpired:
				for offset, item := range ptr.cache {

this wasn't very helpful for me 👯
there are lots of items in the different caches [0-9a-f] and there is a lot of work, but this should not affect the boltdbs' writing in any way... or at least I don't know why it should.
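
In case it helps anyone reading later: one generic way to avoid ranging over the whole cache map on every expiry pass is to keep a time-ordered queue next to the map (a sketch, not the app's actual code):

type entry struct {
	key     string
	expires time.Time
}

type cache struct {
	mu    sync.Mutex
	items map[string]string
	queue []entry // in expiry order, since the TTL is constant
}

// expire pops only the entries that are due instead of scanning the whole map.
func (c *cache) expire(now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for len(c.queue) > 0 && c.queue[0].expires.Before(now) {
		delete(c.items, c.queue[0].key)
		c.queue = c.queue[1:]
	}
}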

@go-while
Author

go-while commented Oct 19, 2023

ok cool, all bad map iterations are gone and I found queue congestion, which should be fixed, but the problem with the writes is still happening. I will pull it into the repo after some testing.

I tested a lot of stuff these days and as far as I can tell, the pagesize alone does not solve the problem.
I played with all the variables I could find and I'm stuck with raising the MinKeysPerPage value.
Testing now with 64K pagesize and 'const MinKeysPerPage = 128'.
50M tx in and we hover at 80-90K tx/sec, which is good.
zfs is cool; I see it flushing at normal frequency, nothing bad. RAM is about 3.5G.
Load average: 6.72 7.39 8.04

I did not touch the code in my repo, so the code lines still match the CPU profile.
Maybe you see something; I'd be very happy!

good night.

// edit

a little slowdown towards the end:
2023/10/19 03:16:31 BoltSpeed: 69284.70 tx/s ( did=691821 in 10.0 sec ) totalTX=89421977 otx=16
Load average: 11.35 10.53 9.26
I can see zfs writing more frequently, with fewer pauses and a few seconds of continuous writes.

but the test is now finished and it has never been that fast 🥇

50M x 4 jobs = 50M searches, 50M inserts and 200M cache hits. Not bad 👯
2023/10/19 03:19:31 done=200000000 (took 1294 seconds) (closewait 6 seconds)
Alloc: 1979 MiB, TotalAlloc: 802648 MiB, Sys: 3336 MiB, NumGC: 541

tree/L1prioQue/test
updated cpu profile
updated cpu logfile

wow, I can't sleep now. Rescanning the db with the same values, dupe checking: 90k-100k tx/sec pure reads. 👯

@go-while
Author

go-while commented Oct 19, 2023

2023/10/19 07:12:50 1697692370252293163 split() pagesize='4096' nodes=2 n.inodes=6 common.MinKeysPerPage=2
2023/10/19 07:12:50 splitIndex threshold=3686
2023/10/19 07:12:50 1697692370252496266 splitTwo pagesize='4096' (n.inodes=6 common.MinKeysPerPage=2*2 || n.sizeLessThan=true) threshold=3686 splitIndex=6
2023/10/19 07:12:50 1697692370252502507 split() pagesize='4096' nodes=2 n.inodes=6 common.MinKeysPerPage=2
2023/10/19 07:12:50 splitIndex threshold=3686
2023/10/19 07:12:50 1697692370252702502 splitTwo pagesize='4096' (n.inodes=8 common.MinKeysPerPage=2*2 || n.sizeLessThan=true) threshold=3686 splitIndex=8
2023/10/19 07:12:50 1697692370252707171 split() pagesize='4096' nodes=2 n.inodes=8 common.MinKeysPerPage=2
2023/10/19 07:12:50 splitIndex threshold=3686
2023/10/19 07:12:50 1697692370252759513 splitTwo pagesize='4096' (n.inodes=6 common.MinKeysPerPage=2*2 || n.sizeLessThan=true) threshold=3686 splitIndex=6
2023/10/19 07:12:50 1697692370252764333 split() pagesize='4096' nodes=2 n.inodes=6 common.MinKeysPerPage=2
2023/10/19 07:12:50 splitIndex threshold=3686
2023/10/19 07:12:50 1697692370253169850 splitTwo pagesize='4096' (n.inodes=7 common.MinKeysPerPage=2*2 || n.sizeLessThan=true) threshold=3686 splitIndex=7
2023/10/19 07:12:50 1697692370253176809 split() pagesize='4096' nodes=2 n.inodes=7 common.MinKeysPerPage=2
2023/10/19 07:12:50 splitIndex threshold=3686
2023/10/19 07:12:50 1697692370253192470 splitTwo pagesize='4096' (n.inodes=5 common.MinKeysPerPage=2*2 || n.sizeLessThan=true) threshold=3686 splitIndex=5
2023/10/19 07:12:50 1697692370253196873 split() pagesize='4096' nodes=2 n.inodes=5 common.MinKeysPerPage=2
2023/10/19 07:12:50 splitIndex threshold=3686
2023/10/19 07:12:50 1697692370254264268 splitTwo pagesize='4096' (n.inodes=5 common.MinKeysPerPage=2*2 || n.sizeLessThan=true) threshold=3686 splitIndex=5
2023/10/19 07:12:50 1697692370254272712 split() pagesize='4096' nodes=2 n.inodes=5 common.MinKeysPerPage=2
2023/10/19 07:12:50 splitIndex threshold=3686
2023/10/19 07:12:50 1697692370256154835 splitTwo pagesize='4096' (n.inodes=6 common.MinKeysPerPage=2*2 || n.sizeLessThan=true) threshold=3686 splitIndex=6
2023/10/19 07:12:50 1697692370256163311 split() pagesize='4096' nodes=2 n.inodes=6 common.MinKeysPerPage=2
2023/10/19 07:12:50 splitIndex threshold=3686
2023/10/19 07:12:50 1697692370256188272 splitTwo pagesize='4096' (n.inodes=6 common.MinKeysPerPage=2*2 || n.sizeLessThan=true) threshold=3686 splitIndex=6
2023/10/19 07:12:50 1697692370256192864 split() pagesize='4096' nodes=2 n.inodes=6 common.MinKeysPerPage=2
2023/10/19 07:12:50 splitIndex threshold=3686
2023/10/19 07:12:50 1697692370256242848 splitTwo pagesize='4096' (n.inodes=7 common.MinKeysPerPage=2*2 || n.sizeLessThan=true) threshold=3686 splitIndex=7
2023/10/19 07:12:50 1697692370256249612 split() pagesize='4096' nodes=2 n.inodes=7 common.MinKeysPerPage=2
2023/10/19 07:12:50 splitIndex threshold=3686
2023/10/19 07:12:50 1697692370256670093 splitTwo pagesize='4096' (n.inodes=7 common.MinKeysPerPage=2*2 || n.sizeLessThan=true) threshold=3686 splitIndex=7

with the default 4K pagesize it constantly splits inodes, and I guess the heavy writes are coming from this?

there is this if clause in node.go splitTwo(). Can anyone explain why these values were chosen?

	if len(n.inodes) <= (common.MinKeysPerPage*2) || n.sizeLessThan(pageSize) {
		//log.Printf("%d !splitTwo pagesize='%d' (n.inodes=%d <= common.MinKeysPerPage=%d*2 || n.sizeLessThan=%t)", time.Now().UnixNano(), pageSize, len(n.inodes), common.MinKeysPerPage, n.sizeLessThan(pageSize))
		return n, nil
	}
// place in node.go splitTwo() before the return
	log.Printf("%d splitTwoB pagesize='%d' (n.inodes=%d common.MinKeysPerPage=%d*2 || n.sizeLessThan=%t) threshold=%d splitIndex=%d", time.Now().UnixNano(), pageSize, len(n.inodes), common.MinKeysPerPage, n.sizeLessThan(pageSize), threshold, splitIndex)
	return n, next
}

I modded it to this for now:

// splitTwo breaks up a node into two smaller nodes, if appropriate.
// This should only be called from the split() function.
func (n *node) splitTwo(pageSize uintptr) (*node, *node) {
	// Ignore the split if the page doesn't have at least enough nodes for
	// two pages or if the nodes can fit in a single page.
	if len(n.inodes) <= (common.MinKeysPerPage*2) || n.sizeLessThan(pageSize) {
		//log.Printf("%d !splitTwo pagesize='%d' (n.inodes=%d <= common.MinKeysPerPage=%d*2 || n.sizeLessThan=%t)", time.Now().UnixNano(), pageSize, len(n.inodes), common.MinKeysPerPage, n.sizeLessThan(pageSize))
		return n, nil
	}
	log.Printf("%d splitTwoA pagesize='%d' (n.inodes=%d common.MinKeysPerPage=%d*2 || n.sizeLessThan=%t)", time.Now().UnixNano(), pageSize, len(n.inodes), common.MinKeysPerPage, n.sizeLessThan(pageSize))

...
...

@tjungblu
Contributor

Thanks for all the information.

with the default 4K pagesize it constantly splits inodes and i guess the heavy writes are coming from this?

it certainly seems the splitting causes a lot of GC activity.

The calls to os.(*File).WriteAt don't look all that exceptional. Maybe ZFS is fragmented after that many writes and needs to reshuffle blocks? Could you maybe try with a boring ext4 partition and see if the issue persists?

@go-while
Author

I can't test on ext4 at the moment; no more hardware. I could set up a VM, but that would still be zfs on the host.
But I don't believe this is a fragmentation issue.
ZFS is on 4x 4TB NVMe (Samsung Pros) in a striped mirror (like RAID 10), and under heavy write load they run near their hardware limits, writing 500-800 MByte/sec each, continuously...
The NVMes can handle 10Gbit LAN reads/writes without problems and we have 90% free space.
Compression is enabled, which further reduces read/write bandwidth for compressible data; I see 2-3x compression ratios on the databases.
All reads are coming from RAM; I haven't seen any read load on zfs while testing, not even while fully re-scanning without writing. Peak write performance easily triples until the caches get full.
There is nothing else on the machine besides the OS and the binary, only me testing bbolt there. :/
The question for me is: where is the stuff going? With write loads like that from bbolt, one would expect the DBs to grow like crazy, but they don't grow at anywhere near that speed; size-wise they do almost nothing. Something is writing more than needed, I believe, but I'm still not sure where to look.
We're talking about less than 1GB of database size at the end of 50M inserts, but a constant write rate of, let's say, more than 1 GByte/sec... the DB should blow up... there is something in bbolt and I want to find out where. I've spent too much time on this funny little project now.

@go-while
Author

go-while commented Oct 19, 2023

I dumped my debugs into this tree:
https://github.com/go-while/bbolt/tree/debugs

With the default 4KB pagesize and a fresh db I can see this:

2023/10/19 16:11:05 1697724665147285003 splitTwo determine pagesize='4096' (n.inodes=50 common.MinKeysPerPage=2*2 || n.sizeLessThan=false)
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=0 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=89 elsize=107 sz=16
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=1 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=123
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=2 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=202
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=3 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=281
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=4 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=360
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=5 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=439
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=6 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=518
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=7 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=597
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=8 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=676
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=9 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=755
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=10 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=834
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=11 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=913
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=12 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=60 elsize=78 sz=992
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=13 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=60 elsize=78 sz=1070
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=14 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=1148
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=15 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=1227
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=16 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=1306
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=17 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=89 elsize=107 sz=1385
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=18 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=1492
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=19 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=89 elsize=107 sz=1571
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=20 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=1678
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=21 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=1757
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=22 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=1836
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=23 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=1915
2023/10/19 16:11:05 1697724665147285003 splitIndex forloop index=24 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=1994
2023/10/19 16:11:05 1697724665147285003 splitIndex break @ index=24 n.inodes=50 n.pageElementSize=16 keyLen=2 valLen=61 elsize=79 sz=1994 threshold=2048
...
...
...
...
2023/10/19 16:16:35 1697724995627077478 splitTwo determine pagesize='4096' (n.inodes=16 common.MinKeysPerPage=2*2 || n.sizeLessThan=false)
2023/10/19 16:16:35 1697724995627077478 splitIndex forloop index=0 n.inodes=16 n.pageElementSize=16 keyLen=2 valLen=124 elsize=142 sz=16
2023/10/19 16:16:35 1697724995627077478 splitIndex forloop index=1 n.inodes=16 n.pageElementSize=16 keyLen=2 valLen=215 elsize=233 sz=158
2023/10/19 16:16:35 1697724995627077478 splitIndex forloop index=2 n.inodes=16 n.pageElementSize=16 keyLen=2 valLen=154 elsize=172 sz=391
2023/10/19 16:16:35 1697724995627077478 splitIndex forloop index=3 n.inodes=16 n.pageElementSize=16 keyLen=2 valLen=401 elsize=419 sz=563
2023/10/19 16:16:35 1697724995627077478 splitIndex forloop index=4 n.inodes=16 n.pageElementSize=16 keyLen=2 valLen=185 elsize=203 sz=982
2023/10/19 16:16:35 1697724995627077478 splitIndex forloop index=5 n.inodes=16 n.pageElementSize=16 keyLen=2 valLen=276 elsize=294 sz=1185
2023/10/19 16:16:35 1697724995627077478 splitIndex forloop index=6 n.inodes=16 n.pageElementSize=16 keyLen=2 valLen=217 elsize=235 sz=1479
2023/10/19 16:16:35 1697724995627077478 splitIndex forloop index=7 n.inodes=16 n.pageElementSize=16 keyLen=2 valLen=156 elsize=174 sz=1714
2023/10/19 16:16:35 1697724995627077478 splitIndex forloop index=8 n.inodes=16 n.pageElementSize=16 keyLen=2 valLen=246 elsize=264 sz=1888
2023/10/19 16:16:35 1697724995627077478 splitIndex break @ index=8 n.inodes=16 n.pageElementSize=16 keyLen=2 valLen=246 elsize=264 sz=1888 threshold=2048
...
...
...
2023/10/19 16:24:00 1697725440019337071 splitTwo determine pagesize='4096' (n.inodes=6 common.MinKeysPerPage=2*2 || n.sizeLessThan=false)
2023/10/19 16:24:00 1697725440019337071 splitIndex forloop index=0 n.inodes=6 n.pageElementSize=16 keyLen=2 valLen=554 elsize=572 sz=16
2023/10/19 16:24:00 1697725440019337071 splitIndex forloop index=1 n.inodes=6 n.pageElementSize=16 keyLen=2 valLen=649 elsize=667 sz=588
2023/10/19 16:24:00 1697725440019337071 splitIndex forloop index=2 n.inodes=6 n.pageElementSize=16 keyLen=2 valLen=618 elsize=636 sz=1255
2023/10/19 16:24:00 1697725440019337071 splitIndex forloop index=3 n.inodes=6 n.pageElementSize=16 keyLen=2 valLen=894 elsize=912 sz=1891
2023/10/19 16:24:00 1697725440019337071 splitIndex break @ index=3 n.inodes=6 n.pageElementSize=16 keyLen=2 valLen=894 elsize=912 sz=1891 threshold=2048

over time "valLen=" grows and grows near threshold and the splitIndex gets smaller and smaller reaching to 1 digits and falling low as MinKeyLen.

I'm honest: i have no idea what I'm doing. maybe it's just normal how btrees work and grow, but i don't get it xD

i will fill one with 50M now and try to compact it afterwards and fill another 50M into to see what happens.

@go-while
Author

go-while commented Oct 19, 2023

with 32KB pagesize

2023/10/19 16:27:28 1697725648796881482 splitTwo determine pagesize='32768' (n.inodes=117 common.MinKeysPerPage=2*2 || n.sizeLessThan=false)
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=0 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=217 elsize=235 sz=16
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=1 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=186 elsize=204 sz=251
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=2 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=217 elsize=235 sz=455
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=3 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=308 elsize=326 sz=690
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=4 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=279 elsize=297 sz=1016
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=5 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=218 elsize=236 sz=1313
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=6 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=428 elsize=446 sz=1549
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=7 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=309 elsize=327 sz=1995
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=8 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=186 elsize=204 sz=2322
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=9 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=335 elsize=353 sz=2526
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=10 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=368 elsize=386 sz=2879
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=11 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=155 elsize=173 sz=3265
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=12 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=186 elsize=204 sz=3438
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=13 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=306 elsize=324 sz=3642
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=14 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=245 elsize=263 sz=3966
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=15 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=249 elsize=267 sz=4229
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=16 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=308 elsize=326 sz=4496
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=17 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=183 elsize=201 sz=4822
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=18 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=366 elsize=384 sz=5023
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=19 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=245 elsize=263 sz=5407
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=20 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=155 elsize=173 sz=5670
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=21 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=308 elsize=326 sz=5843
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=22 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=217 elsize=235 sz=6169
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=23 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=248 elsize=266 sz=6404
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=24 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=247 elsize=265 sz=6670
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=25 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=305 elsize=323 sz=6935
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=26 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=430 elsize=448 sz=7258
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=27 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=247 elsize=265 sz=7706
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=28 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=215 elsize=233 sz=7971
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=29 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=125 elsize=143 sz=8204
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=30 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=216 elsize=234 sz=8347
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=31 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=278 elsize=296 sz=8581
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=32 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=154 elsize=172 sz=8877
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=33 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=339 elsize=357 sz=9049
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=34 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=277 elsize=295 sz=9406
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=35 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=306 elsize=324 sz=9701
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=36 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=370 elsize=388 sz=10025
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=37 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=124 elsize=142 sz=10413
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=38 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=245 elsize=263 sz=10555
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=39 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=464 elsize=482 sz=10818
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=40 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=248 elsize=266 sz=11300
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=41 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=246 elsize=264 sz=11566
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=42 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=308 elsize=326 sz=11830
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=43 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=92 elsize=110 sz=12156
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=44 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=186 elsize=204 sz=12266
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=45 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=246 elsize=264 sz=12470
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=46 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=215 elsize=233 sz=12734
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=47 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=428 elsize=446 sz=12967
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=48 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=276 elsize=294 sz=13413
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=49 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=123 elsize=141 sz=13707
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=50 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=430 elsize=448 sz=13848
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=51 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=249 elsize=267 sz=14296
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=52 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=154 elsize=172 sz=14563
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=53 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=308 elsize=326 sz=14735
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=54 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=277 elsize=295 sz=15061
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=55 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=306 elsize=324 sz=15356
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=56 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=217 elsize=235 sz=15680
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=57 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=341 elsize=359 sz=15915
2023/10/19 16:27:28 1697725648796881482 splitIndex forloop index=58 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=215 elsize=233 sz=16274
2023/10/19 16:27:28 1697725648796881482 splitIndex break @ index=58 n.inodes=117 n.pageElementSize=16 keyLen=2 valLen=215 elsize=233 sz=16274 threshold=16384

I see far fewer splits, almost none up to 15M inserts (30M tx with searches).
Then it starts splitting again and won't stop, and write bandwidth goes up.

// edit: at 40M total tx it stopped the splitting war and write bandwidth is low again, almost non-existent. Normal zfs flush frequency.

@go-while
Author

never give up!
I removed sub/nested bucketing and will try with 256 root buckets in each of the 16 dbs to put key/vals in (int to hex with space separator).
keyLen is 8 and most of the stored values are not much longer.
Almost no duplicates per key, which means no multiple offsets per key.
Write bandwidth looks better too, much less than with nested buckets: more pauses between zfs flushes and lower peaks.
So I guess nested buckets don't behave like I thought they would.

now with bbolt at defaults everywhere and 4KB pagesize:
100M tx and the problem has not shown up yet.
Since the 50M mark you can see some longer zfs flushing with more bandwidth used, but not continuously; I still see frequent flush pauses.

I believe it works like this: pages get split when they reach the threshold, and we feed at a constant pace; each split turns one full page into two half-full ones, so at some point in the future we have to split double the number of pages when they fill up again, and repeat. So the split waves grow somehow exponentially?

key_add=49999923 key_app=77 fseeks=75
done=200000000 (took 1458 seconds)

@go-while
Author

I'd never looked at iotop before, but that's burning 🤣 Total DISK WRITE: 580.49 M/s
while zfs is not writing anything... flushing normally.

@go-while
Author

go-while commented Oct 19, 2023

when you don't know what your own vars can do 👯
typical PEBKAC. that's how we learn and evolve 🚀
There is a "BatchFlushEvery" variable in the test app; this was 2500ms the whole time.
I've now set it to batch-flush our requests to boltdb every 10s, and the writes and write bandwidth got much lower.

While writing this, a split is happening (128K pagesize I'm testing) at 35M tx in, but ZFS stayed cool; I see it flush for a few hundred milliseconds only. 😮

And it split again; then you see a burst of zfs write bandwidth, but everything stays in the normal range.

2023/10/19 19:50:18 BoltSpeed: 95592.68 tx/s ( did=955918 in 10.0 sec ) totalTX=18930332 otx=16
2023/10/19 19:50:28 WatchBolt: (inserted 45441/s) (+454419 inserted in 10s)
2023/10/19 19:50:28 WatchBolt: (batchins 391/s) (+3913 batchins in 10s) medBS=~109
2023/10/19 19:50:28 WatchBolt: (searches 46400/s) (+464000 searches in 10s)
2023/10/19 19:50:28 L1: [fex=0/set:9930119] [get=0/mux=9930119] [del:9439766/bat:9447857] [g/s:0/0] cached:490353 (~30647/char)
2023/10/19 19:50:28 L2: [fex=2/set:9930119] [get=0/mis=0] [del:9439990/bat:9702131] [g/s:0/0] cached:490133 (~30633/char)
2023/10/19 19:50:28 L3: [fex=0/set:9930119] [get=0/mis=0] [del:9439765/bat:9447856] [g/s:0/0] cached:490353 (~30647/char)
2023/10/19 19:50:28 BoltSpeed: 92928.93 tx/s ( did=930270 in 10.0 sec ) totalTX=19860602 otx=16
2023/10/19 19:52:50 1697737970250138442 splitTwo returned@ pagesize='131072' n.inodes=1995/2*2 threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970256381548 splitIndex break @ index=1995 n.inodes=4061 n.pageElementSize=16 keyLen=8 valLen=8 elsize=32 sz=65513 threshold=65536
2023/10/19 19:52:50 1697737970256381548 splitTwo newParent pagesize='131072' (n.inodes=4061 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970256381548 splitTwo returned@ pagesize='131072' n.inodes=1995/2*2 threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970275315118 splitIndex break @ index=1995 n.inodes=3998 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65534 threshold=65536
2023/10/19 19:52:50 1697737970275315118 splitTwo newParent pagesize='131072' (n.inodes=3998 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970275315118 splitTwo returned@ pagesize='131072' n.inodes=1995/2*2 threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970276073188 splitIndex break @ index=1994 n.inodes=4020 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65513 threshold=65536
2023/10/19 19:52:50 1697737970276073188 splitTwo newParent pagesize='131072' (n.inodes=4020 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1994
2023/10/19 19:52:50 1697737970276073188 splitTwo returned@ pagesize='131072' n.inodes=1994/2*2 threshold=65536 splitIndex=1994
2023/10/19 19:52:50 1697737970277572230 splitIndex break @ index=1995 n.inodes=4035 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65526 threshold=65536
2023/10/19 19:52:50 1697737970277572230 splitTwo newParent pagesize='131072' (n.inodes=4035 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970277572230 splitTwo returned@ pagesize='131072' n.inodes=1995/2*2 threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970321967503 splitIndex break @ index=1995 n.inodes=4017 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65524 threshold=65536
2023/10/19 19:52:50 1697737970321967503 splitTwo newParent pagesize='131072' (n.inodes=4017 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970321967503 splitTwo returned@ pagesize='131072' n.inodes=1995/2*2 threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970322768950 splitIndex break @ index=1995 n.inodes=4063 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65521 threshold=65536
2023/10/19 19:52:50 1697737970322768950 splitTwo newParent pagesize='131072' (n.inodes=4063 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970322768950 splitTwo returned@ pagesize='131072' n.inodes=1995/2*2 threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970336452985 splitIndex break @ index=1995 n.inodes=4042 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65520 threshold=65536
2023/10/19 19:52:50 1697737970336452985 splitTwo newParent pagesize='131072' (n.inodes=4042 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970336452985 splitTwo returned@ pagesize='131072' n.inodes=1995/2*2 threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970362805778 splitIndex break @ index=1996 n.inodes=3998 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65521 threshold=65536
2023/10/19 19:52:50 1697737970362805778 splitTwo newParent pagesize='131072' (n.inodes=3998 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1996
2023/10/19 19:52:50 1697737970362805778 splitTwo returned@ pagesize='131072' n.inodes=1996/2*2 threshold=65536 splitIndex=1996
2023/10/19 19:52:50 1697737970363002594 splitIndex break @ index=1996 n.inodes=4024 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65519 threshold=65536
2023/10/19 19:52:50 1697737970363002594 splitTwo newParent pagesize='131072' (n.inodes=4024 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1996
2023/10/19 19:52:50 1697737970363002594 splitTwo returned@ pagesize='131072' n.inodes=1996/2*2 threshold=65536 splitIndex=1996
2023/10/19 19:52:50 1697737970366462978 splitIndex break @ index=1995 n.inodes=4045 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65536 threshold=65536
2023/10/19 19:52:50 1697737970366462978 splitTwo newParent pagesize='131072' (n.inodes=4045 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970366462978 splitTwo returned@ pagesize='131072' n.inodes=1995/2*2 threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970375120406 splitIndex break @ index=1995 n.inodes=3995 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65517 threshold=65536
2023/10/19 19:52:50 1697737970375120406 splitTwo newParent pagesize='131072' (n.inodes=3995 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970375120406 splitTwo returned@ pagesize='131072' n.inodes=1995/2*2 threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970377918820 splitIndex break @ index=1995 n.inodes=4036 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65509 threshold=65536
2023/10/19 19:52:50 1697737970377918820 splitTwo newParent pagesize='131072' (n.inodes=4036 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970377918820 splitTwo returned@ pagesize='131072' n.inodes=1995/2*2 threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970379005367 splitIndex break @ index=1995 n.inodes=4039 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65523 threshold=65536
2023/10/19 19:52:50 1697737970379005367 splitTwo newParent pagesize='131072' (n.inodes=4039 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970379005367 splitTwo returned@ pagesize='131072' n.inodes=1995/2*2 threshold=65536 splitIndex=1995
2023/10/19 19:52:50 1697737970380971296 splitIndex break @ index=1995 n.inodes=4023 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=65505 threshold=65536
2023/10/19 19:52:50 1697737970380971296 splitTwo newParent pagesize='131072' (n.inodes=4023 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=65536 splitIndex=1995

...
...

Load average: 8.69 8.68 9.91
5-7G of RAM,
and bbolt MaxBatchDelay set to 100 ms.

oops, Houston, we have too many variables.
this test is running with NoSync=true, NoGrowSync=true, NoFreelistSync=true.
We have our own syncing implemented, which locks the writer to sync every 500k inserts per db or every 60s; it takes about 0ms to run.
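
Roughly like this; the names are illustrative, and db.Sync() is bbolt's manual flush for exactly this NoSync case:

var (
	syncMu   sync.Mutex
	inserted int
	lastSync = time.Now()
)

// maybeSync flushes the data file manually, since the DB runs with NoSync=true.
func maybeSync(db *bbolt.DB) error {
	syncMu.Lock()
	defer syncMu.Unlock()
	inserted++
	if inserted%500000 == 0 || time.Since(lastSync) >= 60*time.Second {
		lastSync = time.Now()
		return db.Sync()
	}
	return nil
}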

the rescan runs at 90k read tx/sec, not bad, near 💯k ... and mem stable at 5.1G

//
rescan done. 50M x 4 jobs = 200M cache hits, 50M reads, in less than 10 minutes.

2023/10/19 20:23:42 done=200000000 (took 585 seconds) (closewait 3 seconds)
Alloc: 2006 MiB, TotalAlloc: 86054 MiB, Sys: 2906 MiB, NumGC: 53

du -hs hashdb/50M/
1,2G    hashdb/50M/

zfs recordsize is at 16K

nntp-history/hashdb  compressratio  2.06x

zfs list
NAME                        USED
tank0/nntp-history/hashdb  1.16G

hashdb/50M$ du -hs *
74M     history.dat.hash.0
72M     history.dat.hash.1
74M     history.dat.hash.2
75M     history.dat.hash.3
76M     history.dat.hash.4
74M     history.dat.hash.5
76M     history.dat.hash.6
77M     history.dat.hash.7
77M     history.dat.hash.8
76M     history.dat.hash.9
73M     history.dat.hash.a
76M     history.dat.hash.b
73M     history.dat.hash.c
73M     history.dat.hash.d
74M     history.dat.hash.e
73M     history.dat.hash.f

hashdb/50M$ du -b *
201457664       history.dat.hash.0
197656576       history.dat.hash.1
202375168       history.dat.hash.2
207224832       history.dat.hash.3
208011264       history.dat.hash.4
203816960       history.dat.hash.5
208273408       history.dat.hash.6
213254144       history.dat.hash.7
210632704       history.dat.hash.8
207355904       history.dat.hash.9
200146944       history.dat.hash.a
210763776       history.dat.hash.b
200409088       history.dat.hash.c
200015872       history.dat.hash.d
203423744       history.dat.hash.e
204210176       history.dat.hash.f
2023/10/19 21:06:25 1697742385316289224 splitTwo newParent pagesize='262144' (n.inodes=7993 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=131072 splitIndex=3981
2023/10/19 21:06:25 1697742385316289224 splitTwo returned@ pagesize='262144' n.inodes=3981/2*2 threshold=131072 splitIndex=3981
2023/10/19 21:06:25 1697742385383028640 splitIndex break @ index=3981 n.inodes=7980 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=131069 threshold=131072
2023/10/19 21:06:25 1697742385383028640 splitTwo newParent pagesize='262144' (n.inodes=7980 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=131072 splitIndex=3981
2023/10/19 21:06:25 1697742385383028640 splitTwo returned@ pagesize='262144' n.inodes=3981/2*2 threshold=131072 splitIndex=3981
2023/10/19 21:06:25 RUN test p=2 nntp-history added=7977214 dupes=0 cLock=12133921 addretry=0 retry=0 adddupes=0 cdupes=0 cretry1=11888865 cretry2=0 32000000/50000000
2023/10/19 21:06:25 RUN test p=1 nntp-history added=8014512 dupes=0 cLock=12140183 addretry=0 retry=0 adddupes=0 cdupes=0 cretry1=11845305 cretry2=0 32000000/50000000
2023/10/19 21:06:25 RUN test p=4 nntp-history added=7993694 dupes=0 cLock=12131730 addretry=0 retry=0 adddupes=0 cdupes=0 cretry1=11874576 cretry2=0 32000000/50000000
2023/10/19 21:06:25 RUN test p=3 nntp-history added=8014580 dupes=0 cLock=12139764 addretry=0 retry=0 adddupes=0 cdupes=0 cretry1=11845656 cretry2=0 32000000/50000000
2023/10/19 21:06:26 1697742386057204104 splitIndex break @ index=3981 n.inodes=7981 n.pageElementSize=16 keyLen=8 valLen=9 elsize=33 sz=131040 threshold=131072
2023/10/19 21:06:26 1697742386057204104 splitTwo newParent pagesize='262144' (n.inodes=7981 common.MinKeysPerPage=2*2 || n.sizeLessThan=false) threshold=131072 splitIndex=3981
2023/10/19 21:06:26 1697742386057204104 splitTwo returned@ pagesize='262144' n.inodes=3981/2*2 threshold=131072 splitIndex=3981
...

when it splits pages you can see more writes happening; otherwise it's now completely silent.
NoSync and the other two are set to false again.

@go-while
Author

go-while commented Oct 19, 2023

this is very interesting now. I tuned some numbers in the app and bbolt runs smoothly 🤣
I now batch-flush to bbolt every 15 sec in the app and delay the workers' start within this timeframe to get a better distribution of flushes.

for j := 1; j <= len(ROOTBUCKETS); j++ {
	delay := j * int(BatchFlushEvery) / len(ROOTBUCKETS)
	go bootworker(delay)
}
/*
 * j is the worker id; it has to start with 1, not 0, or the math fails
 * (worker 0 would get delay 0).
 * BatchFlushEvery is 15000 ms here.
 * len(ROOTBUCKETS) is the total number of workers.
 * I named it ROOTBUCKETS because we create N root buckets and keep the chars
 * that make the root buckets [0-9a-f][0-9a-f][0-9a-f]... in this slice,
 * so we can iterate over one slice instead of 3 nested for-loops to generate
 * the 3 depth levels again where needed.
 * You'd need a struct{} chan with the size of workers and wait until all
 * booted workers pushed their struct{}{} into this channel before starting
 * the rest of your app.
 * If anybody can come up with a better idea, I'd be glad to read it :)
 */
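
A minimal sketch of that staggered boot plus the readiness channel described above (flushBatch and the 15s interval are illustrative):

flushEvery := 15000 * time.Millisecond // BatchFlushEvery
ready := make(chan struct{}, len(ROOTBUCKETS))
for j := 1; j <= len(ROOTBUCKETS); j++ {
	offset := time.Duration(j) * flushEvery / time.Duration(len(ROOTBUCKETS))
	go func(offset time.Duration) {
		ready <- struct{}{} // signal "booted" before the staggered first flush
		time.Sleep(offset)  // spread the first flushes across the flush window
		for {
			flushBatch() // hypothetical per-bucket batch flush to bbolt
			time.Sleep(flushEvery)
		}
	}(offset)
}
// block until every worker has pushed its struct{}{} before starting the rest of the app
for i := 0; i < len(ROOTBUCKETS); i++ {
	<-ready
}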

256 batch-queue workers per db were all booted at the same time and all hit the bboltdbs at almost the same time.

boltdb now has MaxBatchDelay 100ms and MaxBatchSize 16384.

zfs write bandwidth is near 0;
you see it writing a few megabytes for a few milliseconds, with long pauses.

conclusion:
distribute your batch writes over time; don't hit bbolt too hard with too many batches from too many goroutines.

stable inserts peaking at 120k tx/sec (50/50 search/insert) now!

the NVMes don't help much for read performance as there are no reads from zfs, it's all RAM... how to get it faster? 🍡

@go-while
Author

go-while commented Oct 20, 2023

nested buckets are back in the game and working too. no issues.

let's go to the moon; new target: 1B keys. 100M it does laughingly.

256 workers, one for every root bucket, each with its own queue.
If we delay the workers' boot to start within 16384 ms / 256 workers,
every 60ms one bucket would put its batch to bbolt.
But I've set batchdelay to 100ms; how will this correlate?
Maybe delay the start window to 30s, or lower batchdelay to 50ms?
Trial and error.
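
For reference, both knobs are plain fields on bbolt's DB, set after Open (a sketch; the filename is from this test setup):

db, err := bbolt.Open("history.dat.hash.0", 0600, nil)
if err != nil {
	log.Fatal(err)
}
db.MaxBatchDelay = 50 * time.Millisecond // default is 10ms
db.MaxBatchSize = 16384                  // default is 1000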

@go-while
Author

the most interesting point to look at while debugging is the splitTwo() function in bbolt's node.go:

	log.Printf("%d splitTwo returned@ pagesize='%d' n.inodes=%d/%d*2 threshold=%d splitIndex=%d fillPercent=%f", now, pageSize, len(n.inodes), common.MinKeysPerPage, threshold, splitIndex, fillPercent)
	return

speed is really good 👯

2023/10/21 22:15:39 WatchBolt: (inserted 136566/s) (+1365661 inserted in 10s)
2023/10/21 22:15:39 WatchBolt: (batchins 402/s) (+4027 batchins in 10s) medBS=~350
2023/10/21 22:15:39 WatchBolt: (searches 76800/s) (+768000 searches in 10s)

@tjungblu
Contributor

Glad you've got it to work fast 💪
Anything that would need to make its way back into bbolt?

@go-while
Author

go-while commented Oct 24, 2023

hi!

I'm debugging the hell out of bbolt, and maybe I've found something in db.go Batch(), in this block:

	if (db.batch == nil) || (db.batch != nil && len(db.batch.calls) >= db.MaxBatchSize) {
		// There is no existing batch, or the existing batch is full; start a new one.
		db.batch = &batch{
			db: db,
		}
		db.batch.timer = time.AfterFunc(db.MaxBatchDelay, db.batch.trigger)
	}

with a heavy insert load, like 256+ goroutines doing db.Batch, and db.MaxBatchSize set to something low like 3 (so that the size limit will most likely trigger before the delayed timer), and MaxBatchDelay higher than the default, like 1000 ms or worse, 30 seconds, we always create a new db.batch and schedule its trigger with AfterFunc. Can there be an undefined number of open/waiting timers?

At the line where we stop the timer with b.timer.Stop(): if you debug b.timer right before Stop() with '%#v', you see status "0x0", which I believe indicates that the timer was never started? Do we have the right pointer there?

#GPT said
		 *   0x0 (0): The timer is idle or inactive.
		 *   0x1 (1): The timer is active and waiting to fire.
		 *   0x2 (2): The timer has been stopped.
		 *   0x3 (3): The timer has fired.

now I think that when the hardware hits the write limits of the disks, things go nuts:
writes take longer and timers build up, because we always create a new batch and fire it async.

when MaxBatchSize triggers (see L930), we hit this condition many times at the line below
(if you add this debug there, you will see it):

	if b.db.batch == b {
		//log.Printf("INFO db.go: (121a) run() b.db.batch==b\n\n  b.db.batch='%#v'\n\n  b='%#v'\n\n", b.db.batch, b)
		b.db.batch = nil
	} else {
		// TODO DEBUG THIS
		log.Printf("WARN db.go: (121b) run() b.db.batch != b ???\n\n  b.db.batch='%#v'\n\n  b='%#v'\n\n", b.db.batch, b)
		//os.Exit(121)
	}

If I only change the condition in the first quote to this:

	if db.batch == nil {

so that a new batch is only created if there is no open one, then I can see the db.batch.calls count rising higher than MaxBatchSize, but the debug 'else' in "if b.db.batch == b {" never hits.

Still not sure; maybe this is intended, but I believe (I could be wrong) that these infinite writes under some circumstances could be related to this.

Replace Batch with this very verbose debug version and see how it exits with these log lines:

2023/10/24 14:18:55 1698149935390088386 db.batch=nil creates new
2023/10/24 14:18:55 1698149935390099527 overwrites? check batch.calls=3
2023/10/24 14:18:55 info db.go: trigger() b.date=1698149935390088386 age=(0 mils) b.calls=3
2023/10/24 14:18:55 1698149935390099527 overwrites calls? (db.batch.calls=0) this is zero now but was (3) before?
2023/10/24 14:18:55 1698149935390099527 ERROR? overwrites=true after append to: db.batch.calls 
             len_initial=3 len_overwrites=0 before=0 after=1

flip the const to 'BUGFIX bool = true' and see that it does not exit ❔

type batch struct {
	db    *DB
	timer *time.Timer
	start sync.Once
	calls []call
	date  int64 // debugs age in trigger
}

const BUGFIX bool = false
func (db *DB) Batch(fn func(*Tx) error) error {
	errCh := make(chan error, 1)

	trace := time.Now().UnixNano()
	overwrites := false

	// ORIGINAL WITH DEBUGS ADDED
	len_initial := -1
	len_overwrites := -1
	len_before_append := -1
	len_after_append := -1
	db.batchMu.Lock()
	if (db.batch == nil) || (!BUGFIX && db.batch != nil && len(db.batch.calls) >= db.MaxBatchSize) {

		// There is no existing batch, or the existing batch is full; start a new one.
		if db.batch == nil {
			log.Printf("%d db.batch=nil creates new", trace)
		} else if len(db.batch.calls) > 0 {
			len_initial = len(db.batch.calls)
			log.Printf("%d overwrites? check batch.calls=%d", trace, len_initial)
			overwrites = true
		}
		db.batch = &batch{
			db: db,
			date: trace, // debugs age in trigger
		}
		db.batch.timer = time.AfterFunc(db.MaxBatchDelay, db.batch.trigger)
		//log.Printf("Launch timer db.batch.timer='%#v'", db.batch.timer)
		/*
		 * Launch timer db.MaxBatchDelay=10000000000
		 *    db.batch.timer='&time.Timer{
		 *       C:(<-chan time.Time)(nil),
		 *       r:time.runtimeTimer{
		 *           pp:0xc00002a000, when:1448181161246213, period:0,
		 *           f:(func(interface {}, uintptr))(0x4abca0),
		 *           arg:(func())(0x4fd420), seq:0x0, nextwhen:0,
		 *           status:0x1
		 *       }}'
		 */
		if overwrites {
			len_overwrites = len(db.batch.calls)
			log.Printf("%d overwrites calls? (db.batch.calls=%d) this is zero now but was (%d) before?", trace, len_overwrites, len_initial)
			//time.Sleep(time.Second*5)
		}
	}
	len_before_append = len(db.batch.calls)
	if overwrites {
		//log.Printf("%d overwrites before append to: db.batch.calls=%d", trace, len_before)
	}
	db.batch.calls = append(db.batch.calls, call{fn: fn, err: errCh})
	len_after_append = len(db.batch.calls)
	if overwrites && len_initial != len_before_append {
		log.Printf("%d ERROR? overwrites=%t after append to: db.batch.calls len_initial=%d len_overwrites=%d before=%d after=%d", trace, overwrites, len_initial, len_overwrites, len_before_append, len_after_append)
		os.Exit(66)
	}
	if len(db.batch.calls) >= db.MaxBatchSize {
		// wake up batch, it's ready to run
		// DEBUG
		//if len(db.batch.calls) > db.MaxBatchSize {
			//log.Printf("db.go: wakeup batch.calls=%d / db.MaxBatchSize=%d => trigger", len(db.batch.calls), db.MaxBatchSize)
		//}
		go db.batch.trigger()
	} else {
		//log.Printf("db.go: nowakeup batch.calls=%d / db.MaxBatchSize=%d db.MaxBatchDelay=%d", len(db.batch.calls), db.MaxBatchSize, db.MaxBatchDelay)
	}
	db.batchMu.Unlock()

	err := <-errCh
	if err == trySolo {
		err = db.Update(fn)
	}
	return err
}

// trigger runs the batch if it hasn't already been run.
func (b *batch) trigger() {
	kill := false
	if b == nil {
		log.Printf("WARN trigger() b=nil")
		kill = true
	} else if b.db == nil {
		log.Printf("WARN trigger() b.db=nil")
		kill = true
	} else if b.db.batch == nil {
		log.Printf("WARN trigger() b.db.batch=nil")
		//kill = true
	} else if b.db.batch.calls == nil {
		log.Printf("WARN trigger() b.db.batch.calls=nil")
		//kill = true
	} else if len(b.db.batch.calls) == 0 {
		log.Printf("WARN trigger() b.db.batch.calls empty")
		//kill = true
	}
	log.Printf("info db.go: trigger() b.date=%d age=(%d mils) b.calls=%d", b.date, (time.Now().UnixNano()-b.date)/1e6, len(b.calls))
	if kill {
		os.Exit(99)
	}
	b.start.Do(b.run)
}

maybe add a debugging counter to it to get an idea of how many timers are running/waiting.
things go too fast; this is not an easy thing to debug 👯
and you need high load, up to the point where the disks can't write faster and things may build up.

gogogo
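
A minimal sketch of such a counter, using sync/atomic (add "sync/atomic" to the imports); pendingTriggers is a name invented for this sketch, not part of bbolt:

var pendingTriggers int64 // debug only: trigger() goroutines fired but not yet finished

func (b *batch) trigger() {
	n := atomic.AddInt64(&pendingTriggers, 1)
	defer atomic.AddInt64(&pendingTriggers, -1)
	log.Printf("trigger() fired, pending=%d", n)
	b.start.Do(b.run)
}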

go-while commented Oct 24, 2023

Maybe I've got something that can help:

###DEBUG START 1698163402996429837###
2023/10/24 18:03:23 1698163402996429837 WARN db.go: (121b) run() b.db.batch != b ???
2023/10/24 18:03:23 1698163402996429837 !!! b.db.batch='&bbolt.batch{db:(*bbolt.DB)(0xc0097d2fc0), timer:(*time.Timer)(0xc00ba2e2d0), start:sync.Once{done:0x0, m:sync.Mutex{state:0, sema:0x0}}, calls:[]bbolt.call{bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a2ae0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a2ba0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a2c60)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a2d20)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a2de0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a2ea0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a2f60)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3020)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a30e0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a31a0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3260)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3320)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a33e0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a34a0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3560)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3620)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a36e0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a37a0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3860)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3920)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a39e0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3aa0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3b60)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3c20)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3ce0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3da0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3e60)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136a3f20)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4000)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d40c0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4240)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4300)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d43c0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4480)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4540)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4600)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d46c0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4780)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4840)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- 
error)(0xc0136d4900)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d49c0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4a80)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4b40)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4c00)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4cc0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4d80)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4e40)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4f00)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d4fc0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d5080)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d5140)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d5200)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136d52c0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0139d12c0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0139d1da0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0139fd800)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc013ab4b40)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259aa0)}}, date:1698163402896434042, id:0x4}'

2023/10/24 18:03:23 1698163402996429837 !!! b='&bbolt.batch{db:(*bbolt.DB)(0xc0097d2fc0), timer:(*time.Timer)(0xc013319450), start:sync.Once{done:0x0, m:sync.Mutex{state:1, sema:0x0}}, calls:[]bbolt.call{bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc01332dec0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc01332df80)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc01333cba0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc01333cc60)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259380)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259620)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0122596e0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0122597a0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259860)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259920)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0122599e0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259b00)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259b60)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259c20)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259ce0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259da0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259e60)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012259f20)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274000)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0122740c0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0122743c0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274480)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274780)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274840)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274900)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274a80)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274b40)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274c00)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274cc0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274d80)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274e40)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274f00)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012274fc0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275080)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275140)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275200)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0122752c0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275380)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275440)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- 
error)(0xc012275500)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0122755c0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275680)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275740)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275800)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0122758c0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275980)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275a40)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275b00)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275bc0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275c80)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275d40)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275e00)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275ec0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc012275f80)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc013682060)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc013682120)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136821e0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136822a0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc013682360)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc013682420)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136824e0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc0136825a0)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc013682660)}, bbolt.call{fn:(func(*bbolt.Tx) error)(0x662b80), err:(chan<- error)(0xc01360f200)}}, date:1698163402846272732, id:0x1}'

2023/10/24 18:03:23 1698163402996429837 !3! b.db.batch.id=4 != b.id=1 ???

###DEBUG END 1698163402996429837###

I guess my idea was somehow right: batches can build up.

I'm patching these debug statements into this branch.

Testing with this now.

./race.sh, go run nntp-history-test.go, or ./nntp-history-test (if compiled)
runs 4 identical jobs hashing ints => sha256, resulting in 1 million inserts.
A re-run scans 1024*1024 entries for missing hashes.

Just fixed some data races introduced by the log prints. 💃

If anyone can point me to an internal test, I'd run that too.

running it in "race" or "go run" mode is not fast enough to trigger a build up but i can see some with the compiled one.

2023/10/24 18:29:10 1698164950185247089 !3! b.db.batch.id=40 != b.id=38 ???

2023/10/24 18:44:44 1698165884676286942 !9! b.db.batch.id=26 != b.id=25 diff=1 ???
2023/10/24 18:44:44 1698165884686400208 !9! b.db.batch.id=31 != b.id=19 diff=12 ???
2023/10/24 18:44:44 1698165884723785458 !9! b.db.batch.id=43 != b.id=42 diff=1 ???
2023/10/24 18:44:44 1698165884731057101 !9! b.db.batch.id=35 != b.id=21 diff=14 ???
2023/10/24 18:44:44 1698165884732422160 !9! b.db.batch.id=33 != b.id=19 diff=14 ???
2023/10/24 18:44:44 1698165884748080688 !9! b.db.batch.id=36 != b.id=24 diff=12 ???
2023/10/24 18:44:44 1698165884753737246 !9! b.db.batch.id=38 != b.id=20 diff=18 ???

Note that my test app runs with 16 DBs at a time.
They don't interact with each other, but they all spam the same log, so numbers are from different DBs at the same time and the ids can differ across lines.

// edit
I first guessed that this simple counter was what hurt performance (a drop of 70%, to only 30k tx/sec) <<--
but no: I had MaxBatchSize set to 3 xD

... and getNextID, which I placed into the debug tree, runs inside the batchMu mutex, so it should not hurt anyway.

Performance with getNextID is normal: 100k+ tx/sec.
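
For context, getNextID in my debug tree is roughly this: a plain counter that is only touched while batchMu is held, so it needs no extra synchronization (nextBatchID is a field added to DB in the debug branch, not upstream bbolt):

// getNextID must only be called while db.batchMu is held.
func (db *DB) getNextID() uint64 {
	db.nextBatchID++
	return db.nextBatchID
}

In Batch() it is called under the lock, e.g. db.batch = &batch{db: db, date: trace, id: db.getNextID()}.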

go-while commented Oct 24, 2023

I don't know how this behaves on ext4 or any other filesystem,
because in iotop I can see the app writing with insane bandwidth, while zfs shows nothing of it
and only flushes to the nvme storage periodically.

1560679 be/4 user         0.00 B/s   38.66 M/s  ?unavailable?  ./nntp-history-test -numcpu=4 -BoltDB_PageSize=16 -BoltDB_MaxBatchSize=3 -BoltDB_MaxBatchDelay=1 -BatchFlushEvery=100 -RootBUCKETSperDB=256 -todo 100000000 -p 4 -keyindex=2 -pprofcpu=true -pprofmem=true -pprof localhost:1234 -NoReplayHisDat=false
1560680 be/4 user         0.00 B/s   46.45 M/s  ?unavailable?  ./nntp-history-test -numcpu=4 -BoltDB_PageSize=16 -BoltDB_MaxBatchSize=3 -BoltDB_MaxBatchDelay=1 -BatchFlushEvery=100 -RootBUCKETSperDB=256 -todo 100000000 -p 4 -keyindex=2 -pprofcpu=true -pprofmem=true -pprof localhost:1234 -NoReplayHisDat=false
1560681 be/4 user         0.00 B/s  200.80 M/s  ?unavailable?  ./nntp-history-test -numcpu=4 -BoltDB_PageSize=16 -BoltDB_MaxBatchSize=3 -BoltDB_MaxBatchDelay=1 -BatchFlushEvery=100 -RootBUCKETSperDB=256 -todo 100000000 -p 4 -keyindex=2 -pprofcpu=true -pprofmem=true -pprof localhost:1234 -NoReplayHisDat=false
1560683 be/4 user         0.00 B/s  206.21 M/s  ?unavailable?  ./nntp-history-test -numcpu=4 -BoltDB_PageSize=16 -BoltDB_MaxBatchSize=3 -BoltDB_MaxBatchDelay=1 -BatchFlushEvery=100 -RootBUCKETSperDB=256 -todo 100000000 -p 4 -keyindex=2 -pprofcpu=true -pprofmem=true -pprof localhost:1234 -NoReplayHisDat=false
1564523 be/4 user         0.00 B/s  165.91 M/s  ?unavailable?  ./nntp-history-test -numcpu=4 -BoltDB_PageSize=16 -BoltDB_MaxBatchSize=3 -BoltDB_MaxBatchDelay=1 -BatchFlushEvery=100 -RootBUCKETSperDB=256 -todo 100000000 -p 4 -keyindex=2 -pprofcpu=true -pprofmem=true -pprof localhost:1234 -NoReplayHisDat=false
1665661 be/4 user         0.00 B/s  172.42 M/s  ?unavailable?  ./nntp-history-test -numcpu=4 -BoltDB_PageSize=16 -BoltDB_MaxBatchSize=3 -BoltDB_MaxBatchDelay=1 -BatchFlushEvery=100 -RootBUCKETSperDB=256 -todo 100000000 -p 4 -keyindex=2 -pprofcpu=true -pprofmem=true -pprof localhost:1234 -NoReplayHisDat=false

My problem now is that I'm trying hard to reproduce the problem I had, but since I rewrote a lot of my flushing code, I can't get it into infinite writes anymore... at the moment. Still trying to reproduce. What I can get are longer zfs flushes, but no continuous ones anymore xD

But these accumulations are happening, and I'm almost 100% sure it has something to do with this.

2023/10/24 20:08:06 1698170886015972085 !9! b.db.batch.id=25873 != b.id=25823 diff=50 ???
2023/10/24 20:08:06 1698170886016537722 !9! b.db.batch.id=25919 != b.id=25869 diff=50 ???
2023/10/24 20:08:06 1698170886018185658 !9! b.db.batch.id=25911 != b.id=25861 diff=50 ???
2023/10/24 20:08:06 1698170886019904352 !9! b.db.batch.id=25919 != b.id=25868 diff=51 ???
2023/10/24 20:08:11 1698170891185435838 !9! b.db.batch.id=27566 != b.id=27513 diff=53 ???
2023/10/24 20:08:11 1698170891418007268 !9! b.db.batch.id=27770 != b.id=27720 diff=50 ???

go-while commented Oct 25, 2023

This debugging branch is a bit messed up, but maybe it works; I don't know if anyone can follow. 👯

Line 927 of Batch() here is the clean approach I'd suggest: a batchProcessor() below it, no time.AfterFunc but a static ticker, and no overshooting. It's kind of hard to follow at this link, as all the counters and debug statements I use are missing 🍡
commit

go batchProcessor() needs its place right before the return in db.go Open(), roughly as sketched below.
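
A minimal sketch of the wiring, assuming batchProcessor is defined as in the linked branch (the name is from my branch, not upstream bbolt):

	// in db.go Open(), right before the final return:
	go db.batchProcessor() // single background goroutine draining batches
	return db, nil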

go-while commented Oct 25, 2023

Every day with bbolt is a good day, but there is still a logical issue here, and I'm not happy with this idea either.

(Initially) we call(ed) trigger in a goroutine to "wake up" the batch, but it can't run before we release the lock, because it acquires the same lock too.

I understand the intention was to not break (or delay) running batches, as the comment in run() says, but under some rare circumstances this can lead to goroutines building up. I can see some, not that many, but I know a bug is in there; just because I can't trigger it at the moment does not mean it won't hit sometime in the future.

Another thing we should talk about is the run() function.
It locks, checks whether the batch is itself, nils it, unlocks, and does the db work outside of the mutex.
So there can be an undefined number of N goroutines hitting the db with Update calls, all queueing up again at the beginRWTx mutex.

If the storage can't write fast enough, the Batch() func keeps creating new batches and firing them off via time.AfterFunc.

Furthermore, isn't the comment // points to us, and we hold the mutex anyway. wrong?
We don't hold the mutex at that point, or am I wrong? The pointer is nil, yes. (See the run() excerpt below.)
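
For reference, the relevant part of run() looks roughly like this in the version under discussion (paraphrased, not a verbatim copy); note that batchMu is released before db.Update runs:

func (b *batch) run() {
	b.db.batchMu.Lock()
	b.timer.Stop()
	// Make sure no new work is added to this batch, but don't break
	// other batches.
	if b.db.batch == b {
		b.db.batch = nil
	}
	b.db.batchMu.Unlock() // <- the mutex is released here

retry:
	for len(b.calls) > 0 {
		var failIdx = -1
		err := b.db.Update(func(tx *Tx) error { // runs outside batchMu
			for i, c := range b.calls {
				if err := safelyCall(c.fn, tx); err != nil {
					failIdx = i
					return err
				}
			}
			return nil
		})
		if failIdx >= 0 {
			// take the failing transaction out of the batch. it's
			// safe to shorten b.calls here because db.batch no longer
			// points to us, and we hold the mutex anyway.   <- the comment in question
			c := b.calls[failIdx]
			b.calls[failIdx], b.calls = b.calls[len(b.calls)-1], b.calls[:len(b.calls)-1]
			c.err <- trySolo
			continue retry
		}
		// pass success, or bolt internal errors, to all callers
		for _, c := range b.calls {
			c.err <- err
		}
		break retry
	}
}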

☕ please, thanks!

go-while commented Oct 26, 2023

The current implementation of MaxBatchDelay and MaxBatchSize does not seem to function as expected.
The existing solution creates a new batch whenever MaxBatchSize is reached and processes the full one in the background. If, on the other hand, MaxBatchSize is not reached, the batch executes via time.AfterFunc once MaxBatchDelay fires. This does introduce a delay, but under high throughput the fired timers accumulate at the db rwmutex. Consequently, there is no way to get a consistent, real delay between flushes.

I've been working on a potential solution; here's how the updated process works:
A new variable named MaxBatchQueue has been added to control the number of pre-queued batches.
Debug points have been integrated to offer real-time insight into how MaxBatchSize, MaxBatchDelay, and MaxBatchQueue are behaving.
This gives a clearer picture of the batch process, especially under high throughput, to help find good values.

When Batch() creates a new batch because the existing one exceeds MaxBatchSize, the old (full) batch (a pointer to its calls) is passed to the db.batCH channel of the batchProcessor.
The db.batCH channel is capped at MaxBatchQueue and only accepts full batches (those that reached MaxBatchSize).
The batchProcessor has a timer.C that fires every MaxBatchDelay and passes the current db.batch.calls to runBatch().
The timer in the batchProcessor is reset after every write or when there is no work to be done.

This approach ensures that the delay between flushes works as intended, without interrupting batches, as long as MaxBatchSize does not trigger.
If MaxBatchSize is reached, full batches are processed one by one, without delays.
This eliminates the problem of an undefined number of AfterFuncs accumulating in the background, waiting for the mutex to release and then firing consecutively.
If storage is slower, throughput naturally slows down as the queue fills up. A rough sketch follows below.
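
A rough sketch of the idea; db.batCH, MaxBatchQueue, and runBatch are the names from my branch, everything else is simplified, so treat this as an outline rather than the actual commit:

// batchProcessor is the single background goroutine draining batches.
// db.batCH is assumed to be created as: make(chan *batch, db.MaxBatchQueue)
func (db *DB) batchProcessor() {
	timer := time.NewTimer(db.MaxBatchDelay)
	defer timer.Stop()
	for {
		select {
		case b, ok := <-db.batCH:
			if !ok {
				return // db is closing
			}
			// a full batch (len(b.calls) >= MaxBatchSize):
			// run it immediately, back to back with other full batches.
			db.runBatch(b)
		case <-timer.C:
			// MaxBatchDelay elapsed: flush whatever partial batch is pending.
			db.batchMu.Lock()
			b := db.batch
			db.batch = nil
			db.batchMu.Unlock()
			if b != nil && len(b.calls) > 0 {
				db.runBatch(b)
			}
		}
		// reset after every write or idle pass, so the delay between
		// timer-driven flushes stays consistent.
		if !timer.Stop() {
			select {
			case <-timer.C:
			default:
			}
		}
		timer.Reset(db.MaxBatchDelay)
	}
}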

Full commit in this branch: batchProcessor-t2

Testing now.

If you got an email with a different text... don't ask... the cat jumped onto the table while I tried to hit edit.. she landed on Enter! Unreal...

@github-actions github-actions bot added the stale label Apr 16, 2024
@ahrtr ahrtr removed the stale label May 10, 2024
@go-while go-while closed this as not planned May 30, 2024