Skip to content
This repository has been archived by the owner on Sep 23, 2023. It is now read-only.

ETL load does not support 0 value keys #766

Open
elee1766 opened this issue Dec 5, 2022 · 3 comments
Open

ETL load does not support 0 value keys #766

elee1766 opened this issue Dec 5, 2022 · 3 comments

Comments

@elee1766
Copy link
Contributor

elee1766 commented Dec 5, 2022

first time using etl just now- I was scratching my head over why Load was not working, it was because i was trying to load empty values, possible with direct transaction, but seems currently seems in ETL framework it is used as the delete identifier.

https://github.com/ledgerwatch/erigon-lib/blob/main/etl/collector.go#L239

One possible solution - add TransformArg "AllowEmptyValues" to skip this check, and allow therefore "zero values"? this is something supported by MDBX. Happy to make a PR if it seems like a good idea.

example use case, filling a database with a large amount of random empty keys for performance benchmark:

func (d *Database) AddKeys(ctx context.Context, bucket string, bytesize int, keysize int) error {
	src := frand.New()
	cl := etl.NewCollector("mdbxbench", os.TempDir(), etl.NewOldestEntryBuffer(etl.BufferOptimalSize*5))
	for i := 0; i < (bytesize / keysize); i++ {
		o := make([]byte, keysize)
		src.Read(o)
		err := cl.Collect(o[:], []byte{})
		if err != nil {
			return err
		}
	}
	defer cl.Close()
	err := d.DB().Update(ctx, func(tx kv.RwTx) error {
		return cl.Load(tx, bucket, func(k, v []byte, _ etl.CurrentTableReader, next etl.LoadNextFunc) error {
			return next(k, k, []byte{}) /// it would be good if this actually added the entry, instead of deleting. For now i am just doing []byte{0}
		}, etl.TransformArgs{})
	})
	if err != nil {
		return err
	}
	return nil
}
@AskAlexSharov
Copy link
Collaborator

AskAlexSharov commented Dec 5, 2022

It's maybe-big task. Next things needs to be done:

  • check that mdbx-go bindings do support empty values - maybe not (because LMDB didn't support them). Sub-tasks - can store empty value in: value, key, value of DupSort table.
  • check that kv_mdbx.go does support empty values
  • check that kv_remote and remotedbserver do support empty values
  • ensure that empty value it's not nil. That all API's (Seek, SeekBothRange, Get, Next, First, Last, etc...)- if you put empty value then return empty value not nil. likely need write tests for this special case - because it's easy to break.
  • etl has 3 collectors all of them need support empty value
  • etl now implemented next way: nil-value means delete this key. Need keep this feature. etl doesn't support deletes from DupSort tables.
  • take a look into rawdb package - maybe there are checks like if len(k) == 0: nil and empty value will pass this check.

@elee1766
Copy link
Contributor Author

elee1766 commented Dec 6, 2022

Hm, I will take one step at a time. and update this post as I investigate

* check that mdbx-go bindings do support empty values - maybe not (because LMDB didn't support them). Sub-tasks - can store `empty value` in: value, key, value of DupSort table.

it seems https://github.com/torquem-ch/mdbx-go/blob/master/mdbx/txn.go#L548 does not support empty values. replaces 0 length with a []byte{0}, then passes to Put. it seems in MDBX - reserve.iov_len = (data ? data->iov_len : 0) + sizeof(mdbx_attr_t); is used, implying that if NULL is passed as data, that would be how to set an empty value. Assuming key is the same

* check that kv_mdbx.go does support empty values

seems it passes through https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L1109, https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L1374 all the way down, so it does support empty values

as for empty keys, seems it is properly handling it, at least here: https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L535. tests will need to be written

TODO:

* check that kv_remote and remotedbserver do support empty values

* ensure that `empty value` it's not `nil`. That all API's (Seek, SeekBothRange, Get, Next, First, Last, etc...)- if you put `empty value` then return `empty value` not `nil`. likely need write tests for this special case - because it's easy to break.

* etl has 3 collectors all of them need support `empty value`

* etl now implemented next way: `nil`-value means delete this key. Need keep this feature. etl doesn't support deletes from DupSort tables.

* take a look into `rawdb` package - maybe there are checks like `if len(k) == 0`: `nil` and `empty value` will pass this check.

@AskAlexSharov
Copy link
Collaborator

FYI: you can ignore b.AutoDupSortKeysConversion == true case

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants