
Bulk indexer is inefficient and slow #528

Closed
schorsch opened this issue Sep 10, 2022 · 4 comments

Comments

@schorsch

While it may be a nice idea for some codebases to use structs with JSON conversion, it is not the best idea when adding a couple of hundred thousand entries.
In my case, to give you a rough idea: importing ~350,000 entries (from CSV files) with ~10-20 short fields and no extensive analyzers, into 5 different indices, takes >7 minutes on a recent Linux Dell Precision 5550. I'll spare you the bulk size, flush settings and other details because they don't change much about the basic problem.

I used olivere's elastic client, which is structured internally to also accept plain strings for bulk requests (see the Source method call here). With this in place I was able to build the bulk request body rows with a few fmt.Sprintf() calls, avoiding struct initialization and the CPU and memory allocations that come with it. With that handling the import took ~6 seconds; before, with olivere's struct handling, it took ~12 seconds.
Yes, building a string that is valid JSON may be inconvenient in some situations, but a decent programmer should be able to achieve this. And we are talking about bulk inserts, which are typically used in situations where you need speed. When I look at the code in your bulk_indexer worker run and dive into the meta line, I am pretty stunned by how difficult it can be to create something wrapped in braces and joined with commas.
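To make this concrete, here is a rough sketch of the plain-string approach. The index name, fields and the low-level es.Bulk call from go-elasticsearch are illustrative stand-ins, not the actual import code:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

// Hypothetical record type standing in for one parsed CSV row.
type entry struct {
	ID    string
	Name  string
	Price int
}

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	rows := []entry{{"1", "foo", 10}, {"2", "bar", 20}} // stand-in for CSV data

	// Build the NDJSON bulk body directly as a string: one meta line and
	// one document line per entry, each terminated by a newline.
	var b strings.Builder
	for _, e := range rows {
		b.WriteString(fmt.Sprintf(`{"index":{"_index":"entries","_id":%q}}`+"\n", e.ID))
		b.WriteString(fmt.Sprintf(`{"name":%q,"price":%d}`+"\n", e.Name, e.Price))
	}

	// Send the whole batch in one low-level Bulk request.
	res, err := es.Bulk(strings.NewReader(b.String()))
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println(res.Status())
}
```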

To wrap it up, your bulk indexer is pretty much useless unless you are inserting only 100 entries or you opt to throw hardware at the problem (which will probably NOT work). I am not an ultra-experienced Go dev, so I may have missed some low-level methods in here. My advice would be to look at olivere's code structures and adopt some of his bulk ideas, also because your open bulk-related tickets don't look promising. Unfortunately he decided not to continue his lib, so one may need to fork it and make it work with ES 8+.

@Anaethelion
Contributor

Hi @schorsch

The performance you describe is definitely not normal.
Could you share a bit of code showing how you used the bulk indexer?

You mention building the body rows with fmt.Sprintf; you can definitely do that with the bulk indexer.
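A minimal sketch of what that could look like — the index name, worker count and documents below are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
	"github.com/elastic/go-elasticsearch/v8/esutil"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	bi, err := esutil.NewBulkIndexer(esutil.BulkIndexerConfig{
		Client:     es,
		Index:      "entries", // placeholder index name
		NumWorkers: 4,         // parallel flush workers
	})
	if err != nil {
		log.Fatal(err)
	}

	// The item body is a plain reader, so a document built with fmt.Sprintf
	// (no intermediate struct, no json.Marshal) works just as well.
	for i := 0; i < 1000; i++ {
		doc := fmt.Sprintf(`{"name":"entry-%d","price":%d}`, i, i*10)
		err := bi.Add(context.Background(), esutil.BulkIndexerItem{
			Action:     "index",
			DocumentID: fmt.Sprintf("%d", i),
			Body:       strings.NewReader(doc),
		})
		if err != nil {
			log.Fatal(err)
		}
	}

	if err := bi.Close(context.Background()); err != nil {
		log.Fatal(err)
	}
	stats := bi.Stats()
	log.Printf("indexed=%d failed=%d", stats.NumIndexed, stats.NumFailed)
}
```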

@schorsch
Author

We are about to implement some benchmarks and dive into profiling, which is always a good thing to learn. I'll share some examples, or maybe don'ts, as soon as we have identified possible bottlenecks.
Thanks for the hint to the example; I looked it up, and we already used it that way.

@Anaethelion
Contributor

Closing this issue.

I'll be happy to reopen if you come back with code and/or examples!

@schorsch
Author

We did evaluate our code, and it officially makes me an idiot now!

While reviewing the code that led to this ticket, I found that I had overlooked the FlushInterval, which was set to 1 second and made the loop take forever.
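For reference, flush behaviour is configured on esutil.BulkIndexerConfig; a minimal sketch with illustrative values (not our production settings):

```go
package main

import (
	"context"
	"log"
	"runtime"
	"time"

	"github.com/elastic/go-elasticsearch/v8"
	"github.com/elastic/go-elasticsearch/v8/esutil"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// Flushing is driven by FlushBytes (size threshold) and FlushInterval
	// (time-based flush). A very low FlushInterval makes the indexer send
	// many small bulk requests instead of a few large ones.
	bi, err := esutil.NewBulkIndexer(esutil.BulkIndexerConfig{
		Client:        es,
		Index:         "entries",        // placeholder index name
		NumWorkers:    runtime.NumCPU(), // parallel flush workers
		FlushBytes:    5 * 1024 * 1024,  // flush once ~5 MB are buffered
		FlushInterval: 30 * time.Second, // periodic flush as a safety net
	})
	if err != nil {
		log.Fatal(err)
	}

	// ... Add() items here ...

	if err := bi.Close(context.Background()); err != nil {
		log.Fatal(err)
	}
}
```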

We benchmarked your lib against olivere's, and guess what: it outperformed it. Profiling showed ~28% fewer memory allocations, and it ran **~25% faster**.

Nothing more to say than: sorry for wasting your time!!!
