Simplified DiskWriterQueue with blocking concurrency #2411

ltetak · 2024-01-25T10:26:15Z

It is relatively easy to put the DiskWriterQueue into a state where it does nothing. The cause is that the logic does not properly track which _task is the current one. This leads to several problems:

Wait() waits for the wrong _task
_task is not started at all

e.g.
#2307

My repro steps were to run a lot of Inserts and Deletes in parallel (to fill up the disk queue), then call _db.Checkpoint() every couple of seconds to force a full db lock and a Wait() invocation.

The fix is to use a much simpler blocking approach (one thread is dedicated to this). IMO it is a good tradeoff for now. It can later be replaced with an awaitable mutex version.
Edit: I added an async version of the semaphore which does not block the thread.
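For readers who want to see the shape of the blocking approach, here is a minimal sketch. It is not the code from this PR: it is written in Python purely as a neutral illustration, and the names BlockingWriterQueue, enqueue, wait, close and write_page are all hypothetical. It assumes one dedicated worker thread that is started once up front, and a wait() that blocks until everything enqueued so far has been processed, so there is no "current task" to lose track of.

```python
import queue
import threading


class BlockingWriterQueue:
    """Hypothetical stand-in for a disk writer queue; not LiteDB code."""

    _STOP = object()  # sentinel that tells the worker thread to exit

    def __init__(self, write_page):
        self._write_page = write_page  # callable that persists one page
        self._pages = queue.Queue()
        self.errors = []               # write failures, kept for inspection
        # The worker is started exactly once, so it can never be "not started".
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, page):
        self._pages.put(page)

    def wait(self):
        # Blocks until every page enqueued so far has been processed.
        self._pages.join()

    def close(self):
        self._pages.put(self._STOP)
        self._worker.join()

    def _run(self):
        while True:
            page = self._pages.get()
            try:
                if page is self._STOP:
                    return
                self._write_page(page)
            except Exception as exc:
                # Record the failure and keep the worker alive.
                self.errors.append(exc)
            finally:
                # Always mark the item done so wait() cannot hang.
                self._pages.task_done()
```

The tradeoff mentioned above is visible here: one thread is permanently parked on the queue, in exchange for state that is simple enough to reason about.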

mbdavid · 2024-02-13T19:11:34Z

Thanks! This is old code that must be updated.

jdtkw · 2024-03-06T06:31:01Z

Thanks @ltetak - this indeed resolved our issue (#2307 - I work with @dgodwin1175), but v5.0.18 and v5.0.19 cause us to hit #2435 before we can validate this with an official build. A custom build of #2436 on top of v5.0.19 (which includes #2411) seems to indicate that we can have a stable solution.

ltetak · 2024-03-06T07:42:33Z

Hi @jdtkw, transactions (and especially the AutoTransaction class) were the next thing I wanted to take a look at. I know about a couple of problems there.

AutoTransaction can fail when reverting the transaction. This is bad by itself, but it is doubly bad because it hides the original exception.
Error handling in transactions is wrong, which causes wrong counts. PR #2436 ("Fix #2435 Transactions are not removed in LiteDB 5.0.18") may be a fix for it, but we need to be sure the DB is in a good state. There are a lot of "ENSURE" errors. My guess is that some transaction does not return the DB to a valid state, and that breaks it.
We run the database in single-threaded mode (we serialize every access to the db with locks), so it must be either a problem somewhere in the algorithm or some external exception. I have some evidence that external exceptions make this problem much worse, so I would start there. It means that an unstable storage medium causing random exceptions may lead to a corrupted database (which should not happen thanks to the journal approach).
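To illustrate the first point (a failing rollback hiding the original exception), here is a small sketch of one way to keep the original error visible. It is not LiteDB's AutoTransaction: it is written in Python as a neutral illustration, and FakeTransaction, run_in_transaction and the rollback_error attribute are hypothetical names invented for the example.

```python
class FakeTransaction:
    """Hypothetical transaction whose rollback can itself fail."""

    def __init__(self, rollback_fails):
        self.rollback_fails = rollback_fails
        self.rolled_back = False

    def rollback(self):
        if self.rollback_fails:
            raise OSError("rollback failed")
        self.rolled_back = True


def run_in_transaction(tx, operation):
    """Run operation; on failure, roll back without losing the first error."""
    try:
        return operation()
    except Exception as original:
        try:
            tx.rollback()
        except Exception as rollback_error:
            # Attach the secondary failure instead of letting it
            # replace the exception that actually caused the rollback.
            original.rollback_error = rollback_error
        raise  # re-raises the original exception
```

With this shape the caller always sees the exception from the failed operation, and a rollback failure is still available on it for logging.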

ltetak mentioned this pull request Jan 25, 2024

Disk problems #2412

Merged

Simplified DiskWriterQueue with blocking concurrency

33f85d5

ltetak force-pushed the diskwriterqueue branch from 23c3a07 to 33f85d5 Compare January 25, 2024 12:01

Async DiskWriterQueue implementation

f21cd84

mbdavid merged commit 6d2a165 into mbdavid:master Feb 13, 2024
1 check failed
