Segmentation faults in moka-cht under heavy workloads on a many-core machine #34

Closed
tatsuya6502 opened this issue Sep 5, 2021 · 15 comments · Fixed by #157

@tatsuya6502
Member

tatsuya6502 commented Sep 5, 2021

I have seen segmentation faults a few times while running mokabench on Moka v0.5.1. They seem to occur randomly while the get_or_insert_with method is being called heavily and concurrently from many threads.

+ ./target/release/mokabench --enable-invalidate-entries-if --enable-insert-once
Cache, Max Capacity, Clients, Inserts, Reads, Hit Rate, Duration Secs
Moka Unsync Cache, 100000, -, 14696832, 31104534, 52.750, 8.575
Moka Cache, 100000, 16, 15550290, 31954711, 51.336, 17.365
Moka Cache, 100000, 24, 15543954, 31948375, 51.347, 17.743
Moka Cache, 100000, 32, 15527876, 31932297, 51.373, 17.877
./run-tests.sh: line 36: 21740 Segmentation fault      (core dumped) ./target/release/mokabench --enable-invalidate-entries-if --enable-insert-once

I am using Amazon EC2 to run mokabench. After spending a few days on this, I found that it is related to the version of crossbeam-epoch and the number of CPU cores.

| Segfaults? | Moka | cht/moka-cht | crossbeam-epoch | EC2 Instance Type | Arch | vCPUs | OS |
|---|---|---|---|---|---|---|---|
| Yes | v0.5.1 | moka-cht v0.5.0 | v0.9.5 | c5.9xlarge | x86_64 | 36 | Amazon Linux 2 |
| No | v0.5.1 | cht v0.4.1 | v0.8.2 | c5.9xlarge | x86_64 | 36 | Amazon Linux 2 |
| No | v0.5.1 | moka-cht v0.5.0 | v0.9.5 | c5.4xlarge | x86_64 | 16 | Amazon Linux 2 |

crossbeam-epoch is used by moka-cht, the concurrent hash table used by Moka.

I examined stack traces from core dumps and found two patterns. I have not identified the root cause yet. Perhaps a crossbeam_epoch::Owned<T>, which is very similar to Box<T>, stored in moka-cht became a dangling pointer for some reason?

Pattern 1: At Arc::ne()
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055cd7249862e in <alloc::sync::Arc<T> as alloc::sync::ArcEqIdent<T>>::ne ()
    at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:2095
2095	/rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs: No such file or directory.
[Current thread is 1 (Thread 0x7fe61d1e8700 (LWP 7009))]
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /data/core-dumps/mokabench-copy/target/release/mokabench.
Use `info auto-load python-scripts [REGEXP]' to list them.
Missing separate debuginfos, use: debuginfo-install glibc-2.26-48.amzn2.x86_64 libgcc-7.3.1-13.amzn2.x86_64
(gdb) bt
#0  0x000055cd7249862e in <alloc::sync::Arc<T> as alloc::sync::ArcEqIdent<T>>::ne ()
    at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:2095
#1  <alloc::sync::Arc<T> as core::cmp::PartialEq>::ne () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:2141
#2  core::cmp::impls::<impl core::cmp::PartialEq<&B> for &A>::ne () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/cmp.rs:1356
#3  moka_cht::map::bucket::BucketArray<K,V>::insert_or_modify::{{closure}} ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:255
#4  moka_cht::map::bucket::BucketArray<K,V>::probe_loop ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:367
#5  moka_cht::map::bucket::BucketArray<K,V>::insert_or_modify ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:248
#6  0x000055cd72476961 in moka_cht::map::bucket_array_ref::BucketArrayRef<K,V,S>::insert_with_or_modify_entry_and ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket_array_ref.rs:191
#7  0x000055cd7248d19a in moka_cht::segment::map::HashMap<K,V,S>::insert_with_or_modify_entry_and ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/segment/map.rs:933
#8  moka_cht::segment::map::HashMap<K,V,S>::insert_with_or_modify ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/segment/map.rs:798
#9  moka::sync::value_initializer::ValueInitializer<K,V,S>::try_insert_waiter ()
    at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/value_initializer.rs:108
#10 0x000055cd7248cdf8 in moka::sync::value_initializer::ValueInitializer<K,V,S>::init_or_read ()
    at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/value_initializer.rs:42
#11 0x000055cd72492f74 in moka::sync::cache::Cache<K,V,S>::get_or_insert_with_hash_and_fun ()
    at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/cache.rs:277
#12 moka::sync::cache::Cache<K,V,S>::get_or_insert_with () at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/cache.rs:264
#13 0x000055cd7248f90d in mokabench::cache::sync_cache::SyncCache::get_or_insert_with () at src/cache/sync_cache.rs:43
#14 <mokabench::cache::sync_cache::SyncCache as mokabench::cache::CacheSet<mokabench::parser::ArcTraceEntry>>::get_or_insert_once ()
    at src/cache/sync_cache.rs:79
#15 0x000055cd7246eb87 in <mokabench::cache::sync_cache::SharedSyncCache as mokabench::cache::CacheSet<mokabench::parser::ArcTraceEntry>>::get_or_insert_once
    () at src/cache/sync_cache.rs:125
#16 mokabench::process_commands () at src/lib.rs:107
...
Pattern 2: At atomic_sub() in Arc::drop()
Program terminated with signal SIGSEGV, Segmentation fault.
#0  core::sync::atomic::atomic_sub () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs:2401
2401	/rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs: No such file or directory.
[Current thread is 1 (Thread 0x7f6e0f9b2900 (LWP 32108))]
Missing separate debuginfos, use: debuginfo-install glibc-2.26-48.amzn2.x86_64 libgcc-7.3.1-13.amzn2.x86_64
(gdb) bt
#0  core::sync::atomic::atomic_sub () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs:2401
#1  core::sync::atomic::AtomicUsize::fetch_sub () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs:1769
#2  <alloc::sync::Arc<T> as core::ops::drop::Drop>::drop () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:1558
#3  core::ptr::drop_in_place<alloc::sync::Arc<usize>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#4  core::ptr::drop_in_place<moka_cht::map::bucket::Bucket<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#5  core::ptr::drop_in_place<alloc::boxed::Box<moka_cht::map::bucket::Bucket<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#6  core::mem::drop () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/mem/mod.rs:889
#7  <T as crossbeam_epoch::atomic::Pointable>::drop ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-epoch-0.9.5/src/atomic.rs:212
#8  <crossbeam_epoch::atomic::Owned<T> as core::ops::drop::Drop>::drop ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-epoch-0.9.5/src/atomic.rs:1087
#9  core::ptr::drop_in_place<crossbeam_epoch::atomic::Owned<moka_cht::map::bucket::Bucket<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#10 core::mem::drop () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/mem/mod.rs:889
#11 moka_cht::map::bucket::defer_acquire_destroy::{{closure}} ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:684
#12 crossbeam_epoch::guard::Guard::defer_unchecked ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-epoch-0.9.5/src/guard.rs:195
#13 moka_cht::map::bucket::defer_acquire_destroy () at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:682
#14 <moka_cht::segment::map::HashMap<K,V,S> as core::ops::drop::Drop>::drop ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/segment/map.rs:1032
#15 0x000055db206daf73 in core::ptr::drop_in_place<moka_cht::segment::map::HashMap<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#16 core::ptr::drop_in_place<moka::future::value_initializer::ValueInitializer<usize,alloc::sync::Arc<alloc::boxed::Box<[u8]>>,std::collections::hash::map::RandomState>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#17 alloc::sync::Arc<T>::drop_slow () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:1051
#18 0x000055db206ea837 in mokabench::run_multi_tasks::{{closure}} () at /home/ec2-user/mokabench/src/lib.rs:314
tatsuya6502 self-assigned this Sep 5, 2021
tatsuya6502 added the bug (Something isn't working) label Sep 5, 2021
@tatsuya6502
Member Author

Based on my test results, it might be worth downgrading crossbeam-epoch from v0.9.5 to v0.8.2 to work around the issue. I am preparing the Moka v0.5.2 release with moka-cht v0.4.2 and crossbeam-epoch v0.8.2.

@tatsuya6502
Member Author

Released Moka v0.5.2 with moka-cht v0.4.2 and crossbeam-epoch v0.8.2.

Unfortunately, the same segmentation fault (pattern 1) occurred when I was running mokabench on Moka v0.5.2. I released v0.5.2 anyway, because earlier versions of Moka may already have the same issue and segmentation faults seem to be less frequent with crossbeam-epoch v0.8.2.

@tatsuya6502
Member Author

Just to be sure, I compiled mokabench + Moka v0.5.3 with Rust 1.53.0, which I had not tried since Moka v0.5.1 was released. The result was the same: a segmentation fault occurred after running mokabench for ~2 hours. I used the EC2 instance type with 36 vCPUs.

| Segfaults? | Rust | Moka | crossbeam-epoch | vCPUs |
|---|---|---|---|---|
| Yes | 1.53.0 | v0.5.3 | v0.8.2 | 36 |
| Yes | 1.54.0 | v0.5.2 | v0.8.2 | 36 |
| Yes | 1.55.0 | v0.5.3 | v0.8.2 | 36 |

@lpi

lpi commented Oct 27, 2021

Any progress on this? Does it ever happen on something with 18 vCPUs?

@tatsuya6502
Member Author

tatsuya6502 commented Oct 28, 2021

> Any progress on this?

No 😞. I spent a few more days running different tests, doing code review, etc., but could not find any clue.

I am currently constrained by time (I have to run the test for at least a few hours to reproduce the issue) and money (a 36 vCPU instance is expensive; $1.926/hour). I will revisit this issue when I have more time.

> Does it ever happen on something with 18 vCPUs?

No. It has never happened on a 16 vCPU instance in my tests. Also, no Moka users have reported this or similar problems.

Are you holding off on using Moka because of this problem? If so, perhaps I will add an optional Cargo feature to use an alternative hash table. It will spoil concurrent performance but will be safer.

@tatsuya6502
Member Author

tatsuya6502 commented Feb 20, 2022

Here are some updates on this issue.

It has been five months since I first saw this issue, but (fortunately) no user of this crate has reported segfaults:

  • Segfaults have been occurring only in my testing environment (Amazon EC2) with 32 or more vCPUs.
  • In my testing environment, segfaults have occurred only when the following methods are used:
    • get_or_insert_with
    • get_or_try_insert_with

On January 5th, 2022, I ran the same load tests (mokabench) against Moka v0.7.0 on the following EC2 instances and had some segfaults only on the instances with 32 vCPUs:

| Moka Version | Instance Type | vCPUs | Architecture | OS | Number of segfaults occurred |
|---|---|---|---|---|---|
| v0.7.0 | c6i.8xlarge | 32 | x86_64 | Amazon Linux 2 | 2 times in 4 hours |
| v0.7.0 | c6g.8xlarge | 32 | AArch64 | Amazon Linux 2 | 3 times in 4 hours |
| v0.7.0 | c6i.4xlarge | 16 | x86_64 | Amazon Linux 2 | 0 times in 4 hours |

I ran the same but shorter load tests as a part of pre-release testing for v0.7.1 (January 12th, 2022) and v0.7.2 (February 6th, 2022). There was no segfault for v0.7.2:

| Moka Version | Instance Type | vCPUs | Architecture | OS | Number of segfaults occurred |
|---|---|---|---|---|---|
| v0.7.1 | c6i.8xlarge | 32 | x86_64 | Amazon Linux 2 | 1 time in 2.5 hours |
| v0.7.2 | c6i.8xlarge | 32 | x86_64 | Amazon Linux 2 | 0 times in 4 hours |

v0.7.2 has fixes and enhancements for #72. It might have mitigated the issue but I am not 100% sure because I still have not figured out the root cause of those segfaults.

@Dessix

Dessix commented Feb 28, 2022

MIRI or Loom may be able to spot the issue, if you use them to test the contracts of the internal HashMap implementation.

@tatsuya6502
Member Author

tatsuya6502 commented May 19, 2022

Here are some updates on this issue.

  • Segfaults are occurring only in my testing environments.
    • Nobody else has reported this or similar issues.
  • With Moka's unmodified source code, I need an Amazon EC2 instance with 32 or more vCPUs to reproduce this issue.
  • If I modify Moka's source code to reduce the number of internal segments of our HashMap from 16 to 1, I can reproduce this issue on the following machines:
    • Mac mini M1 running macOS arm64. (4 × performance cores + 4 × efficiency cores)
    • QEMU on Mac mini M1 running Ubuntu Server Arm (AArch64). (4 × vCPUs)
  • I generate the workload using the mokabench program, with 36 to 48 client threads concurrently reading from and writing to one cache.
    • mokabench will repeat short (~15 seconds) but very intensive workload.
    • It usually takes 1 to 2 hours to reproduce the issue.

Our internal HashMap is a lock-free container and depends heavily on atomic operations such as compare-and-swap (CAS). Parallelism seems to be the key to triggering the issue; e.g. more than one processor core executing CAS on the same memory location at the same time. The HashMap also depends heavily on crossbeam-epoch's epoch-based memory reclamation (garbage collection, GC), which relies on CAS as well.

I think the most suspicious area is rehashing, which is used to extend the HashMap capacity and to run epoch-GC on deleted keys. There should be lots of CAS conflicts and retries, and epoch-GC runs during rehashing.
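To illustrate what "CAS conflicts and retries" means here, the following is a minimal sketch using only the standard library. It is not code from moka or moka-cht; the function names `cas_increment` and `run_contended` are made up for this example. Several threads update one atomic with `compare_exchange_weak`, and every failed attempt is counted as a retry.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Increments `counter` once using a CAS retry loop and returns how many
/// times the CAS failed before it succeeded. (Hypothetical helper.)
fn cas_increment(counter: &AtomicUsize) -> usize {
    let mut retries = 0;
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(
            current,
            current + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return retries,
            Err(observed) => {
                // Another thread changed the value first (or the weak CAS
                // failed spuriously); retry with the value we observed.
                current = observed;
                retries += 1;
            }
        }
    }
}

/// Runs `threads` threads that each perform `per_thread` CAS increments.
/// Returns (final counter value, total number of CAS retries).
fn run_contended(threads: usize, per_thread: usize) -> (usize, usize) {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                (0..per_thread)
                    .map(|_| cas_increment(&counter))
                    .sum::<usize>()
            })
        })
        .collect();
    let retries: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (counter.load(Ordering::Acquire), retries)
}
```

Calling `run_contended(8, 100_000)` always yields a final value of 800000, while the retry count varies from run to run and tends to grow with the number of cores that execute the loop at the same time.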

Action Plans

  1. To mitigate the issue, increase the number of the internal segments of our HashMap.
  2. Continue testing with different configurations to isolate the problem area:
    • e.g. Modify the code to change rehashing behavior.
  3. Enable Loom testing:
    • This may require a non-trivial amount of work.
    • e.g. We will need to upgrade crossbeam-epoch from v0.8.3 to v0.9 to get Loom support (?)
  4. Enable Miri testing on the HashMap etc.
    • This may require a non-trivial amount of work too.
    • I already tried this in January 2022, but I could not get even a single unit test to finish in ~10 hours. (Miri is very slow when testing multi-threaded code)
    • We will need to reduce the number of threads and number of cache entries in each test until Miri can finish in a reasonable time frame.
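One possible way to shrink a test only when it runs under Miri is the `cfg!(miri)` check, which Miri sets while interpreting the program. The sketch below is an assumption about how such a test could look, not Moka's actual test code: the sizes are illustrative, and `Mutex<HashMap>` is just a stand-in for the concurrent map under test.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Returns (threads, entries per thread). The values are tiny under Miri,
/// which interprets the program and is therefore very slow, and larger
/// otherwise. The concrete numbers are illustrative assumptions.
fn test_scale() -> (usize, usize) {
    if cfg!(miri) {
        (2, 16)
    } else {
        (8, 10_000)
    }
}

/// Inserts disjoint keys from several threads and returns the final entry
/// count. `Mutex<HashMap>` is only a stand-in for the map being tested.
fn concurrent_insert_count() -> usize {
    let (threads, entries) = test_scale();
    let map = Arc::new(Mutex::new(HashMap::new()));
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..entries {
                    // Each thread writes to its own key range.
                    map.lock().unwrap().insert(t * entries + i, i);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let len = map.lock().unwrap().len();
    len
}
```

With this shape, the same test body runs at full size under `cargo test` and at a reduced size under `cargo miri test`.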

@tatsuya6502
Member Author

> To mitigate the issue, increase the number of the internal segments of our HashMap.

This workaround is added via #129.

@SimonSapin

SimonSapin commented Jun 28, 2022

Cargo.toml points here:

moka/Cargo.toml

Lines 52 to 55 in 8f61b35

# Although v0.8.2 is not the current version (v0.9.x), we will keep using it until
# we perform enough tests to get conformable with memory safety.
# See: https://github.com/moka-rs/moka/issues/34
crossbeam-epoch = "0.8.2"

crossbeam-epoch 0.8.2 depends on crossbeam-utils 0.7.x, which is affected by GHSA-qc84-gqf4-9926

Is the work around in #129 to upgrade moka’s dependency of crossbeam-epoch?

@tatsuya6502
Member Author

Hi @SimonSapin,

> crossbeam-epoch 0.8.2 depends on crossbeam-utils 0.7.x, which is affected by GHSA-qc84-gqf4-9926

Thank you for the information.

> Is the work around in #129 to upgrade moka’s dependency of crossbeam-epoch?

No. I do not think so, unfortunately.

I have another Moka repository here and it has crossbeam-epoch upgraded to v0.9.9:

and I ran the same test on Moka with both crossbeam-epoch v0.8.2 and v0.9.9. I found that Moka with crossbeam-epoch v0.9.9 still has the same issue.

Moka with crossbeam-epoch v0.9.9

Segfaults occurred four times in about four hours.

$ rg '(Segmentation fault|Bus error)' epoch09-2022-0618.log 
271:./run-tests-insert-once.sh: line 26: 94446 Segmentation fault: 11  ./target/release/mokabench --invalidate --insert-once
283:./run-tests-insert-once.sh: line 30: 94453 Segmentation fault: 11  ./target/release/mokabench --invalidate-entries-if --insert-once

$ rg '(Segmentation fault|Bus error)' epoch09-2022-0619A.log
243:./run-tests-insert-once.sh: line 18: 99154 Segmentation fault: 11  ./target/release/mokabench --insert-once --size-aware
326:./run-tests-insert-once.sh: line 30: 99301 Segmentation fault: 11  ./target/release/mokabench --invalidate-entries-if --insert-once

$ cat epoch09-2022-0618.log
...
cargo tree --all-features  
...
│   ├── crossbeam-epoch v0.9.9
│   │   ├── cfg-if v1.0.0
│   │   ├── crossbeam-utils v0.8.9 (*)

Moka with crossbeam-epoch v0.8.2

Segfaults occurred three times in about four hours.

$ rg '(Segmentation fault|Bus error)' epoch08-2022-0619.log 
349:./run-tests-insert-once.sh: line 26: 95369 Segmentation fault: 11  ./target/release/mokabench --invalidate --insert-once

$ rg '(Segmentation fault|Bus error)' epoch08-2022-0619B.log
339:./run-tests-insert-once.sh: line 30:   478 Segmentation fault: 11  ./target/release/mokabench --invalidate-entries-if --insert-once
385:./run-tests-insert-once.sh: line 38:   536 Segmentation fault: 11  ./target/release/mokabench --ttl 3 --tti 1 --invalidate --insert-once --size-aware

$ cat epoch08-2022-0619.log
...
cargo tree --all-features  
...
│   ├── crossbeam-epoch v0.8.2
│   │   ├── cfg-if v0.1.10
│   │   ├── crossbeam-utils v0.7.2

NOTE: To make segfaults occur more often, I used a modified Moka that sets the number of moka::cht::HashMap segments to 1. (The release versions have it set to 64.)
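As a rough illustration of why the segment count matters, the sketch below assigns keys to segments with a simple `hash % segments` scheme. This scheme and the names `segment_index` and `keys_per_segment` are assumptions made for the example; moka::cht may select segments differently. With one segment every key lands in the same table, so all threads contend on it; with 64 segments the keys (and the contention) are spread out.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Picks the segment for `key`, assuming a simple `hash % num_segments`
/// scheme (an assumption for illustration only).
fn segment_index<K: Hash>(key: &K, num_segments: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % num_segments
}

/// Counts how many of the keys `0..num_keys` land in each segment.
fn keys_per_segment(num_keys: u64, num_segments: usize) -> Vec<usize> {
    let mut counts = vec![0; num_segments];
    for key in 0..num_keys {
        counts[segment_index(&key, num_segments)] += 1;
    }
    counts
}
```

For example, `keys_per_segment(1000, 1)` puts all 1000 keys into the single segment, while `keys_per_segment(1000, 64)` distributes them over 64 independent tables.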

Anyway, I will continue evaluating crossbeam-epoch v0.9.9 in parallel with v0.8.2, and will upgrade Moka's dependency to v0.9.9 once I am confident that v0.9.9 will not increase the chance of segfaults.

I am also watching every release of the crossbeam-* and parking_lot crates, and testing them if they have any fixes for memory safety issues. I am reviewing the source code of Moka and of those crates when I have time. I hope I can isolate the code causing the issue.

@tatsuya6502
Member Author

FYI, I created a draft pull request #157 to upgrade crossbeam-epoch from v0.8.2 to v0.9.9. I scheduled it for next patch release Moka v0.8.7.

As I wrote in the PR, I will run some mokabench tests before merging it. I will be able to run mokabench for 6 hours a day (during night), so if everything goes well, the test will complete in 4 days (total 24 hours).

@tatsuya6502
Member Author

> FYI, I created a draft pull request #157 to upgrade crossbeam-epoch from v0.8.2 to v0.9.9.
> ...
> so if everything goes well, the test will complete in 4 days (total 24 hours).

Unfortunately, I found that upgrading crossbeam-epoch to v0.9.9 would actually make this issue worse on Linux x86_64. It occurred ~15% more often with v0.9.9 than with v0.8.2, so I am hesitant to merge the PR.

Just to be sure, I will run the same test again this weekend.

tatsuya6502 added a commit that referenced this issue Jul 19, 2022
- Add a lock to the rehash function of the concurrent hash table (`moka::cht`) to
  ensure only one thread can participate rehashing at a time.
- To prevent potential inconsistency issues in non x86 based systems, strengthen the
  memory ordering used for `compare_exchange_weak` (`Release` to `AcqRel`).
@tatsuya6502
Member Author

Finally, I believe I fixed this issue via #157.

Last week, I got a new x86_64 based Linux PC with 20 logical cores (Intel Core i7-12700F), and it helped me a lot in reproducing and investigating the issue. I found the cause of the issue last night and fixed it. Since the fix, I have not been able to reproduce the issue on either the PC (Linux x86_64) or the Mac (macOS arm64).

The cause was race conditions that occur when many threads concurrently rehash (extend or shrink) the internal hash table moka::cht. The creator of the original cht designed it to work in such a situation, but it does not work as expected. So I added a lock to ensure that only one thread can participate in rehashing at a time. This actually increased performance in my load tests, because it prevents heavy retries on the atomic CAS operation compare_exchange_weak.

I also found that the memory ordering used for compare_exchange_weak is too weak for non-x86 platforms and may cause inconsistency between threads. So I changed it to one that I believe is strong enough.
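The two changes can be sketched with the standard library alone. This is not the code from #157: the `Table` type, its fields, and `concurrent_grow` are hypothetical, and the "table" only tracks a capacity. It shows a mutex that lets one thread at a time rehash, a re-check under the lock so that late arrivals skip redundant work, and a `compare_exchange_weak` that uses `AcqRel` on success.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a resizable table; only the capacity is tracked.
struct Table {
    capacity: AtomicUsize,
    rehash_lock: Mutex<()>,
    rehash_count: AtomicUsize,
}

impl Table {
    fn new(capacity: usize) -> Self {
        Table {
            capacity: AtomicUsize::new(capacity),
            rehash_lock: Mutex::new(()),
            rehash_count: AtomicUsize::new(0),
        }
    }

    /// Grows the table to `wanted` slots unless it is already that large.
    fn grow_to(&self, wanted: usize) {
        // Only one thread at a time may rehash.
        let _guard = self.rehash_lock.lock().unwrap();
        let mut current = self.capacity.load(Ordering::Acquire);
        if current >= wanted {
            return; // Another thread finished the rehash while we waited.
        }
        // A real table would copy its buckets into a new array here.
        // Publish the new capacity; AcqRel on success both releases our
        // writes and acquires those of the previous writer.
        while let Err(observed) = self.capacity.compare_exchange_weak(
            current,
            wanted,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            current = observed; // Spurious failure of the weak CAS; retry.
        }
        self.rehash_count.fetch_add(1, Ordering::Relaxed);
    }
}

/// Lets `threads` threads request the same growth concurrently.
/// Returns (final capacity, number of rehashes actually performed).
fn concurrent_grow(threads: usize, wanted: usize) -> (usize, usize) {
    let table = Arc::new(Table::new(16));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let table = Arc::clone(&table);
            thread::spawn(move || table.grow_to(wanted))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    (
        table.capacity.load(Ordering::Acquire),
        table.rehash_count.load(Ordering::Relaxed),
    )
}
```

In this sketch, `concurrent_grow(8, 1024)` performs exactly one rehash no matter how many threads request it, which is the kind of retry reduction the comment above describes.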

#157 also upgrades crossbeam-epoch to the latest version (v0.9.9).

tatsuya6502 added this to the v0.9.2 milestone Jul 19, 2022
@tatsuya6502
Member Author

I have published v0.9.2 with this fix to crates.io.
