hill climbing #191

Merged: 10 commits into main, Aug 22, 2022
Conversation

bitfaster (Owner) commented Aug 21, 2022

Use hill climbing to optimize hit rate by adapting the size of the window and main segments based on changes in the hit rate at run time.
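The adaptation loop described above can be sketched as follows. This is a minimal illustration with hypothetical names and an assumed initial step size, written in Java rather than the library's C#; the real implementation samples the hit rate over a window of requests and applies the adjustment to the window/main partition.

```java
// Minimal hill-climbing sketch (hypothetical, not the PR's actual code):
// after each sample period, keep moving the window/main split in the same
// direction while the hit rate improves, and reverse when it drops.
final class HillClimberSketch {
    double stepSize = 0.0625;     // assumed initial step, as a fraction of capacity
    double previousHitRate = 0.0;

    /** Returns the signed adjustment to apply to the window size. */
    double climb(double hitRate) {
        double change = hitRate - previousHitRate;
        previousHitRate = hitRate;
        if (change < 0) {
            stepSize = -stepSize; // hit rate got worse: reverse direction
        }
        return stepSize;
    }

    public static void main(String[] args) {
        HillClimberSketch c = new HillClimberSketch();
        System.out.println(c.climb(0.50) > 0); // improving: keep direction -> true
        System.out.println(c.climb(0.40) < 0); // got worse: reverse        -> true
    }
}
```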

This significantly improves the hit rate for the ARC OLTP trace; see below.

ARC OLTP (hit rate, %)

| CacheSize | ClassicLruHitRate | ConcurrentLruHitRate | ConcurrentLfuHitRate |
|----------:|------------------:|---------------------:|---------------------:|
| 250 | 16.47 | 15.63 | 24.73 |
| 500 | 23.45 | 27.13 | 33.70 |
| 750 | 28.28 | 32.29 | 37.11 |
| 1000 | 32.83 | 36.01 | 39.95 |
| 1250 | 36.21 | 38.68 | 42.33 |
| 1500 | 38.70 | 40.69 | 43.71 |
| 1750 | 40.79 | 42.48 | 45.01 |
| 2000 | 42.47 | 44.17 | 46.07 |

[image: ARC OLTP hit rate chart]

Previous result without hill climbing:

[image: previous ARC OLTP hit rate chart, without hill climbing]

ARC Database (hit rate, %)

| CacheSize | ClassicLruHitRate | ConcurrentLruHitRate | ConcurrentLfuHitRate |
|----------:|------------------:|---------------------:|---------------------:|
| 1000000 | 3.09 | 11.95 | 14.15 |
| 2000000 | 10.74 | 23.24 | 28.40 |
| 3000000 | 18.59 | 36.19 | 39.72 |
| 4000000 | 20.24 | 39.61 | 45.10 |
| 5000000 | 21.03 | 45.15 | 50.87 |
| 6000000 | 33.95 | 50.37 | 57.59 |
| 7000000 | 38.90 | 56.08 | 63.89 |
| 8000000 | 43.03 | 61.00 | 70.24 |

[image: ARC Database hit rate chart]

@bitfaster bitfaster marked this pull request as ready for review August 22, 2022 03:53
@bitfaster bitfaster merged commit 20f8412 into main Aug 22, 2022
@bitfaster bitfaster deleted the users/alexpeck/climb branch August 22, 2022 03:53
```csharp
double amount = (hitRateChange >= 0) ? stepSize : -stepSize;

double nextStepSize = (Math.Abs(hitRateChange) >= HillClimberRestartThreshold)
    ? HillClimberStepPercent * (amount >= 0 ? 1 : -1)
    // (the other branch of the conditional is truncated in the diff excerpt)
```
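A self-contained sketch of the step-size update shown above (Java for illustration; the constants are assumptions, following the adaptive-sizing scheme in Caffeine that this PR ports): a large swing in hit rate restarts the step at its initial size in the current direction, otherwise the step decays toward zero.

```java
// Hedged sketch of the step-size restart/decay logic (assumed constants,
// hypothetical names; not the PR's actual code).
final class StepSizeSketch {
    static final double RestartThreshold = 0.05;  // assumed
    static final double StepPercent = 0.0625;     // assumed
    static final double StepDecayRate = 0.98;     // assumed

    static double nextStepSize(double hitRateChange, double stepSize) {
        // Step in the direction of the last change in hit rate.
        double amount = (hitRateChange >= 0) ? stepSize : -stepSize;
        return (Math.abs(hitRateChange) >= RestartThreshold)
            ? StepPercent * (amount >= 0 ? 1 : -1)  // big swing: restart the step
            : StepDecayRate * amount;                // small swing: decay the step
    }

    public static void main(String[] args) {
        System.out.println(nextStepSize(0.10, 0.02)); // restart: 0.0625
        System.out.println(nextStepSize(0.01, 0.02)); // decay: ~0.0196
    }
}
```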

Reviewer: is this supposed to be multiplied by the maximum?

bitfaster (Owner, Author):

Without the max, amount is computed as a percent change to the ratio of window to main. Since it's a percentage, I just add it directly to mainRatio (also a percentage) and pass that into the ComputeQueueCapacity() function I already had. I should probably clean this up; I was excited to see it working.

In your code, I think amount is computed as the actual number of slots, and setAdjustment() adds/subtracts this number from the main and window queue capacities (I couldn't find where this is implemented when searching your code, but it seems like the unit of amount must be number of slots).

Do you have any clamping to prevent drift to an invalid state?

Reviewer:

The increase/decrease methods check that it doesn’t go beyond a bound. Sounds like it should have the same result?

Congrats on getting it working! Have you tried the stress test scenario yet (corda & loop)?
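The bounded adjustment being discussed might look like this in minimal form. This is a sketch with assumed bounds and hypothetical names, not the actual increase/decrease methods: each step is applied to the window ratio and clamped, so repeated steps cannot drift the partition into an invalid state.

```java
// Sketch of clamped partition adjustment (hypothetical names and bounds).
final class PartitionSketch {
    static final double MinWindowRatio = 0.01;  // assumed lower bound
    static final double MaxWindowRatio = 0.80;  // assumed upper bound
    double windowRatio = 0.01;

    /** Applies a signed adjustment, clamped to [min, max]. */
    void adjust(double amount) {
        windowRatio = Math.max(MinWindowRatio,
                      Math.min(MaxWindowRatio, windowRatio + amount));
    }

    public static void main(String[] args) {
        PartitionSketch p = new PartitionSketch();
        p.adjust(5.0);                      // overshoots: clamped at upper bound
        System.out.println(p.windowRatio);  // 0.8
        p.adjust(-10.0);                    // undershoots: clamped at lower bound
        System.out.println(p.windowRatio);  // 0.01
    }
}
```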

Reviewer:

Oh in mine it’s number of units where an entry may take multiple (e.g. memory bound). Not sure if that adds a wrinkle or works fine in your approach.

bitfaster (Owner, Author):

I haven't tested exhaustively yet, but I think it is working well.

Uniform weight for sure makes it much simpler. I was puzzled for a while about all the different cases handled in your evictFromMain method; then I realized that weighting results in more complicated queue configurations that are unreachable for me.

I will try combining corda and loop. I made a unit test called WhenHitRateFluctuatesWindowIsAdapted that does a sanity check by directly manipulating the hit rate, and it works as expected. Since I copy-pasted all your carefully tuned parameters and the core logic, I think on the same traces it will produce a very similar, if not identical, result.

It is really an ingenious addition: probably 5 or 6 lines of code that increase the hit rate by more than 5%.

Reviewer:

wow, that's crazy. I wonder what AMD could be doing???

bitfaster (Owner, Author):

It's weird. I read in Agner Fog's microarchitecture manual that Zen rapidly adjusts the clock speed, which can make performance hard to measure. It will be under constant load during the benchmark, so it seems unlikely it would clock down, and anyway it always affects the same data size; likely it's related to the cache somehow. It's quite a big fluctuation.

I tried running with processor affinity to stick to a single core, and I did a hack to make sure the array is always at a 4-byte-aligned address. Neither made any difference.

Reviewer:

> The Zen 1 can do two memory read operations or one read and one write operation, but not two write operations, in the same clock cycle. The Zen 2 can do two memory read and one write operation per clock cycle.

Maybe it is not realizing that the writes are to the same cache line and is creating a store-buffer data dependency? Ideally it would coalesce the writes into one memory operation, but if not, then I suppose it would be slower. How was the performance of the frequency benchmark?

bitfaster (Owner, Author):

That's a good point: frequency is better with block on AMD and matches expectations. The strange part is that it varies run to run, hence forcing alignment etc. to try to reduce variables.

I pulled all the data for the eviction test, and block is equal or better in all cases:

[image: eviction test results, block vs. flat]

I wondered if the difference you saw in your CI results was due to the difference between AMD and Intel architectures, but my result is not quite the same, since you had both frequency and increment with similar degradation.

Reviewer:

How does the read throughput compare? An increment is the common case, but strangely that was faster in my CI runs too, even though it claimed the sketch was slower (sequenced steps on the same machine).

Very confusing, but I'm glad the end-result benchmarks all show our ideas turned out well. Still... 🤷‍♂️

bitfaster (Owner, Author):

I haven't tested pure reads. For read+write, I tested size=500 and it's similar to eviction. From memory, I think it was about 9.5 million ops/sec for flat, ~10.5 million for flat AVX and block, and about 11.5 million for block AVX. Block is consistently better, and the AVX version gives a larger benefit. I will do a proper comparison with all sizes when I get a chance.

I can't argue with the results, but I would like to understand the cause of the fluctuations. I will run the tests on the current Azure offerings; I think the AMD SKUs are based on Zen 2.

2 participants