Skip to content

check for LockTryOnce before delay #21

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 28, 2022

Conversation

maxoris
Copy link
Contributor

@maxoris maxoris commented Jul 9, 2020

Solving issue #20

@mfkl
Copy link
Contributor

mfkl commented Jul 16, 2020

Should the unit tests be adjusted/modified to reflect this change? https://github.com/G-Research/consuldotnet/blob/master/Consul.Test/LockTest.cs

Copy link
Member

@jgiannuzzi jgiannuzzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per #20 (comment):

A fix would need to check whether LockTryOnce is false and whether the elapsed time is greater than LockWaitTime before waiting for LockRetryTime. If we don't do that, I think we might introduce a hot loop.

As it is right now, your PR introduce a hot loop when LockTryOnce is set to false, and it does not fix #20. Could you please fix the condition?

@jgiannuzzi
Copy link
Member

Should the unit tests be adjusted/modified to reflect this change? https://github.com/G-Research/consuldotnet/blob/master/Consul.Test/LockTest.cs

Thanks @mfkl, that is a very good point!

I do wonder how to properly test this however, knowing that we can't rely on precise timings on the CI system.
Any ideas?

@jgiannuzzi jgiannuzzi added this to the v1.6.9 milestone Oct 20, 2020
@jgiannuzzi jgiannuzzi linked an issue Oct 20, 2020 that may be closed by this pull request
@mfkl
Copy link
Contributor

mfkl commented Feb 11, 2021

I do wonder how to properly test this however, knowing that we can't rely on precise timings on the CI system.
Any ideas?

Maybe with some mocks? But it would require some more work. I'd say if the values used in tests make it pass 100% of the time, it should be fine (though not ideal). This PR needs a rebase before it is merged.

LGTM.

@jgiannuzzi jgiannuzzi modified the milestones: v1.6.10, v1.6.10.2 Jun 25, 2021
@mfkl
Copy link
Contributor

mfkl commented Feb 4, 2022

[xUnit.net 00:03:59.70]     Consul.Test.LockTest.Lock_TryAcquireOnceWithLockDelayNonZeroWaitTime_EnsureRetryWait [FAIL]
Error: Consul.Test.LockTest.Lock_TryAcquireOnceWithLockDelayNonZeroWaitTime_EnsureRetryWait: Assert.True() Failure
  Failed Consul.Test.LockTest.Lock_TryAcquireOnceWithLockDelayNonZeroWaitTime_EnsureRetryWait [5 s]
  Error Message:
   Assert.True() Failure
Expected: True
Actual:   False
  Stack Trace:
     at Consul.Test.LockTest.Lock_TryAcquireOnceWithLockDelayNonZeroWaitTime_EnsureRetryWait() in /home/runner/work/consuldotnet/consuldotnet/Consul.Test/LockTest.cs:line 604

@mfkl
Copy link
Contributor

mfkl commented Feb 25, 2022

That simple attempt worked on my fork's CI :/ https://github.com/mfkl/consuldotnet/actions/runs/1897597504

I can't repro this locally. Inclined to think the test is a bit too strict, timing-wise...

@mfkl
Copy link
Contributor

mfkl commented Mar 17, 2022

On WSL Ubuntu 20.04:

Test Run Successful.
Total tests: 172
     Passed: 168
    Skipped: 4

Yet it looks like I can reproduce the test failure locally on WSL Ubuntu 18.4... only sometimes 🤔

@mfkl
Copy link
Contributor

mfkl commented Apr 7, 2022

With some trouble, I can reproduce sporadically the test failure on ubuntu by letting this script run the test in a loop for a while

#!/bin/bash

err=0
while true
do
    dotnet test Consul.Test --configuration=Release --framework=net6.0 --logger:"console;verbosity=detailed" --filter "Lock_TryAcquireOnceWithLockDelayNonZeroWaitTime_EnsureRetryWait"
    (( (err=$?) > 0 )) && break
done
echo "$err" 

After several minutes, it indeed fails the test. I added debug log statements in a few places, and it turns out that this line

try { await Task.Delay(Opts.LockRetryTime, ct).ConfigureAwait(false); }

doesn't actually do what it's supposed to. With

Console.WriteLine(">>>> Waiting for " + Opts.LockRetryTime.TotalMilliseconds);
Console.WriteLine(">>>> current watch before wait " + sw.ElapsedMilliseconds);
try 
{
    await Task.Delay(Opts.LockRetryTime, ct).ConfigureAwait(false); 
    Console.WriteLine(">>>> current watch after wait " + sw.ElapsedMilliseconds);
}

When the test fails, this actually outputs

>>>> Waiting for 5000
>>>> current watch before wait 2
>>>> current watch after wait 4994

I believe we might be hitting dotnet/runtime#45585. Using a Thread.Sleep here does not sound good, even if only in Linux builds. We could make the test less strict about the timing through (add 10-15 milliseconds just in case), though that feels hackish as well.

@marcin-krystianc
Copy link
Contributor

I could also reproduce this problem locally and it seems clear that dotnet/runtime#45585 is real (or maybe it is just an issue with documentation ?!).
My conclusions:

  • On Linux, measuring elapsed time with Environment.TickCount64 seems to work reliably, whereas Stopwatch can sometimes measure smaller elapsed time than the delay itself.
  • On Windows, there is no reliable technique to measure elapsed time as both Environment.TickCount64 and Stopwatch can sometimes measure smaller elapsed time than the delay itself.

Given that on Windows all existing methods of measuring elapsed time can give value smaller than the delay, I would argue that we should use some fraction of expected delay time in tests e.g.:

// https://github.com/dotnet/runtime/issues/45585
Assert.True(stopwatch.ElapsedMilliseconds > oneShotLockOptions.LockRetryTime.TotalMilliseconds * 0.9);

@mfkl mfkl force-pushed the hotfix/lock-try-once-delay branch from 85d4162 to 2d1a032 Compare April 15, 2022 05:10
Copy link
Contributor

@marcin-krystianc marcin-krystianc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

I think it is a good opportunity to add some documentation around LockTryOnce (https://github.com/G-Research/consuldotnet/blob/master/Consul/Lock.cs#L610) option to avoid future confusions:

LockTryOnce = false - Acquire method will block forever until the lock is acquired. LockWaitTime is ignored in this case.
LockTryOnce = true - Acquire the lock within a timestamp (It is analogous to `SemaphoreSlim.Wait(Timespan timeout)`. 
                     Under the hood, it attempts to acquire the lock multiple times if needed (due to the HTTP Long Poll returning early), and will do so as many times as it can within the bounds set by LockWaitTime. 
                     If LockWaitTime is set to 0, there will be only single attempt to acquire the lock. 

jgiannuzzi
jgiannuzzi previously approved these changes Apr 22, 2022
@mfkl mfkl dismissed stale reviews from jgiannuzzi and marcin-krystianc via 0ed8f53 April 28, 2022 05:39
@mfkl mfkl force-pushed the hotfix/lock-try-once-delay branch from 0ed8f53 to 4c3e24c Compare April 28, 2022 05:51
@mfkl mfkl merged commit af93ebf into G-Research:master Apr 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unnecessary delay in LockTryOnce scenario
4 participants