New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposed fix for #1183 - multiple regex within short time might give identical strings #1187
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Approved as I've worked together with @MagnusMikkelsen on this in moodmosaic/Fare#56. See the comments I've added though as we might have to make the newly added test more robust.
Co-authored-by: Nikos Baxevanis <nikos.baxevanis@gmail.com>
Added a test to FixtureTests, to verify that #1183 is fixed as suggested here: #1187 (comment) This required making sure that the outcome space of the regex pattern is great enough. Xeger seems to favor shorter strings, and this gave collisions when generating RegularExpressionValidatedType . So minimum length of the pattern in this type was lengthened from 1 to 20. Removed assertion roulette in RegularExpressionGeneratorTests.CreateMultipleWithRegularExpressionRequestReturnsDifferentResults(...) according to suggestion: #1187 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LGTM
string regex = new Xeger(pattern).Generate(); | ||
// Use the Xeger constructor overload that that takes an instance of Random. | ||
// Otherwise identically strings can be generated, if regex are generated within short time. | ||
string regex = new Xeger(pattern, this.random).Generate(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per my knowledge Random
is not thread safe. See e.g. here. Previously we created a random instance per request (probably internally within Xeger), so we didn't have that problem.
Fixture by design can be used from multiple threads simultaneously, so we should take care of multi-threading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll look into it.
Should we use the code from https://codeblog.jonskeet.uk/2009/11/04/revisiting-randomness/ ?
Basically doing Xeger(pattern, ThreadLocalRandom.Instance)
This way we could abstract threadsafety concerns out of RegularExpressionGenerator... 🤔, and maybe reuse it elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MagnusMikkelsen you can just follow the pattern used in all the other generators, e.g.
lock (this.syncRoot) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@moodmosaic @MagnusMikkelsen If there is any way to do that without locks, please prefer that way. Introducing locking makes it hard to optimize performance later. When you have a bit suite, AutoFixture already takes substantial amount of time due to uncached reflection is uses everywhere - we shouldn't degrade it even more 😟
In this particular case, it might be better to abstract out random generation and create generator, which uses e.g. Interlocked
. Or, for instance, create thread-local random, but use interlocked to generate seed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interlocked.CompareExchange or (IIRC) the Interlocked-extras by Jeffrey Richter in CLR via C# can be candidates.
- Unless you measure it, you can’t actually know how much ‘faster’ it’s going to be
- If speed matters, perhaps we can consider Rework random generator to use LCG #928 — or look into other PRNGs
Though right now, I’d suggest we stick with lock
as done in all other places and be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how to use Interlocked.CompareExchange for this?
I did a couple of measurements, running the test below.
The times are measured using dotTrace, and are the total time and own time for RegularExpressionGenerator.GenerateRegularExpression
.
The last test (ThreadLocal) took 12s. Seems to make sense, given the GenerateRegularExpression
used some 85 seconds in total on 8 threads.
Total Time [ms] | Own [ms] | |
---|---|---|
Current: shared Random |
83387 | 0 |
Original: No injected Random |
88926 | 10 |
New Random each timelocking on generate seed |
100843 | 0 |
Sharing ThreadLocal locking when generating ThreadLocal |
84848 | 27 |
(It is beside the point of this PR - but the Xeger.ctor
seems to take most of this time - 81000ms total time. Generate
is about 550ms total time. Might be an idea to cache Xeger
? Fx dictionary[pattern]=Xeger
)
[Fact]
public void CreateManyAnonymousInParallelWithRegularExpressionValidatedTypeReturnsCorrectResult()
{
// Arrange
var fixture = new Fixture();
var result = Enumerable
.Range(0, 10000)
.AsParallel()
.WithDegreeOfParallelism(8)
.Select(_ => fixture.Create<RegularExpressionValidatedType>())
.ToArray();
// Assert
Assert.All(result, r => Assert.Matches(RegularExpressionValidatedType.Pattern, r.Property));
Assert.Equal(result.Distinct(), result);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interlocked-extras by Jeffrey Richter in CLR via C# can be candidates.
Haven't got that book - so I don't know it, do you have any links with examples? I'd like to learn new tricks 😁
… now only shared on each thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I like RegularExpressionGenerator owning the ThreadLocal, as it forces RegularExpressionGenerator to be disposable. What are your thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MagnusMikkelsen, thank you for the investigation 👍 @zvirja has been much more active on the project, over the last years, than I am, so it’s up to him to decide whether the proposed direction with thread-local storage of data is warranted.
As in #1187 (comment) my opinion remains the same; just use lock
and plan for perf improvements separately when needed.
…and with room for future perf improvement 😊
You're welcome - I'm having fun 😁
I think I agree. In case there are many scenarios out there, that uses the same instance of Fixture to generate the same patterns many times - a major improvement would be to cache Xeger. Just ran a small experiment in linqpad. Creating 1000 |
@MagnusMikkelsen Thanks for the rework and fixing the concurrency issue. I fully agree that now we can proceed further with the current approach and merge the code. You are guarding by lock only random seed generation, so it sounds reasonable. My concern was about protecting the whole Just in case you are curious, my idea was to have something like: static private int RandomSeed = Environment.TickCount;
...
new Xeger(new Random(Interlocked.Add(ref RandomSeed, 1))) This way code is completely lock-free and you are always giving a new seed to random. I believe in this case it doesn't matter whether we just increment seed by one, or use completely random input - the Anyway, it was just in case you are curious and I'm fine with the current implementation. Let me know when you want me to merge the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks! 👍
From a thread-safety viewpoint, this looks reasonable. In the context of pseudorandom number generators, this must be sampled in order to make sure it'll produce output that is uniformly distributed random and birthday paradox-free, at the very least. |
Given @moodmosaic comment, shouldn't we just merge it as is @zvirja ? |
@moodmosaic The same should be said about the current approach, shouldn't it? 😉 If none of the approaches was tested to prove the statistical quality of the resulting random sequence, we could just ignore this aspect, assuming that both approaches are equal. And with that we could solely focus on performance aspect. Anyway, merging the code! Thanks again @MagnusMikkelsen for your help! |
Yeah, but at least right we're following the 'pattern' so we're at least consistent. |
@aivascu I suggest to drop a new release, so the fix could be consumed. Let me know if you need my assistance on this ;-) |
👍 🎊 🚀 Thanks @MagnusMikkelsen for your work and @zvirja for the great feedback 🍻 |
@zvirja the release 4.13.0 has been deployed |
Included:
Xeger(pattern, random)
constructor overloadRandom
each time. I figured it should be enough to avoid the issue with identical seeds.Fix #1183