Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed fix for #1183 - multiple regex within short time might give identical strings #1187

Merged
merged 5 commits into from Jul 8, 2020
Merged

Conversation

MagnusMikkelsen
Copy link
Contributor

@MagnusMikkelsen MagnusMikkelsen commented Jun 30, 2020

Included:

  • Test that exposes the issue
  • Proposed fix:
    • Use Xeger(pattern, random) constructor overload
    • Use the same instance of Random each time. I figured it should be enough to avoid the issue with identical seeds.

Fix #1183

…ght yield identical stringsr #1183 (+1 squashed commits)

Squashed commits:

[e8e3387] Proposed fix for #1183 - creating multiple regex within short time might yield identical strings
Copy link
Member

@moodmosaic moodmosaic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Approved as I've worked together with @MagnusMikkelsen on this in moodmosaic/Fare#56. See the comments I've added though as we might have to make the newly added test more robust.

Src/AutoFixtureUnitTest/RegularExpressionGeneratorTest.cs Outdated Show resolved Hide resolved
Src/AutoFixtureUnitTest/RegularExpressionGeneratorTest.cs Outdated Show resolved Hide resolved
Src/AutoFixture/RegularExpressionGenerator.cs Show resolved Hide resolved
MagnusMikkelsen and others added 2 commits June 30, 2020 14:29
Co-authored-by: Nikos Baxevanis <nikos.baxevanis@gmail.com>
Added a test to FixtureTests, to verify that #1183 is fixed as suggested here: #1187 (comment)

This required making sure that the outcome space of the regex pattern is great enough. Xeger seems to favor shorter strings, and this gave collisions when generating RegularExpressionValidatedType . So minimum length of the pattern in this type was lengthened from 1 to 20.

Removed assertion roulette in RegularExpressionGeneratorTests.CreateMultipleWithRegularExpressionRequestReturnsDifferentResults(...)
according to suggestion: #1187 (comment)
Copy link
Member

@moodmosaic moodmosaic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM

string regex = new Xeger(pattern).Generate();
// Use the Xeger constructor overload that that takes an instance of Random.
// Otherwise identically strings can be generated, if regex are generated within short time.
string regex = new Xeger(pattern, this.random).Generate();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my knowledge Random is not thread safe. See e.g. here. Previously we created a random instance per request (probably internally within Xeger), so we didn't have that problem.

Fixture by design can be used from multiple threads simultaneously, so we should take care of multi-threading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll look into it.
Should we use the code from https://codeblog.jonskeet.uk/2009/11/04/revisiting-randomness/ ?

Basically doing Xeger(pattern, ThreadLocalRandom.Instance)

This way we could abstract threadsafety concerns out of RegularExpressionGenerator... 🤔, and maybe reuse it elsewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MagnusMikkelsen you can just follow the pattern used in all the other generators, e.g.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@moodmosaic @MagnusMikkelsen If there is any way to do that without locks, please prefer that way. Introducing locking makes it hard to optimize performance later. When you have a bit suite, AutoFixture already takes substantial amount of time due to uncached reflection is uses everywhere - we shouldn't degrade it even more 😟

In this particular case, it might be better to abstract out random generation and create generator, which uses e.g. Interlocked. Or, for instance, create thread-local random, but use interlocked to generate seed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interlocked.CompareExchange or (IIRC) the Interlocked-extras by Jeffrey Richter in CLR via C# can be candidates.

Though right now, I’d suggest we stick with lock as done in all other places and be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how to use Interlocked.CompareExchange for this?

I did a couple of measurements, running the test below.
The times are measured using dotTrace, and are the total time and own time for RegularExpressionGenerator.GenerateRegularExpression.
The last test (ThreadLocal) took 12s. Seems to make sense, given the GenerateRegularExpression used some 85 seconds in total on 8 threads.

Total Time [ms] Own [ms]
Current: shared Random 83387 0
Original: No injected Random 88926 10
New Random each time
locking on generate seed
100843 0
Sharing ThreadLocal
locking when generating ThreadLocal
84848 27

(It is beside the point of this PR - but the Xeger.ctor seems to take most of this time - 81000ms total time. Generate is about 550ms total time. Might be an idea to cache Xeger? Fx dictionary[pattern]=Xeger)

        [Fact]
        public void CreateManyAnonymousInParallelWithRegularExpressionValidatedTypeReturnsCorrectResult()
        {
            // Arrange
            var fixture = new Fixture();
            var result = Enumerable
                .Range(0, 10000)
                .AsParallel()
                .WithDegreeOfParallelism(8)
                .Select(_ => fixture.Create<RegularExpressionValidatedType>())
                .ToArray();

            // Assert

            Assert.All(result, r => Assert.Matches(RegularExpressionValidatedType.Pattern, r.Property));
            Assert.Equal(result.Distinct(), result);
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interlocked-extras by Jeffrey Richter in CLR via C# can be candidates.

Haven't got that book - so I don't know it, do you have any links with examples? I'd like to learn new tricks 😁

Copy link
Contributor Author

@MagnusMikkelsen MagnusMikkelsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I like RegularExpressionGenerator owning the ThreadLocal, as it forces RegularExpressionGenerator to be disposable. What are your thoughts

Copy link
Member

@moodmosaic moodmosaic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MagnusMikkelsen, thank you for the investigation 👍 @zvirja has been much more active on the project, over the last years, than I am, so it’s up to him to decide whether the proposed direction with thread-local storage of data is warranted.

As in #1187 (comment) my opinion remains the same; just use lock and plan for perf improvements separately when needed.

…and with room for future perf improvement 😊
@MagnusMikkelsen
Copy link
Contributor Author

MagnusMikkelsen commented Jul 6, 2020

@MagnusMikkelsen, thank you for the investigation 👍

You're welcome - I'm having fun 😁

just use lock and plan for perf improvements separately when needed.

I think I agree. In case there are many scenarios out there, that uses the same instance of Fixture to generate the same patterns many times - a major improvement would be to cache Xeger.

Just ran a small experiment in linqpad. Creating 1000 Xeger and generating regex takes 2 seconds. Reusing a Xeger to generate 1000 regex took ca 40ms.

@zvirja
Copy link
Member

zvirja commented Jul 6, 2020

@MagnusMikkelsen Thanks for the rework and fixing the concurrency issue. I fully agree that now we can proceed further with the current approach and merge the code. You are guarding by lock only random seed generation, so it sounds reasonable. My concern was about protecting the whole Create() method.

Just in case you are curious, my idea was to have something like:

static private int RandomSeed = Environment.TickCount;
...

new Xeger(new Random(Interlocked.Add(ref RandomSeed, 1)))

This way code is completely lock-free and you are always giving a new seed to random. I believe in this case it doesn't matter whether we just increment seed by one, or use completely random input - the Random class should randomize it anyway.

Anyway, it was just in case you are curious and I'm fine with the current implementation. Let me know when you want me to merge the code.

Copy link
Member

@zvirja zvirja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! 👍

@moodmosaic
Copy link
Member

Just in case you are curious, my idea was to have something like:

new Xeger(new Random(Interlocked.Add(ref RandomSeed, 1)))

From a thread-safety viewpoint, this looks reasonable. In the context of pseudorandom number generators, this must be sampled in order to make sure it'll produce output that is uniformly distributed random and birthday paradox-free, at the very least.

@MagnusMikkelsen
Copy link
Contributor Author

Given @moodmosaic comment, shouldn't we just merge it as is @zvirja ?

@zvirja
Copy link
Member

zvirja commented Jul 8, 2020

@moodmosaic The same should be said about the current approach, shouldn't it? 😉 If none of the approaches was tested to prove the statistical quality of the resulting random sequence, we could just ignore this aspect, assuming that both approaches are equal. And with that we could solely focus on performance aspect.

Anyway, merging the code! Thanks again @MagnusMikkelsen for your help!

@zvirja zvirja merged commit d2fa483 into AutoFixture:master Jul 8, 2020
@moodmosaic
Copy link
Member

Yeah, but at least right we're following the 'pattern' so we're at least consistent.

@zvirja
Copy link
Member

zvirja commented Jul 8, 2020

@aivascu I suggest to drop a new release, so the fix could be consumed. Let me know if you need my assistance on this ;-)

@moodmosaic
Copy link
Member

I suggest to drop a new release, so the fix could be consumed

👍 🎊 🚀


Thanks @MagnusMikkelsen for your work and @zvirja for the great feedback 🍻

@aivascu
Copy link
Member

aivascu commented Jul 9, 2020

@zvirja the release 4.13.0 has been deployed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Creating multiple regular expression requests in rapid succession yields same result (.NET framework)
4 participants