Make Uniform<f32/f64> and gen_range<f32/f64> honor limits strictly. Resolves #476 #477

Merged: 3 commits merged into rust-random:master from sicking:limit_inexact_float on Jul 12, 2018

Conversation

@sicking (Contributor) commented May 25, 2018

This adds a small amount of overhead to Uniform<f32/f64> and gen_range<f32/f64>. For Uniform<f32/f64> it adds one subtraction and one branch in the common case. For gen_range<f32/f64> it adds just one branch in the common case. I see no way around this while also being sure to stay in the requested range.

This does use f64::from_bits/to_bits/is_infinite/is_finite, which might only be available through std? Though it looks like core::num also makes them available here? I can't tell, given that the functions are marked as stable but the trait is marked unstable.

I'm not sure what the acceptable build requirements are for no_std builds. And is there any way to run cargo test, or even just cargo build, in no_std mode?

@sicking (Contributor, Author) commented May 25, 2018

I guess Travis answered the last paragraph...

@sicking (Contributor, Author) commented May 25, 2018

Cool. Seems like that takes care of the no_std problems.

@pitdicker (Contributor):

I understand your motivation, but I am not sure I agree with the method.

It is not impossible to make a half-open range exact, i.e. to guarantee low can occur and high never does. It could be done with a check when constructing the range that low + scale * (1 - EPSILON) (or some variant; I haven't looked closely at the code yet) is less than high, and otherwise adjusting scale (a rough sketch follows below).

That could be faster for many samples, but slower for few or only one.

Can you measure one thing: what do the range benchmarks do with the extra loop?
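
A rough sketch of that construction-time adjustment, assuming a scalar f64-only version (the real code is generic over f32/f64 via macros; the function name here is hypothetical, and max_rand assumes a sampler that fills all 52 mantissa bits):

// Hypothetical scalar sketch; not the PR's actual generic implementation.
fn new_range_parts(low: f64, high: f64) -> (f64, f64) {
    assert!(low < high, "Uniform::new called with `low >= high`");
    // Largest value the sampler can produce in [0, 1): the maximum of
    // `value1_2` is 2 - EPSILON, so the maximum of `value1_2 - 1.0`
    // is 1 - EPSILON.
    let max_rand = 1.0 - f64::EPSILON;
    let mut scale = high - low;
    // Nudge `scale` down one ULP at a time until rounding can no longer
    // push a sample up to `high`. This terminates: once scale reaches
    // 0.0 the condition reads `low >= high`, which is false.
    while scale * max_rand + low >= high {
        scale = f64::from_bits(scale.to_bits() - 1);
    }
    (low, scale)
}

fn main() {
    let (low, scale) = new_range_parts(0.0, 10.0);
    assert_eq!(low, 0.0);
    assert!(scale <= 10.0);
}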

@dhardy (Member) commented May 25, 2018

This feels premature to me since (a) it is difficult in general to ensure strict limits with FP, so it is often better to design algorithms to accommodate (small) deviations, and (b) not all users care about strict compliance anyway. That said, if there is very little (performance/maintenance) overhead then this may be reasonable, but given all the extra code I doubt that is the case.

@sicking (Contributor, Author) commented May 25, 2018

Given the low performance cost here I think it's worth providing the guarantee that the generated values are within the requested range.

I've implemented @pitdicker's suggestion to adjust scale in UniformFloat::new / UniformFloat::new_inclusive, which means that the performance cost when sampling from a range is just one floating-point subtraction, and the cost when using Rng::gen_range is just one branch.

I'll pull numbers to see if this is visible in our benchmarks.

@sicking (Contributor, Author) commented May 26, 2018

So I ran some performance numbers. At first distr_uniform_f32 showed a 6.0% slowdown, and distr_uniform_f64 a 3.8% slowdown, which seems quite acceptable to me.

Then I added gen_range_f32 and gen_range_f64, using the exact same code as in gen_range_i8. Here gen_range_f32 showed almost a 2x speedup with this PR, while gen_range_f64 showed almost a 2x slowdown with the PR. No idea what's going on there!

Also confused why some builds seem unable to find the black_box function, even though it's used just fine elsewhere in the same file. Will look into that after the weekend.

@pitdicker (Contributor):

Sorry, I've been a bit slow this weekend. Will have a better look tomorrow.

We had trouble with confusing benchmarks in combination with XorShiftRng before. The assembly is even identical, yet it can still perform very differently... 🙁

The best thing to do then is to try again with StdRng. Benchmarks before:

test gen_range_f32               ... bench:       3,388 ns/iter (+/- 54) = 1180 MB/s
test gen_range_f64               ... bench:       5,000 ns/iter (+/- 317) = 1600 MB/s

After:

test gen_range_f32               ... bench:       3,800 ns/iter (+/- 5) = 1052 MB/s
test gen_range_f64               ... bench:       5,412 ns/iter (+/- 4) = 1478 MB/s

So the difference in both benchmarks is 0.412 ns per sample. distr_uniform_f32 was 1.8 ns/iter (with XorShift) on my machine, so that's about 22% overhead. For the f64 version, about 16%.

> Also confused why some builds seem unable to find the black_box function, even though it's used just fine elsewhere in the same file. Will look into that after the weekend.

I find it strange also, but would not worry about it. If you rebase on master you'll see all the black_box-es are unnecessary now.

@sicking (Contributor, Author) commented Jun 4, 2018

Yeah, using StdRng I get similar numbers:

Before:

test gen_range_f32               ... bench:       2,870 ns/iter (+/- 62) = 1393 MB/s
test gen_range_f64               ... bench:       4,213 ns/iter (+/- 82) = 1898 MB/s
test distr_uniform_f32           ... bench:       1,497 ns/iter (+/- 44) = 2672 MB/s
test distr_uniform_f64           ... bench:       2,099 ns/iter (+/- 56) = 3811 MB/s

After:

test gen_range_f32               ... bench:       3,781 ns/iter (+/- 198) = 1057 MB/s
test gen_range_f64               ... bench:       4,940 ns/iter (+/- 415) = 1619 MB/s
test distr_uniform_f32           ... bench:       1,614 ns/iter (+/- 31) = 2478 MB/s
test distr_uniform_f64           ... bench:       2,188 ns/iter (+/- 30) = 3656 MB/s

So a 32% slowdown for gen_range_f32, and a 17% for gen_range_f64.

But for the more performance-optimized API of creating a Uniform<f32/f64> the slowdown is 7.8% for distr_uniform_f32 and 4.2% for distr_uniform_f64.

Definitely more slowdown than desired for gen_range, presumably due to the added branch getting mispredicted, but I think the added correctness is worth it.

@sicking (Contributor, Author) commented Jun 4, 2018

As a test, I tried adding branch-predictor hints to the gen_range code path, and that significantly brought down the overhead. Now I get:

test gen_range_f32               ... bench:       3,533 ns/iter (+/- 70) = 1132 MB/s
test gen_range_f64               ... bench:       4,447 ns/iter (+/- 74) = 1798 MB/s

Which is a 23% slowdown for gen_range_f32 and 5.5% slowdown for gen_range_f64. Unfortunately the branch-predictor hinting API isn't at all stable yet, so not something we can use right now.

@dhardy (Member) left a comment:

I'm still not sure whether to accept this PR though it may be a good choice, especially in light of #494. However, a few changes are needed.

offset: offset,
assert!(low.is_finite() && high.is_finite(), "Uniform::new called with non-finite boundaries");

let mut scale = high - low;
@dhardy (Member):

This can overflow to infinity even if both inputs are finite. Skip the above check (you already know neither value is NaN) and just assert scale < INFINITY.

Also means you can simplify your Float trait.

@sicking (Contributor, Author):

I intentionally didn't want to panic when high and low are both finite numbers, even if high - low results in a non-finite number. Mainly because it felt like a lot to ask users to know when that difference might overflow to infinity. But also a little because relying on the difference being expressible as an $ty is technically an implementation detail; another implementation of this function might not have that requirement.

Either way, the code below deals fine with scale becoming infinite here.

let mut scale = high - low;

let max_rand = (::core::$uty::MAX >> $bits_to_discard)
.into_float_with_exponent(0) - 1.0;
@dhardy (Member):

I believe this is 1-ε/2 (i.e. the largest representable number < 1), but the name makes sense.

}

debug_assert!(scale >= 0.0);
@dhardy (Member):

If high is the next representable number above low, then scale is effectively zero (although values larger than zero may still give the same rounding). Probably, though, we should accept zero.

Note that your while loop would take a long time to get to zero since high - low may be much larger than the minimal representable value, though it likely doesn't matter since it should encounter a value of scale small enough that low + scale rounds to low much sooner.

@sicking (Contributor, Author):

Yeah, I'm banking on scale * max_rand + low > high only happening in situations with unfortunate rounding, so we should only need small adjustments to scale to get it within the acceptable range. Especially since max_rand is already < 1.

.into_float_with_exponent(0) - 1.0;

while scale * max_rand + low >= high {
scale = <$ty>::from_bits(scale.to_bits() - 1);
@dhardy (Member):

Is this always correct when the fraction part is zero, and does it handle sub-normals correctly? I think so. In any case this feels unsafe to me — better to have a single next_smallest_value() method in your Float trait.

@sicking (Contributor, Author):

Yeah, this will work as long as we don't cross the line from positive to negative numbers. That includes when it results in reducing the exponent, and in the sub-normal range.

IEEE numbers were designed specifically so that, for non-NaN numbers, you can check which of two numbers is bigger by comparing their binary representations as if they were integers. You have to filter out the sign bit first, but since we're dealing with positive numbers that won't be a problem.

It even works for infinity, which I am relying on here.
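
A standalone demonstration of that property (not code from this PR; `next_down` is a hypothetical helper name):

// For positive, non-NaN floats, decrementing the bit pattern yields
// the next smaller representable value.
fn next_down(x: f64) -> f64 {
    debug_assert!(x > 0.0); // positive and non-NaN
    f64::from_bits(x.to_bits() - 1)
}

fn main() {
    // Crosses an exponent boundary: 1.0 steps to the largest f64 below 1.0.
    assert_eq!(next_down(1.0), 1.0 - f64::EPSILON / 2.0);
    // Crosses from the smallest normal number into the subnormal range.
    assert!(next_down(f64::MIN_POSITIVE) > 0.0);
    // Even works for infinity, as relied upon above.
    assert_eq!(next_down(f64::INFINITY), f64::MAX);
}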

// as it could be, we'll still generate good values.
// But give it one try to increase our chances that
// Uniform::new_inclusive is actually inclusive of the high value.
let new_scale = <$ty>::from_bits(scale.to_bits() + 1);
@dhardy (Member):

I don't think there's sufficient justification for this code — it's extra complexity to slightly improve the range in a case where it already meets all required constraints.

@sicking (Contributor, Author):

I don't have a strong opinion. But without this code it seems like there is very little practical difference between new and new_inclusive, i.e. it's very unlikely that the high value can actually be generated. With this code it's arguably still very unlikely, but at least we give it a try.

But again, I'm not super opinionated on this. I'm not sure what the use case is for inclusive ranges for floats, but obviously the API requires it, so we need to stick something in this function.

@dhardy (Member):

Are there many cases where this code will do something useful?

@sicking (Contributor, Author):

I'm not sure how to measure usefulness given that I don't know of use cases for inclusive ranges for floats.

Given that there was a comment in the code which pointed out that new and new_inclusive generated the same values, do we have any use cases in mind where that is a problem?

@sicking (Contributor, Author):

Another approach here could be to implement new_inclusive as

let max_rand = (::core::$uty::MAX >> $bits_to_discard)
               .into_float_with_exponent(0) - 1.0;

let mut scale = (high - low) / max_rand;

// Adjust scale lower until the max value we can generate is `high`
while scale * max_rand + low > high {
    scale = <$ty>::from_bits(scale.to_bits() - 1);
}

debug_assert!(scale >= 0.0);

UniformFloat { low, scale }

I don't know if the / max_rand actually makes a difference. But from a mathematical point of view it at least means that we increase the chance of being able to generate high.

@dhardy (Member):

I think 1.0 / max_rand > 1.0 so it probably does do something. Can the compiler optimise to multiplication or would it be better to write (high - low) * (1.0 / max_rand)?

The implementation is simpler at least — unless the current version has a clear advantage this looks preferable.

@sicking (Contributor, Author):

Let's use this approach rather than the one I'm using right now.

Is the loss of precision from writing (high - low) * (1.0 / max_rand) really worth making the ctor faster? I.e., is ctor performance really a priority?

@dhardy (Member) commented Jun 9, 2018:

I don't know. Possibly not, but I think new_inclusive may get used for single-usage objects quite a bit. But anyway, we're not really losing precision since the while loop then reduces scale.


// Get a value in the range [0, 1) in order to avoid
// overflowing into infinity when multiplying with scale
let value0_1 = value1_2 - 1.0;
@dhardy (Member):

This loses a bit of precision everywhere just to handle scales close to what would round to infinity. I'm not sure it's worth it; we could instead assert that scale * 2 is finite in the constructor.

@sicking (Contributor, Author):

It shouldn't lose any precision: subtracting 1.0 should not result in any rounding or loss of data. In fact, it looks to me like it should gain a tiny amount of precision, since we're directly adding low rather than the calculated (and thus possibly slightly rounded) offset.

But it does lose performance.

Either way, not having this here is certainly possible, but I think it'll require a decent amount more code in the constructor. We'll have to worry about not just adjusting the scale, but also adjusting offset such that rounding can't result in values lower than low.

So it's a tradeoff of slower and more code complexity in the constructor, vs. saving around 8% in the sampling code.
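
For what it's worth, the no-rounding claim can be spot-checked directly: by Sterbenz's lemma, a - b is exact whenever b/2 <= a <= 2b, which covers subtracting 1.0 from any value in [1, 2). A standalone sketch:

// Spot check: subtracting 1.0 from an f64 in [1, 2) loses no information.
fn main() {
    for frac in [0u64, 1, 0x8000_0000_0000, 0xF_FFFF_FFFF_FFFF] {
        // Build an f64 in [1, 2): the exponent field of 1.0 plus `frac`
        // as the 52-bit mantissa.
        let value1_2 = f64::from_bits(1023u64 << 52 | frac);
        let value0_1 = value1_2 - 1.0;
        // The subtraction is exact, so adding 1.0 back round-trips.
        assert_eq!(value0_1 + 1.0, value1_2);
    }
}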

@dhardy (Member):

Sorry, you're right about the precision.

I don't know what's the right answer regarding performance.


// Get a value in the range [0, 1) in order to avoid
// overflowing into infinity when multiplying with scale
let value0_1 = value1_2 - 1.0;
@dhardy (Member):

Again, you're dropping a bit of precision everywhere. So why not do value1_2 * scale + offset as before, then at the end of the loop check whether res is finite and panic if not?

@sicking (Contributor, Author):

I might be missing something, but I see this only costing performance, not any precision. But do let me know if I'm wrong?

In this case it should really cost very little performance, though. While we need an additional subtraction here, we also save a subtraction since we're not calculating offset. For constant inputs, offset and scale can be computed at compile time; but arguably someone who cares about performance and uses constant ranges should create a Uniform.

Additionally, using the offset approach means that we have to check the resulting value both against high and against low, so the resulting code is slower. Here's the perf numbers I get:

Using value1_2 * scale + offset

test gen_range_f32               ... bench:       4,835 ns/iter (+/- 857) = 827 MB/s
test gen_range_f64               ... bench:       6,896 ns/iter (+/- 641) = 1160 MB/s

Using value0_1 * scale + low

test gen_range_f32               ... bench:       4,335 ns/iter (+/- 251) = 922 MB/s
test gen_range_f64               ... bench:       5,712 ns/iter (+/- 916) = 1400 MB/s

@dhardy added the D-review (Do: needs review) and D-changes (Do: changes requested) labels on Jun 8, 2018
@sicking force-pushed the limit_inexact_float branch 2 times, most recently from a4c6b65 to 8828cad, on June 8, 2018
@sicking (Contributor, Author) commented Jun 8, 2018

Here's an alternative approach for implementing UniformFloat::new which would let us keep Uniform::sample as-is and avoid the perf-hit.

However, it seems to hang for really large values of high. Since both offset and scale depend on both high and low, I think in some cases the algorithm below ends up adjusting the wrong edge. Most likely this is related to scale and/or offset ending up infinite and/or NaN.

I'm sure it can be fixed, but ultimately I don't think the complexity and precision loss is worth it. And the algorithm here has effectively unbounded performance cost, which is also a concern.

So I think we should stick with the approach in the PR.

fn new(low: Self::X, high: Self::X) -> Self {
    assert!(low < high, "Uniform::new called with `low >= high`");
    assert!(low.is_finite() && high.is_finite(), "Uniform::new called with non-finite boundaries");

    fn adjust_down(val: $ty) -> $ty {
        if val > 0.0 {
            <$ty>::from_bits(val.to_bits() - 1)
        } else if val < 0.0 {
            <$ty>::from_bits(val.to_bits() + 1)
        } else {
            -<$ty>::from_bits(1)
        }
    }

    let max_rand = (::core::$uty::MAX >> $bits_to_discard)
                   .into_float_with_exponent(0);
    let min_rand = (0 as $uty >> $bits_to_discard)
                   .into_float_with_exponent(0);

    let mut low_adjusted = low;
    let mut high_adjusted = high;
    loop {
        let scale = high_adjusted - low_adjusted;
        let offset = low_adjusted - scale;
        if !(scale * max_rand + offset < high) {
            high_adjusted = adjust_down(high_adjusted);
        } else if !(scale * min_rand + offset >= low) {
            low_adjusted = -adjust_down(-low_adjusted);
        } else {
            return UniformFloat { scale, offset };
        }
    }
}

scale: scale,
offset: offset,
assert!(low <= high, "Uniform::new_inclusive called with `low > high`");
assert!(low.is_finite() && high.is_finite(), "Uniform::new_inclusive called with non-finite boundaries");
@dhardy (Member) commented Jun 9, 2018:

You can reduce the bounds checks a bit: (NEG_INFINITY < low) & (low <= high) & (high < INFINITY). Or maybe && but in theory that can short-circuit which isn't useful here.

@sicking (Contributor, Author):

Oh, that's a great idea! But I don't understand why we wouldn't want to short-circuit?

@dhardy (Member):

Because it might prevent the processor testing multiple things in parallel, and because the common case is that all these tests pass. I don't know; it might be optimisable to the same thing anyway.
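
For illustration, the two styles look like this for plain f64 (a sketch; the PR's real code goes through its own generic float trait):

// Comparing the two bounds-check styles discussed above.
fn check_bounds(low: f64, high: f64) {
    // Short-circuiting `&&`: later tests only run if earlier ones pass.
    assert!(low.is_finite() && high.is_finite() && low <= high);
    // Non-short-circuiting `&`: all three comparisons can be evaluated
    // independently, and a NaN bound fails every comparison.
    assert!((f64::NEG_INFINITY < low) & (low <= high) & (high < f64::INFINITY));
}

fn main() {
    check_bounds(0.0, 1.0); // passes
    // check_bounds(f64::NAN, 1.0); // would panic under either style
}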

@sicking (Contributor, Author):

So it turns out that the old approach is actually slightly faster:

&
test gen_range_f32               ... bench:       4,459 ns/iter (+/- 169) = 897 MB/s
test gen_range_f64               ... bench:       6,027 ns/iter (+/- 334) = 1327 MB/s

&&
test gen_range_f32               ... bench:       4,464 ns/iter (+/- 163) = 896 MB/s
test gen_range_f64               ... bench:       6,038 ns/iter (+/- 592) = 1324 MB/s

old
test gen_range_f32               ... bench:       4,382 ns/iter (+/- 135) = 912 MB/s
test gen_range_f64               ... bench:       5,665 ns/iter (+/- 682) = 1412 MB/s

Technically none of these differences are statistically significant; however, it definitely looks like the last one is faster for f64. And it does provide better error messages.

My best guess is that the old approach can be implemented using bit logic, whereas the NEG_INFINITY < low check has to be done using slower floating-point instructions.

@sicking force-pushed the limit_inexact_float branch 2 times, most recently from 3b1ab6e to c1cbf32, on June 15, 2018
@sicking force-pushed the limit_inexact_float branch 2 times, most recently from 28b3fbf to 42f79af, on July 2, 2018
@sicking (Contributor, Author) commented Jul 2, 2018

I've rebased this on top of the newly landed SIMD code, which was definitely non-trivial. The good news is that it's enforcing the limits for both primitives and SIMD types.

I also found a way to make sample_single faster, so it should be less of a perf-hit compared to what we're currently doing.

I'd really like to get this landed though, especially since weighted sampling has landed and currently risks panicking randomly if you use f32/f64 weights.

@sicking force-pushed the limit_inexact_float branch 2 times, most recently from 47f2998 to e487112, on July 2, 2018
@sicking (Contributor, Author) commented Jul 2, 2018

So I ran some performance numbers, and got quite strange results.

Basically if I remove the

assert!(low.all_finite() && high.all_finite(),
        "Uniform::sample_single called with non-finite boundaries");

then I see no performance degradation for gen_range. That makes some sense, since the new implementation has the same number of operations apart from an extra branch in the common case; so if the branch is well predicted there should be no performance hit.

However, if I add back the above code, then I see a 50% slowdown, even though the code never runs during the benchmark: in the new implementation it only runs if high - low rounds up to infinity.

I suspect that the problem is related to when the optimizer decides to inline some critical function, but sprinkling #[inline(always)] everywhere doesn't seem to make a difference as long as the above code is there.

But I suspect this problem is one of the realities of using micro-benchmarks, i.e. I don't think real-world usage of gen_range would see the same issue. Though it may well still be that the new code doesn't inline as well since it's bigger.

In theory we could break out some of the code into a separate non-inlined function. But since it'd lead to significant code duplication, I don't think it's worth it. Especially since we don't know how a future version of rustc will optimize these things.

@sicking (Contributor, Author) commented Jul 2, 2018

Oh, and I should say that I got these numbers using StdRng, so it's not XorShiftRng-weirdness.

@dhardy (Member) commented Jul 2, 2018

Perhaps you could simplify your test?

assert!(scale.all_finite(), "Uniform::sample_single called with non-finite boundaries");

@sicking (Contributor, Author) commented Jul 2, 2018

Unfortunately scale can be infinite even if high and low are finite. That can happen when low is a large negative value and high is a large positive value: high - low can then be larger than the maximum finite representable value and round to infinity.
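
A concrete standalone example of that overflow:

// Finite bounds whose difference rounds to infinity.
fn main() {
    let low = f32::MIN;   // most negative finite f32, about -3.4e38
    let high = f32::MAX;  // largest finite f32, about 3.4e38
    assert!(low.is_finite() && high.is_finite());
    let scale = high - low; // true value is 2 * f32::MAX, which overflows
    assert!(scale.is_infinite());
}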

@sicking (Contributor, Author) commented Jul 2, 2018

I should also mention that the CompareAll trait is pretty poorly named with the changes in this PR. It's probably worth giving it a better name and moving the CastFromInt::cast_from_int function into this new trait.

@dhardy (Member) commented Jul 2, 2018

But do we really care about "nearly infinite" ranges? If someone really is treading so close to the limits of f32 / f64 they should probably reconsider their approach anyway (besides which, your adjust_down function may take an age to find a suitable result which may not be close to the original). At least try benchmarking with this check.

@sicking (Contributor, Author) commented Jul 2, 2018

I tried just checking scale for infinity, but the performance result is similar to checking high and low. A few percent faster, but it was in the noise.

I'm also a bit hesitant to change the API to work around optimizer quirks which could change with so many parameters, like future versions of rustc, what the calling code does, and which rng is used.

FWIW, I just tried using XorShiftRng, which is what our benchmarks normally use, and I don't see the same issue. It does still make a difference if the checks are there or not, but there's no slowdown compared to current HEAD. (Though using XorShiftRng and without the infinity checks, the new code is 50% faster for f32. 🤷‍♂️)

@dhardy (Member) commented Jul 12, 2018

Okay, I think it's time I stopped opposing this based on performance overhead; it's barely above the noise threshold (notice how unrelated numbers change here):

# before
test gen_range_f32               ... bench:       4,259 ns/iter (+/- 138) = 939 MB/s
test gen_range_f64               ... bench:       4,034 ns/iter (+/- 148) = 1983 MB/s
test gen_range_i16               ... bench:       5,009 ns/iter (+/- 352) = 399 MB/s
test gen_range_i32               ... bench:       2,892 ns/iter (+/- 81) = 1383 MB/s
test gen_range_i64               ... bench:       6,663 ns/iter (+/- 177) = 1200 MB/s
test gen_range_i8                ... bench:       4,802 ns/iter (+/- 87) = 208 MB/s
# after
test gen_range_f32               ... bench:       4,806 ns/iter (+/- 302) = 832 MB/s
test gen_range_f64               ... bench:       4,523 ns/iter (+/- 197) = 1768 MB/s
test gen_range_i16               ... bench:       4,794 ns/iter (+/- 100) = 417 MB/s
test gen_range_i32               ... bench:       2,997 ns/iter (+/- 458) = 1334 MB/s
test gen_range_i64               ... bench:       6,632 ns/iter (+/- 223) = 1206 MB/s
test gen_range_i8                ... bench:       5,033 ns/iter (+/- 62) = 198 MB/s

Is this ready?

Please rebase, and please separate the new benchmarks into a separate commit (so it's easier to do before/after comparisons).

@sicking (Contributor, Author) commented Jul 12, 2018

The only question I have remaining is whether we want to merge the CompareAll and CastFromInt traits, and give the resulting trait a better name, like FloatUtils or FloatSIMDHelper or some such.

@dhardy (Member) left a comment:

I don't care much about renaming (these are only internal, right?) but it's probably better to keep the two traits separate if there's no reason to merge.

#[inline(always)]
fn extract(self, index: usize) -> Self { assert_eq!(index, 0); self }
#[inline(always)]
fn replace(self, index: usize, new_value: Self) -> Self { assert_eq!(index, 0); new_value }
@dhardy (Member):

I think it would be more appropriate to use debug_assert here?

@sicking (Contributor, Author):

I was trying to replicate the behavior of these functions on real SIMD types, hoping that the compiler would optimize the checks away.

Ultimately it doesn't really matter right now; these functions are only used in tests.

Happy to go either way.

I did change decrease_masked, though, so that it doesn't cast the bool flag to an int. Instead I made it part of the contract that at least one "lane" must be set, and added debug_asserts to ensure that. That way we can always subtract 1. The cast might have been optimized away anyway, but we might as well be safe.

@sicking (Contributor, Author) commented Jul 12, 2018

> I don't care much about renaming (these are only internal, right?) but it's probably better to keep the two traits separate if there's no reason to merge.

Yeah, they're only used internally. It's slightly less code to merge them, so I added a commit for that. But I kept the commit separate so feel free to take it or not.

@dhardy merged commit 515cf18 into rust-random:master on Jul 12, 2018
@dhardy (Member) commented Jul 12, 2018

> so I added a commit for that

Sneaky! But it saves more code than I realised due to all the import rules.

@sicking deleted the limit_inexact_float branch on July 12, 2018 at 15:19
@sicking (Contributor, Author) commented Jul 12, 2018

> Sneaky!

Sorry, I intended to provide the option, not to sneak it in.
