
WeightedIndex: Make it possible to update a subset of weights #866

Merged
merged 9 commits into from Aug 22, 2019

Conversation

@vks (Collaborator) commented Aug 14, 2019

This could be useful for crates like droprate.

I had to add an additional field, total_weight, to WeightedIndex. It is redundant with the field weight_distribution, but I cannot use the latter without making the end of the sampled range public.
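A minimal, hypothetical sketch of the layout being described (simplified stand-in names, not rand's actual code): the cumulative table omits the final sum, so the total has to live in its own field for sampling to know its range.

```rust
// Simplified sketch of the struct layout discussed in this PR
// (names and types are illustrative, not rand's real definitions).
struct WeightedIndexSketch {
    cumulative_weights: Vec<u32>, // cumulative sums, with the final total omitted
    total_weight: u32,            // redundant with the sampling range, hence the extra field
}

impl WeightedIndexSketch {
    fn new(weights: &[u32]) -> Self {
        let mut cumulative = Vec::with_capacity(weights.len().saturating_sub(1));
        let mut total = 0;
        for (i, &w) in weights.iter().enumerate() {
            total += w;
            if i + 1 < weights.len() {
                cumulative.push(total); // the last sum (== total) is not stored
            }
        }
        WeightedIndexSketch { cumulative_weights: cumulative, total_weight: total }
    }
}
```

For weights `[2, 3, 5]` this stores `[2, 5]` and a total of 10; the 10 appears only in `total_weight`.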

@dhardy (Member) left a comment:

Sure, we can add this.

return Err(WeightedError::InvalidWeight);
}
if i >= self.cumulative_weights.len() {
return Err(WeightedError::TooMany);
@dhardy (Member):

It would be worth adding InvalidIndex, except that it's a breaking change. Perhaps do so in a separate PR that we don't land until we start preparing the next Rand version?

@vks (Collaborator, Author):

Yeah, I thought about this as well. Will do once this is merged.

src/distributions/weighted/mod.rs (outdated; resolved)
old_w -= &self.cumulative_weights[i - 1];
}

for j in i..self.cumulative_weights.len() {
@dhardy (Member):

This is O(n*m) where n = cumulative_weights.len() - min_index; m = new_weights.len().

Instead we should sort the new_weights by index, then apply in-turn (like in new); this is O(m*log(m) + n).

Also, we can just take total_weight = cumulative_weights.last().unwrap().
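A sketch of the suggested approach, under the assumption that updates arrive as (index, new_weight) pairs (names hypothetical): sort the updates by index, then rebuild the cumulative table in a single pass, consuming each update when its index is reached.

```rust
// Sketch (not rand's implementation): apply sorted (index, new_weight)
// updates while rebuilding the cumulative table in one pass.
// Sorting is O(m log m); the rebuild is O(n), so O(m log m + n) overall.
fn apply_updates(weights: &mut [u32], updates: &mut [(usize, u32)]) -> (Vec<u32>, u32) {
    updates.sort_unstable_by_key(|&(i, _)| i);
    let n = weights.len();
    let mut next = updates.iter().peekable();
    let mut cumulative = Vec::with_capacity(n.saturating_sub(1));
    let mut total = 0;
    for (i, w) in weights.iter_mut().enumerate() {
        // Consume the update aimed at index i, if there is one.
        if let Some(&&(j, new_w)) = next.peek() {
            if j == i {
                *w = new_w;
                next.next();
            }
        }
        total += *w;
        if i + 1 < n {
            cumulative.push(total); // final weight omitted, as in the PR
        }
    }
    (cumulative, total)
}
```

For weights `[1, 2, 3, 4]` and updates `[(3, 1), (1, 5)]`, this yields weights `[1, 5, 3, 1]`, a cumulative table `[1, 6, 9]`, and a total of 10.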

@vks (Collaborator, Author):

> Instead we should sort the new_weights by index, then apply in-turn (like in new); this is O(m*log(m) + n).

I'll look into this.

> Also, we can just take total_weight = cumulative_weights.last().unwrap().

I don't think so; the last cumulative weight is not stored in the vector. Or are you saying we should change it so that it is?

@dhardy (Member):

Aha, binary_search_by is happy to return an index one past the last item, therefore the final weight is not needed. (And we have a motive for not including the final weight: it guarantees we will never exceed the last index of the input weights list.)

Then yes, we need to store either the last weight or the total as an extra field.
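A small sketch of why the omitted final weight is safe, using std's `partition_point` in place of the `binary_search_by` mentioned above (same effect on a sorted cumulative table):

```rust
// Sketch: with the final cumulative sum omitted, a search for any sample
// x in 0..total returns an index in 0..weights.len(). The "one past the
// table" result is exactly the last valid weight index, so the search can
// never exceed the input range.
fn sample_index(cumulative: &[u32], x: u32) -> usize {
    // Count of stored cumulative sums <= x; may be cumulative.len(),
    // i.e. one past the stored entries, which maps to the last weight.
    cumulative.partition_point(|&c| c <= x)
}
```

With weights `[2, 3, 5]` the stored table is `[2, 5]` (total 10, not stored): samples 0..2 map to index 0, 2..5 to index 1, and 5..10 to index 2, even though no entry for 10 exists.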

@vks (Collaborator, Author):

> Instead we should sort the new_weights by index, then apply in-turn (like in new); this is O(m*log(m) + n).

I implemented that. It's a bit messy, because the index type might be unsigned.

@dhardy (Member) left a comment:

Yes, that did get messy! Eventually I convinced myself that your implementation is probably right.

Fortunately we can clean it up a lot (at the cost of two clones and one extra subtraction per unadjusted weight). I think the clones will be cheap for all the types we care about. Granted, this is probably slower than your method when only updating a small subset of many indices, but not hugely so, and it's still O(n + m).

I'll leave it to your preference to require ordered input vs sorting.

Finally, do we need two loops? Only if we care about not changing self when given invalid parameters.

src/distributions/weighted/mod.rs (outdated; resolved)
src/distributions/weighted/mod.rs (outdated; resolved)
@vks (Collaborator, Author) commented Aug 16, 2019

> I'll leave it to your preference to require ordered input vs sorting.

I think it is better to require sorted input, because usually it's trivial for the user to provide.

> Finally, do we need two loops? Only if we care about not changing self when given invalid parameters.

The problem is that this would leave self in an invalid state, which I wanted to avoid. (This would not be a problem if we just panicked.)
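A sketch of the two-loop structure under discussion, simplified to plain weights (illustrative names, not rand's code): the first loop only validates, so an error returns before anything is mutated and the state can never end up half-updated.

```rust
// Error type mirroring the variant used in the PR's snippet above.
#[derive(Debug, PartialEq)]
enum WeightedError {
    TooMany,
}

// Sketch of the validate-then-apply pattern: loop 1 checks every update
// without mutating, so on error the weights are untouched; loop 2 applies
// updates that are now known to be in range.
fn update_weights(weights: &mut [u32], updates: &[(usize, u32)]) -> Result<(), WeightedError> {
    // Loop 1: validate all updates before touching anything.
    for &(i, _) in updates {
        if i >= weights.len() {
            return Err(WeightedError::TooMany);
        }
    }
    // Loop 2: apply; all indices are valid at this point.
    for &(i, w) in updates {
        weights[i] = w;
    }
    Ok(())
}
```

With a single loop, a bad index midway through the updates would leave earlier entries already modified; the two-loop version trades one extra pass for that all-or-nothing guarantee.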

@vks (Collaborator, Author) commented Aug 16, 2019

I simplified the code as you suggested. The performance seems similar enough.

@dhardy (Member) commented Aug 17, 2019

Thanks; then I think this is good to go. I won't have very much time available for this for a few weeks, so I'll leave you to merge.

@vks (Collaborator, Author) commented Aug 18, 2019

@dhardy Unfortunately, I'm not authorized to merge.

@dhardy (Member) commented Aug 22, 2019

One timeout, one Redox failure. Good enough, I guess.

@dhardy dhardy merged commit 8616945 into rust-random:master Aug 22, 2019
@vks vks deleted the update-weights branch August 22, 2019 11:42