Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New constructor with blocks #42

Merged
merged 4 commits into from Apr 16, 2020
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
63 changes: 63 additions & 0 deletions src/lib.rs
Expand Up @@ -67,6 +67,34 @@ impl FixedBitSet
length: bits,
}
}
/// Create a new **FixedBitSet** with a specific number of bits,
/// initialized from provided blocks.
///
/// If the blocks are not the exact size needed for the capacity
/// they will be padded with zeros (if shorter) or truncated to
/// the capacity (if longer).
pub fn with_capacity_and_blocks<I: IntoIterator<Item=Block>>(bits: usize, blocks: I) -> Self
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any "format" issue that we can't check with an assertion?

the doc comment needs update, also mentioning when and why construction would panic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be just as good to accept an IntoIterator of Blocks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, this is how I'm using this PR: https://github.com/dib-lab/sourmash/blob/3c7de27fdfc23beedaa5173f370c93cb80a5a722/src/core/src/sketch/nodegraph.rs#L226L249

I thought the "format" issue would raise from using LittleEndian or BigEndian to build the blocks, but that's not really relevant to fixedbitset (because it only cares about it being u32), and it's only an issue when serializing the data from a buffer (a file, in my case).

On the panic side: I think failing makes sense, but a panic might be too harsh. Alternatives are:

  • not failing. Ignore extra blocks if too big, or put empty blocks if too short
  • changing the method to return a Result instead (but all current methods panic, so maybe this is not an issue).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panic fits the common pattern of out of bounds => panic.

But you are right, if it's too short, we could just fill out with the right number of blocks and if too long, we could just save those as extra capacity (this case needs a brief walk through the code to check if that's compatible) - sounds like a no-panic solution can work!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code looks a bit noisy, other things could add to your overhead there. Can you avoid calling individual read_u8 calls and so on?

Copy link
Contributor Author

@luizirber luizirber Apr 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you are right, if it's too short, we could just fill out with the right number of blocks and if too long, we could just save those as extra capacity (this case needs a brief walk through the code to check if that's compatible) - sounds like a no-panic solution can work!

I removed the assert_eq! check and doing a .resize on the data now. Keeping the extra capacity involves changing other places, like the .as_slice method (which will return something potentially larger than the initialized capacity), and there is also the need to reset any extra capacity too.

I kept a failing test to discuss this:

let fb = FixedBitSet::with_capacity_and_blocks(1, vec![8u32, 24u32]);
assert!(!fb.contains(3));

since the capacity is 1, the .contains(3) call should be false. Right now it is true, because that first block is 8u32.

So, there is an extra step of setting any bit > capacity to 0 to make the test pass.

Or should it be panicking with out of bounds (since it is more than the capacity)?

Your code looks a bit noisy, other things could add to your overhead there. Can you avoid calling individual read_u8 calls and so on?

working on that, it was much, much worse...
(I was parsing everything with read_u8, and collecting set bits, and then doing a .from_iter before...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the capacity is 1, the .contains(3) call should be false. Right now it is true, because that first block is 8u32.

It's really a format error that the blocks contain something else than zero past the length. This is something fixedbitset assumes, and it is not 100% enforced in the interface.

It should probably not panic for oob - any bit outside our capacity is assumed to be not set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disabled the bits past capacity, thanks for the patience!

let (mut n_blocks, rem) = div_rem(bits, BITS);
n_blocks += (rem > 0) as usize;
let mut data: Vec<Block> = blocks.into_iter().collect();
// Pad data with zeros if smaller or truncate if larger
if data.len() != n_blocks {
data.resize(n_blocks, 0);
}
// Disable bits in blocks beyond capacity
let end = data.len() * 32;
for (block, mask) in Masks::new(bits..end, end) {
unsafe {
*data.get_unchecked_mut(block) &= !mask;
}
}
FixedBitSet {
data: data,
length: bits,
}
}

/// Grow capacity to **bits**, all new bits initialized to zero
pub fn grow(&mut self, bits: usize) {
let (mut blocks, rem) = div_rem(bits, BITS);
Expand Down Expand Up @@ -705,6 +733,41 @@ fn it_works() {
fb.clear();
}

#[test]
fn with_blocks() {
let fb = FixedBitSet::with_capacity_and_blocks(50, vec![8u32, 0u32]);
assert!(fb.contains(3));

let ones: Vec<_> = fb.ones().collect();
assert_eq!(ones.len(), 1);
}

#[test]
fn with_blocks_too_small() {
let mut fb = FixedBitSet::with_capacity_and_blocks(500, vec![8u32, 0u32]);
fb.insert(400);
assert!(fb.contains(400));
}

#[test]
fn with_blocks_too_big() {
let fb = FixedBitSet::with_capacity_and_blocks(1, vec![8u32]);

// since capacity is 1, 3 shouldn't be set here
assert!(!fb.contains(3));
}

#[test]
fn with_blocks_too_big_range_check() {
let fb = FixedBitSet::with_capacity_and_blocks(1, vec![0xff]);

// since capacity is 1, only 0 should be set
assert!(fb.contains(0));
for i in 1..0xff {
assert!(!fb.contains(i));
}
}

#[test]
fn grow() {
let mut fb = FixedBitSet::with_capacity(48);
Expand Down