Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arithmatic overflow leads to segfault in concat_batches #3123

Closed
msalib opened this issue Nov 16, 2022 · 3 comments · Fixed by #3157
Closed

arithmatic overflow leads to segfault in concat_batches #3123

msalib opened this issue Nov 16, 2022 · 3 comments · Fixed by #3157
Labels
arrow Changes to the arrow crate bug

Comments

@msalib
Copy link
Contributor

msalib commented Nov 16, 2022

Describe the bug

I can reliably reproduce a bug where concat_batches performs an addition that overflows (see https://github.com/apache/arrow-rs/blob/master/arrow-data/src/transform/utils.rs#L39). In release mode, the result for my application and data is a segfault on linux. When I run the same application in debug mode, I get a panic since checked arithmetic panics in debug mode.

To Reproduce

I've tried to minimize this as much as I can. Checkout the repo https://github.com/msalib/broken-arrow and run the following command:

RUST_BACKTRACE=full cargo run --release -- redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet

You should see a panic with backtrace (I'm running on x86-64 linux). That repo includes a Cargo.toml which enables overflow-checks in release builds because I got tired of waiting for slower debug builds. If you run the same command with just one less copy of redacted.parquet, it will succeed without panicing.

Expected behavior

The code should not segfault. Either the arithmetic should be checked for overflow so we get a panic or (much more preferable) the overflow should be avoided.

Additional context

Here's the full backtrace from my application (code and data not included here because it is proprietary, but we're using color-eyre and tracing so we get more detailed backtrace):

  10: core::panicking::panic::hb3ad04c589a0e3c8
      at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/panicking.rs:48
  11: <i32 as core::ops::arith::Add>::add::h9c2a7b4ebe56cff6
      at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/ops/arith.rs:126
  12: arrow_data::transform::utils::extend_offsets::{{closure}}::hd8682bb256e4642b
      at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-data-24.0.0/src/transform/utils.rs:39
        37 │         // compute the new offset
        38 │         let length = offsets[1] - offsets[0];
        39 >         last_offset = last_offset + length;
        40 │         buffer.push(last_offset);
        41 │     });
  13: core::iter::traits::iterator::Iterator::for_each::call::{{closure}}::h0ad46e165b780cd0
      at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/iter/traits/iterator.rs:828
  14: core::iter::traits::iterator::Iterator::fold::hf19d319b4c63662a
      at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/iter/traits/iterator.rs:2414
  15: core::iter::traits::iterator::Iterator::for_each::hdc528a3ba10831bd
      at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/iter/traits/iterator.rs:831
  16: arrow_data::transform::utils::extend_offsets::h2e5dcad89a1d4f4a
      at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-data-24.0.0/src/transform/utils.rs:36
        34 │ ) {
        35 │     buffer.reserve(offsets.len() * std::mem::size_of::<T>());
        36 >     offsets.windows(2).for_each(|offsets| {
        37 │         // compute the new offset
        38 │         let length = offsets[1] - offsets[0];
  17: arrow_data::transform::variable_size::build_extend::{{closure}}::ha20bde03e4f6553e
      at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-data-24.0.0/src/transform/variable_size.rs:57
        55 │                 let last_offset = unsafe { get_last_offset(offset_buffer) };
        56 │ 
        57 >                 extend_offsets::<T>(
        58 │                     offset_buffer,
        59 │                     last_offset,
  18: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h6cf66436bc24cef4
      at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/alloc/src/boxed.rs:1949
  19: arrow_data::transform::MutableArrayData::extend::h13a9c5461737bf0f
      at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-data-24.0.0/src/transform/mod.rs:603
       601 │         let len = end - start;
       602 │         (self.extend_null_bits[index])(&mut self.data, start, len);
       603 >         (self.extend_values[index])(&mut self.data, index, start, len);
       604 │         self.data.len += len;
       605 │     }
  20: arrow::compute::kernels::concat::concat::h1efb0d4d54ee8ec9
      at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-24.0.0/src/compute/kernels/concat.rs:100
        98 │ 
        99 │     for (i, len) in lengths.iter().enumerate() {
       100 >         mutable.extend(i, 0, *len)
       101 │     }
       102 │ 
  21: arrow::compute::kernels::concat::concat_batches::ha80396bf3c3d8b6c
      at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-24.0.0/src/compute/kernels/concat.rs:127
       125 │     let mut arrays = Vec::with_capacity(field_num);
       126 │     for i in 0..field_num {
       127 >         let array = concat(
       128 │             &batches
       129 │                 .iter()

Thanks so much! We rely heavily on arrow+parquet and they've made our lives a lot easier!

@msalib msalib added the bug label Nov 16, 2022
@msalib
Copy link
Contributor Author

msalib commented Nov 16, 2022

More investigation suggests that this happens when last_offset = 2147483550 and length = 543, so we're overflowing an i32, which is an odd choice for an index type....

@tustvold
Copy link
Contributor

Thank you for the report and reproducer, I plan to look into this next week unless somebody else gets there first. I suspect MutableArrayData is not doing a checked operation when it should be - this should return an error, or at the very least panic.

As for the use of i32, this is a Javaism that unfortunately made its way into the arrow specification. You could try using LargeUtf8 which uses i64, but you may also want to use smaller batches.

tustvold added a commit to tustvold/arrow-rs that referenced this issue Nov 22, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Nov 22, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Nov 22, 2022
tustvold added a commit that referenced this issue Nov 22, 2022
* Check overflow in MutableArrayData extend offsets (#3123)

* Update arrow-data/src/transform/utils.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb
Copy link
Contributor

alamb commented Nov 25, 2022

label_issue.py automatically added labels {'arrow'} from #3157

@alamb alamb added the arrow Changes to the arrow crate label Nov 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants