
tweak f64 deserialization to make float roundtrips work #671

Merged
merged 4 commits into from Jun 7, 2020

Conversation


@japaric japaric commented May 13, 2020

This is a new attempt at fixing issue #536, using the latest version of @Alexhuszagh's minimal-lexical library for deserialization of f64 values. The goal is to make float roundtrips work without severely impacting compile times.

I have:

  • checked that the crate still compiles with Rust 1.31
  • added a sample of roundtrip tests that fail with the current deserializer but pass with this branch (I also did some testing with quickcheck but didn't include it here as it would bring in a new dev-dependency)
  • ran this loop test for over 45 minutes
  • measured changes in compile time using perf (see table below)
  • benchmarked this branch and compared it against master (results at the bottom)

Note that:

  • as used here (with default features off), the minimal-lexical crate itself does not have any dependencies
  • but minimal-lexical will then use a heap-allocated Vec for its internal, intermediate buffer instead of a stack-allocated one (which would bring in the arrayvec dependency)
  • there's also the option of enabling the arrayvec feature or writing a custom stack-allocated Vec that doesn't bring in any extra dependencies -- though the heap-allocated Vec doesn't seem to degrade performance as measured by json-benchmark

compile time changes

(compiler used: 1.43.1)

I measured perf stat cargo $command on the serde_json crate after a cargo clean, for these values of $command: check, build, build --release, and a faster version of build --release. The table reports the change in instructions:u both in absolute value and as a percentage. It also includes the initial wall time and its change, to give a more "human" feel for the difference. I ran perf a few times for each command; the values reported here are medians.

| Command             | Δ insn           | Time (master) | Δ Time             |
|---------------------|------------------|---------------|--------------------|
| check               | +1.848G (+1.6%)  | 14.584s       | +0.232s (+1.59%)   |
| build               | +3.442G (+2.75%) | 15.471s       | +0.323s (+2.09%)   |
| build --release     | +6.004G (+1.46%) | 30.432s       | +0.675s (+2.219%)  |
| build --release++   | +6.219G (+5.28%) | 14.729s       | +0.31s (+2.106%)   |

The faster build --release consists of adding the profile override shown below to compile all build scripts and proc macros without optimizations.

```toml
[profile.release.build-override]
codegen-units = 16
debug = false
debug-assertions = false
opt-level = 0
overflow-checks = false
```

(Off-topic: your mileage may vary with this profile trick. I know that un-optimized bindgen runs very slowly, so this actually backfires when you have it in your dependency graph -- or at least that was the case many months ago; I haven't used bindgen with embedded code in a long time.)

runtime perf

I ran json-benchmark with this branch and master and got these parse results. Again, all values are median values out of a few runs.

| Dataset      | DOM - master | DOM - PR | STRUCT - master | STRUCT - PR |
|--------------|--------------|----------|-----------------|-------------|
| canada       | 250 MB/s     | 260 MB/s | 480 MB/s        | 480 MB/s    |
| citm_catalog | 380 MB/s     | 400 MB/s | 830 MB/s        | 810 MB/s    |
| twitter      | 280 MB/s     | 280 MB/s | 610 MB/s        | 620 MB/s    |

These values fluctuated quite a bit so I wouldn't say this PR improves or degrades performance.

@dtolnay do these changes in compile time, in exchange for the round-trip correctness, seem like an acceptable trade-off to you?


rw commented May 13, 2020

WDYT about testing every f32? :-]


japaric commented May 14, 2020

> WDYT about testing every f32? :-]

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Instant;

use rayon::prelude::*;

let start = Instant::now();
static N: AtomicU32 = AtomicU32::new(0);
(0..=u32::MAX).into_par_iter().for_each(|i| {
    let input = f32::from_bits(i) as f64;
    if input.is_finite() {
        let json = serde_json::to_string(&input).unwrap();
        let output: f64 = serde_json::from_str(&json).unwrap();
        if input != output {
            let x = N.fetch_add(1, Ordering::Relaxed);
            println!("errors: {} with {:#010x}", x + 1, i);
        }
    }
});
let end = Instant::now();

println!("errors: {}", N.load(Ordering::Relaxed));
println!("DONE in {:?}", end - start);
```

```text
DONE in 380.674042494s
```

That works fine but takes 6-7 minutes so probably way too long to run on CI (which probably has less parallelism).

If I deserialize/serialize to f32 instead of f64 I get two errors though:

```diff
-        let input = f32::from_bits(i) as f64;
+        let input = f32::from_bits(i);
         if input.is_finite() {
             let json = serde_json::to_string(&input).unwrap();
-            let output: f64 = serde_json::from_str(&json).unwrap();
+            let output: f32 = serde_json::from_str(&json).unwrap();
```

```text
errors: 1 with 0x95ae43fd
errors: 2 with 0x15ae43fd
```

Will look into the errors.


japaric commented May 14, 2020

> errors: 1 with 0x95ae43fd
> errors: 2 with 0x15ae43fd

OK, I see the issue. When serializing, we use the f32 variant of ryu for f32 values and the f64 variant for f64 inputs. When deserializing, everything goes through the 64-bit variant of minimal-lexical and the output then gets cast to an f32 if the user requested that; doing it that way results in an error of 1 ULP for 0x15ae43fd (0x95ae43fd is the negated value).

Using the 32-bit variant of minimal-lexical (on the ryu-stringified value of 0x15ae43fd) does produce the correct value, so we should use that one when the requested output is the f32 type.

EDIT: added some code to visualize this better.

```rust
let x = 0x15ae43fd;
let y = f32::from_bits(x);
println!("u32: {:#010x}", x);
println!("std: {}", y);
println!("ryu: {}", ryu::Buffer::new().format_finite(y));
let a = minimal_lexical::parse_float::<f32, _, _>(b"7".iter(), b"038531".iter(), -26);
let b = minimal_lexical::parse_float::<f64, _, _>(b"7".iter(), b"038531".iter(), -26) as f32;
println!("ml32: {} ({:#010x})", a, a.to_bits());
println!("ml64: {} ({:#010x})", b, b.to_bits());
```

```text
u32: 0x15ae43fd
std: 0.00000000000000000000000007038531
ryu: 7.038531e-26
ml32: 0.00000000000000000000000007038531 (0x15ae43fd)
ml64: 0.000000000000000000000000070385313 (0x15ae43fe)
```

instead of using the 64-bit version and then casting down to f32

japaric commented May 14, 2020

> we should use that one if the requested output is the f32 type.

Done, the f32 roundtrip now works for all finite f32 values (and the program seems to terminate a bit faster?). This change brings in the f32 monomorphization of parse_float; that seems to increase the number of instructions reported by perf stat cargo build by ~0.13%.

rw commented May 15, 2020

@japaric The serde_json::to_string is probably allocating on every execution. I'd try re-using a String as a buffer, instead.


japaric commented May 15, 2020

Ah, yes, you are right. Deserialization also allocates because Deserializer contains a Vec (that was there before this PR). I don't see a way to re-use a Deserializer to avoid that Vec being allocated and deallocated each time a value is deserialized, though, at least not through the public API.

Using to_writer, testing the whole 32-bit range takes ~11 minutes in single-threaded mode, ~5 minutes with 8 threads (using just std::thread and no external dependencies) and ~6 minutes with 4 threads -- I guess more threads don't help in the program I wrote because there's no load balancing. I can commit either version and if the test is deemed to take too long it could be run on a daily basis using cron (rather than on each commit / PR).

EDIT: for comparison, the rayon + to_string version also takes ~5 minutes.

rw commented May 15, 2020

@japaric Could you post your updated "zero-alloc in the hot sections" version?


japaric commented May 18, 2020

```rust
use std::thread;
use std::time::Instant;

const NTHREADS: u32 = 8;
const STEP: u32 = u32::MAX / NTHREADS;

let mut threads = vec![];
for i in 0..(NTHREADS - 1) {
    threads.push(thread::spawn(move || {
        let mut json = vec![];

        let start = Instant::now();
        for j in i * STEP..(i + 1) * STEP {
            let input = f32::from_bits(j);
            if input.is_finite() {
                json.clear();
                serde_json::to_writer(&mut json, &input).unwrap();
                let output: f32 = serde_json::from_slice(&json).unwrap();
                assert_eq!(input, output);
            }
        }
        println!("({}) DONE in {:?}", i, start.elapsed());
    }));
}

// The main thread handles the last chunk, including the remainder up to u32::MAX.
let start = Instant::now();
let mut json = vec![];
for j in (NTHREADS - 1) * STEP..=u32::MAX {
    let input = f32::from_bits(j);
    if input.is_finite() {
        json.clear();
        serde_json::to_writer(&mut json, &input).unwrap();
        let output: f32 = serde_json::from_slice(&json).unwrap();
        assert_eq!(input, output);
    }
}
println!("({}) DONE in {:?}", NTHREADS - 1, start.elapsed());

threads.into_iter().for_each(|t| t.join().unwrap());
```

is what I tried. It doesn't do load balancing but all threads complete in about the same amount of time (270-300s) so maybe load balancing is not too important.


japaric commented May 26, 2020

@dtolnay did you have a chance to look at this PR? 👀

@dtolnay dtolnay left a comment

Thanks for the PR!

I wasn't immediately able to reproduce your runtime perf numbers from the PR description. I mainly care about canada.json for now since that's the one that contains floats. For that file you got pretty much identical perf before and after this PR. On my laptop I get:

| Dataset | DOM - master | DOM - PR | STRUCT - master | STRUCT - PR |
|---------|--------------|----------|-----------------|-------------|
| canada  | 280 MB/s     | 130 MB/s | 490 MB/s        | 170 MB/s    |

which is a pretty big slowdown, and reproduces on 3 machines I tried. Here is the build command I am using:

```sh
CARGO_INCREMENTAL=0 \
    RUSTFLAGS='-C codegen-units=1 -C target-cpu=native' \
    cargo build \
    --release \
    --no-default-features \
    --features parse-dom,parse-struct,lib-serde,all-files
```

Can you double check that your benchmark result is actually using the changes from the PR?

Less importantly, I noticed this extremely suspicious code in minimal-lexical which blindly writes past the end of a Vec. If there is some precondition on the capacity of the Vec there, then insert_many would need to be made unsafe to call, otherwise it seems like insert_many should take care of raising the capacity of the Vec when needed.

On this comment:

> there's also the option of enabling the arrayvec feature or writing a custom stack allocated Vec that doesn't bring any extra dependencies -- though the heap-allocated Vec doesn't seem to degrade performance as measured by json-benchmark

It's possible the reason no difference was observed is that the no_alloc Cargo feature in minimal-lexical looks like it currently doesn't do anything, since this code is looking at cfg(no_alloc) rather than cfg(feature = "no_alloc"). We should try it with RUSTFLAGS='--cfg no_alloc' + --features minimal-lexical/no_alloc.
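The distinction is between a raw compiler cfg and a Cargo feature. Roughly, in illustrative form (the type alias and capacity here are made-up names, not minimal-lexical's actual source):

```rust
// Only active when compiled with RUSTFLAGS='--cfg no_alloc'; building with
// `--features minimal-lexical/no_alloc` alone has no effect on this check:
#[cfg(no_alloc)]
type Limbs = arrayvec::ArrayVec<[u64; 64]>;

// What a Cargo-feature-gated check looks like instead, responding to the
// `no_alloc` feature in Cargo.toml:
#[cfg(feature = "no_alloc")]
type Limbs = arrayvec::ArrayVec<[u64; 64]>;
```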

```diff
@@ -16,6 +16,7 @@ edition = "2018"
 serde = { version = "1.0.100", default-features = false }
 indexmap = { version = "1.2", optional = true }
 itoa = { version = "0.4.3", default-features = false }
+minimal-lexical = { git = "https://github.com/Alexhuszagh/minimal-lexical" }
```

I think this needs to use default-features = false and then have our std feature in this crate enable "minimal-lexical/std". Otherwise there would end up being a std dependency through minimal-lexical even when serde_json/std is not enabled. I added a CI build in d6080d9 that would have caught this.

```text
   Compiling minimal-lexical v0.1.0 (https://github.com/Alexhuszagh/minimal-lexical#fa1e491b)
error[E0463]: can't find crate for `std`
```
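The Cargo.toml wiring described above would look roughly like this (a sketch; the feature plumbing is inferred from the comment, not copied from the final patch):

```toml
[dependencies]
minimal-lexical = { git = "https://github.com/Alexhuszagh/minimal-lexical", default-features = false }

[features]
std = ["minimal-lexical/std"]
```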


dtolnay commented Jun 7, 2020

Courtesy of libFuzzer, here is an input that triggers the out of bounds write:

```rust
fn main() {
    let j = "144440.000312571602869014401444400000312571602869014031257160286901444400000312144440000031257160286901444444440000312571602869014444000003125716028690144440000031257160286901444400000312571602869014444000031250312571602312571602860003125716023125716028694400003125716028690144440000031254400000312571602869014444000003125716028690144440000312503125716023125716028600031257160231257160286";
    let _ = serde_json::from_str::<f64>(j);
}
```
```text
==41341== Invalid write of size 8
==41341==    at 0x4841C24: memmove (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==41341==    by 0x123B6C: core::intrinsics::copy (intrinsics.rs:2127)
==41341==    by 0x11E02F: minimal_lexical::math::insert_many (math.rs:278)
==41341==    by 0x12EEBF: minimal_lexical::math::small::ishl_limbs (math.rs:664)
==41341==    by 0x12F154: minimal_lexical::math::small::ishl (math.rs:678)
==41341==    by 0x12CA89: minimal_lexical::math::Math::ishl (math.rs:1046)
==41341==    by 0x12CC2D: minimal_lexical::math::Math::imul_pow2 (math.rs:1025)
==41341==    by 0x127626: minimal_lexical::bhcomp::small_atof (bhcomp.rs:176)
==41341==    by 0x128778: minimal_lexical::bhcomp::bhcomp (bhcomp.rs:225)
==41341==    by 0x121DC0: minimal_lexical::algorithm::fallback_path (algorithm.rs:176)
==41341==    by 0x11BF6B: minimal_lexical::parse::parse_float (parse.rs:80)
==41341==    by 0x11657A: serde_json::de::Deserializer<R>::f64_from_parts (de.rs:752)
==41341==  Address 0x4ab8430 is 0 bytes after a block of size 160 alloc'd
==41341==    at 0x483A7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==41341==    by 0x13790B: alloc::alloc::alloc (alloc.rs:80)
==41341==    by 0x137E23: <alloc::alloc::Global as core::alloc::AllocRef>::alloc (alloc.rs:174)
==41341==    by 0x133E50: alloc::raw_vec::finish_grow (raw_vec.rs:514)
==41341==    by 0x1348DA: alloc::raw_vec::RawVec<T,A>::grow_amortized (raw_vec.rs:440)
==41341==    by 0x134346: alloc::raw_vec::RawVec<T,A>::try_reserve (raw_vec.rs:281)
==41341==    by 0x134B22: alloc::raw_vec::RawVec<T,A>::reserve (raw_vec.rs:267)
==41341==    by 0x130998: alloc::vec::Vec<T>::reserve (vec.rs:505)
==41341==    by 0x1387B3: minimal_lexical::math::reserve (math.rs:336)
==41341==    by 0x1323E0: <minimal_lexical::bignum::Bigint as core::default::Default>::default (bignum.rs:16)
==41341==    by 0x12CB30: minimal_lexical::math::Math::from_u64 (math.rs:991)
==41341==    by 0x12752D: minimal_lexical::bhcomp::small_atof (bhcomp.rs:145)
```

@dtolnay dtolnay left a comment

I am going ahead and merging this since it's pretty clear that an accurate float parser is worth having. I think the remaining work will be looking into whether we can recover some of the performance and then making a call as to whether this should be behind a feature and whether that feature should be on or off by default.

Thanks all!

@dtolnay dtolnay merged commit adafa93 into serde-rs:master Jun 7, 2020

rw commented Jun 12, 2020

@dtolnay Thanks! Should we create a new issue to dig into the unsafe usage in minimal-lexical?


Alexhuszagh commented Apr 27, 2021

Thanks for catching this. I just fuzzed minimal-lexical and this appeared as well; it has been patched in lexical upstream by replacing the insert_many algorithm, as per Alexhuszagh/rust-lexical#53. This was somehow never an issue in rust-lexical at all, but in porting lexical to minimal-lexical I apparently deleted that line due to the assumption of a stack-allocated buffer.

@rw This should now be fully patched in minimal-lexical, since that algorithm was the only unsafe usage. Still, this is a great reason to fuzz libraries and give code a second look, especially when porting. This would have been a very serious issue, so thank you.
