Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] double free or memory corruption when parsing some JSON #15750

Open
revans2 opened this issue May 14, 2024 · 6 comments
Open

[BUG] double free or memory corruption when parsing some JSON #15750

revans2 opened this issue May 14, 2024 · 6 comments
Assignees
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@revans2
Copy link
Contributor

revans2 commented May 14, 2024

Describe the bug

We recently saw a SIGSEGV when trying to parse some customer data using JSON. As I stripped down and obfuscated the data it switched to the process aborting with the error.

double free or corruption (!prev)
Aborted (core dumped)

The code to reproduce this is

TEST_F(JsonReaderTest, JsonLinesCrash)
{
  std::string const fname = "from_json_1_10_simplified.json";

  cudf::io::json_reader_options options =
    cudf::io::json_reader_options::builder(cudf::io::source_info{fname})
    .lines(true)
    .recovery_mode(cudf::io::json_recovery_mode_t::RECOVER_WITH_NULL)
    .normalize_single_quotes(true)
    .normalize_whitespace(true)
    .mixed_types_as_string(true)
    .keep_quotes(true);

  auto result = cudf::io::read_json(options);
}

and the file is
from_json_1_10_simplified.json

@revans2 revans2 added the bug Something isn't working label May 14, 2024
@revans2
Copy link
Contributor Author

revans2 commented May 14, 2024

Just be aware that I have hit other issues with this data too. IllegalMemoryAccess errors, a hang where it looks like we are in an infinite loop, and exceptions saying that the data looks to be incorrectly formatted (top level arrays mixed with structs appears to make this very unhappy).

I am happy to share more data/failures as needed. It just takes a lot of work to obfuscate it properly and still have an appropriate repro case. I suspect that they are almost all related to each other in some form.

@shrshi
Copy link
Contributor

shrshi commented May 14, 2024

Thank you for the repro and the dataset, @revans2
I'm investigating the segfault and get back to you soon.

@shrshi shrshi self-assigned this May 14, 2024
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels May 15, 2024
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone May 15, 2024
@karthikeyann
Copy link
Contributor

I ran via compute-sanitizer with the repro code. but there are no CUDA errors.
I ran with gdb, and got line of segfault.

#16 operator()<int, cudf::io::json::device_json_column> (col=..., i=16, __closure=<synthetic pointer>) at /home/coder/cudf/cpp/src/io/json/json_column.cu:597
597         col.child_columns.clear();  // their references should be deleted too.

If this line is commented out, no segfault occurs.

@karthikeyann
Copy link
Contributor

Minimal repro with valgrind errors

{ "Root": { "Key": [ { "EE": "A" } ] } }
{ "Root": { "Key": {  } } }
{ "Root": { "Key": [{ "YY": 1}] } }

Issue is not in the clear() function call, but accessing deleted reference of col. when clear() is called, the children columns are deleted, but their references still are stored in columns. That's what leads to the segfault, it tries to access a deleted object's reference. This algorithm could be rewritten as graphs algorithm to avoid this issue.

@karthikeyann
Copy link
Contributor

karthikeyann commented May 21, 2024

Created a fix for memory errors #15798. Long term fix should be rewriting the device tree creation algorithm

@karthikeyann
Copy link
Contributor

karthikeyann commented May 21, 2024

Discussed with @shrshi offline about the bug and need for refactor of make_device_json_column() in json_column.cu#L477

make_device_json_column function's complexity increased after addition of more features, and the current algorithm does not use any graph data-structure. It uses vectors to keep track of ignoring columns, mixed types, pruned columns and uses references. This code could be refactored, so it will be easy to add new features too.

Here are the list of features that is currently supported by make_device_json_column function and few other features that needs to be supported.

  1. constructs nested tree from parent_col_id vector
  2. Handle array of arrays too (pandas feature) (eg. [[1, 2, 3], [4, 5, 6]])
  3. null literals are ignored (pandas)
  4. mixed types support. (spark)
    • current implementation returns mixed types as strings. @revans2 clarified that mixed types need to return requested type and ignore all other types. (Eg. if list is requested, all struct, string, values needs to be ignored for same column path).
  5. Request any column as string type and corresponding children columns should be ignored (no extra memory, no extra processing). (spark)
  6. Column pruning
  7. TODO: optimization to batch the init_to_zero call with cub::DeviceMemcpy::Batched (consider batching scan of offsets too).
  8. OPTIONAL: To support map type. (Need more details about requirements from spark team).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: In Progress
Status: In progress
Development

No branches or pull requests

5 participants
@revans2 @karthikeyann @shrshi @GregoryKimball and others