Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array --> Display formatter that supports more options and is configurable #3638

Closed
4 tasks
alamb opened this issue Jan 30, 2023 · 3 comments
Closed
4 tasks
Assignees
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers parquet Changes to the parquet crate

Comments

@alamb
Copy link
Contributor

alamb commented Jan 30, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There are several places where Arrays and record RecordBatches need to be converted to Strings for display or output purposes which are inconsistent and somewhat inefficient.

The core function is array_value_to_string https://docs.rs/arrow/31.0.0/arrow/util/display/fn.array_value_to_string.html

Note this function is quite inefficient and is unlikely to be suitable for converting large arrays or record batches.

One example of inconsistency is that when pretty printing https://docs.rs/arrow/31.0.0/arrow/util/pretty/index.html and writing out CSV it is possible to change the display formatting for data / timestamps but such control is not possible in array_value_to_string or while pretty printing.

The current state of the API makes it difficult to add customization like different representations for the NULL value -- see #2474 (cc @JasonLi-cn)

Describe the solution you'd like

What I would like is a configurable array formatter where formatting options can be specified and that does not need to return allocated Strings (but could write directly to a stream if necessary).

Here is how I would like to use it

// configure a formatter object that uses 'NULL' for the null value
// and prints out dates  like "1970-1-2" 
let formatter = ArrayFormatter::new()
  .with_null("NULL")
  .with_date_format("%Y-%m-%d");

// format an array cell
println!("array at index 42: {}", formatter.format_array(arr, 42));

// format an entire record batch using the same settings
// looks like the output of `pretty_format_batch`
println!("record batch:\n{}", formatter.format_batch(&batch))

The type signatures would look like

struct ArrayFormatter {
...
}

impl ArrayFormatter {

  /// Format `row` of `array`
  /// 
  /// returns `impl Display` to allow printing to `Write` without copying
  pub fn format_array(arr: &dyn Array, row: usize) -> Result<impl Display> {
    ...
  }

  /// Format `batch` into a pretty format using the formatting options
  /// 
  /// returns `impl Display` to support printing to `Write` without copying
  pub fn format_batch(batch: &RecordBatch) -> Result<impl Display> {
    ...
  }

...
}

Describe alternatives you've considered

Additional context

Pretty much all of the formatting implementation already exists in the display module after @JayjeetAtGithub 's great work to reduce discrepancies in the CSV writer #3514. I think the core part

To implement this feature, I recommend a series of smaller PRs:

  • starting with creating the basic API that has simple but inefficient implementation that calls existing methods in display
  • Switch the implementations of display to avoid creating intermediate Strings`
  • migrate CSV writer to use the new formatter
  • Add additional features like "null" format support

I think this is a good first issue as the desired API is spec'd out, and formatting logic already exists -- it just needs to be unified into a new API.

@alamb alamb added good first issue Good for newcomers arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog labels Jan 30, 2023
@tustvold
Copy link
Contributor

I think I would like to pick this one up, as I have opinions on what this should look like, and I want to take the opportunity to clean up the current code

@tustvold tustvold self-assigned this Jan 30, 2023
@alamb
Copy link
Contributor Author

alamb commented Jan 30, 2023

Another thing I would like out of this API would be a more natural avenue for streaming record batch conversion (as in https://docs.rs/arrow/31.0.0/arrow/util/pretty/fn.pretty_format_batches.html requires me to buffer up multiple batches of conversion which causes pain in the influxdb_iox client, among other things)

tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 1, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 1, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 1, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 2, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 3, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 4, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 4, 2023
tustvold added a commit that referenced this issue Feb 6, 2023
* Implement std::fmt::Write for StringBuilder (#3638)

* Add docs
tustvold added a commit that referenced this issue Feb 8, 2023
* Lazy array display (#3638)

* Update CSV writer

* Borrow

* Time formatting

* Update pretty

* Add FixedSizeBinaryArray

* Further tweaks

* Clippy

* More clippy

* More tweaks

* More clippy

* Clippy

* Use lexical_core

* Update doctest

* Review feedback

* Bump CI

* Review feedback
@tustvold tustvold added the parquet Changes to the parquet crate label Feb 24, 2023
@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'parquet'} from #3647

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers parquet Changes to the parquet crate
Projects
None yet
Development

No branches or pull requests

2 participants