[Minor] Improve arrow and parquet READMEs, document parquet feature flags (#2324)

* Improve arrow and parquet READMEs, document parquet feature flags

* Fixup

* Move performance tips to crates.io and leave a link

* Add link back to crates.io from lib.rs
alamb committed Aug 5, 2022
1 parent 297a8fa commit 8e30d06
Showing 5 changed files with 88 additions and 66 deletions.
81 changes: 56 additions & 25 deletions arrow/README.md
@@ -22,7 +22,10 @@
[![crates.io](https://img.shields.io/crates/v/arrow.svg)](https://crates.io/crates/arrow)
[![docs.rs](https://img.shields.io/docsrs/arrow.svg)](https://docs.rs/arrow/latest/arrow/)

This crate contains the official Native Rust implementation of [Apache Arrow][arrow] in memory format, governed by the Apache Software Foundation. Additional details can be found on [crates.io](https://crates.io/crates/arrow), [docs.rs](https://docs.rs/arrow/latest/arrow/) and [examples](https://github.com/apache/arrow-rs/tree/master/arrow/examples).
This crate contains the official Native Rust implementation of [Apache Arrow][arrow] in memory format, governed by the Apache Software Foundation.

The [crate documentation](https://docs.rs/arrow/latest/arrow/) contains examples and the full API.
There are several [examples](https://github.com/apache/arrow-rs/tree/master/arrow/examples) to start from as well.

## Rust Version Compatibility

@@ -34,18 +37,24 @@ The arrow crate follows the [SemVer standard](https://doc.rust-lang.org/cargo/re

However, for historical reasons, this crate uses versions with major numbers greater than `0.x` (e.g. `19.0.0`), unlike many other crates in the Rust ecosystem which spend extended time releasing versions `0.x` to signal planned ongoing API changes. Minor arrow releases contain only compatible changes, while major releases may contain breaking API changes.

## Features
## Feature Flags

The arrow crate provides the following features which may be enabled:
The `arrow` crate provides the following features which may be enabled in your `Cargo.toml`:

- `csv` (default) - support for reading and writing Arrow arrays to/from csv files
- `ipc` (default) - support for the [arrow-flight](https://crates.io/crates/arrow-flight) IPC and wire format
- `prettyprint` - support for formatting record batches as textual columns
- `js` - support for building arrow for WebAssembly / JavaScript
- `simd` - (_Requires Nightly Rust_) alternate optimized
- `simd` - (_Requires Nightly Rust_) Use alternate hand optimized
implementations of some [compute](https://github.com/apache/arrow-rs/tree/master/arrow/src/compute/kernels)
kernels using explicit SIMD instructions available through [packed_simd_2](https://docs.rs/packed_simd_2/latest/packed_simd_2/).
kernels using explicit SIMD instructions via [packed_simd_2](https://docs.rs/packed_simd_2/latest/packed_simd_2/).
- `chrono-tz` - support for parsing time zones using [chrono-tz](https://docs.rs/chrono-tz/0.6.0/chrono_tz/)
- `ffi` - bindings for the Arrow [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html)
- `pyarrow` - bindings for pyo3 to call arrow-rs from Python
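
As a hedged illustration of the feature list above, a downstream `Cargo.toml` might opt out of the defaults and select features explicitly. The version number is only the example version mentioned in the versioning section, and the feature selection is an assumption made for illustration, not a recommendation:

```toml
[dependencies]
# Hypothetical selection: disable the default features and enable only
# `csv` and `prettyprint` from the list above. Replace the version with
# the release you actually depend on.
arrow = { version = "19.0.0", default-features = false, features = ["csv", "prettyprint"] }
```

Leaving out `default-features = false` keeps the features marked `(default)` enabled in addition to any listed under `features`.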

## Arrow Feature Status

The [Apache Arrow Status](https://arrow.apache.org/docs/status.html) page lists which features of Arrow this crate supports.

## Safety

@@ -55,25 +64,25 @@ Arrow seeks to uphold the Rust Soundness Pledge as articulated eloquently [here]
Where soundness in turn is defined as:

> Code is unable to trigger undefined behaviour using safe APIs
> Code is unable to trigger undefined behavior using safe APIs
One way to ensure this would be to not use `unsafe`, however, as described in the opening chapter of the [Rustonomicon](https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html) this is not a requirement, and flexibility in this regard is actually one of Rust's great strengths.
One way to ensure this would be to not use `unsafe`, however, as described in the opening chapter of the [Rustonomicon](https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html) this is not a requirement, and flexibility in this regard is one of Rust's great strengths.

In particular there are a number of scenarios where `unsafe` is largely unavoidable:

* Invariants that cannot be statically verified by the compiler and unlock non-trivial performance wins, e.g. values in a StringArray are UTF-8, [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html) iterators, etc...
* FFI
* SIMD
- Invariants that cannot be statically verified by the compiler and unlock non-trivial performance wins, e.g. values in a StringArray are UTF-8, [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html) iterators, etc...
- FFI
- SIMD

Additionally, this crate exposes a number of `unsafe` APIs, allowing downstream crates to explicitly opt-out of potentially expensive invariant checking where appropriate.
Additionally, this crate exposes a number of `unsafe` APIs, allowing downstream crates to explicitly opt-out of potentially expensive invariant checking where appropriate.

We have a number of strategies to help reduce this risk:

* Provide strongly-typed `Array` and `ArrayBuilder` APIs to safely and efficiently interact with arrays
* Extensive validation logic to safely construct `ArrayData` from untrusted sources
* All commits are verified using [MIRI](https://github.com/rust-lang/miri) to detect undefined behaviour
* We provide a `force_validate` feature that enables additional validation checks for use in test/debug builds
* There is ongoing work to reduce and better document the use of unsafe, and we welcome contributions in this space
- Provide strongly-typed `Array` and `ArrayBuilder` APIs to safely and efficiently interact with arrays
- Extensive validation logic to safely construct `ArrayData` from untrusted sources
- All commits are verified using [MIRI](https://github.com/rust-lang/miri) to detect undefined behaviour
- Use a `force_validate` feature that enables additional validation checks for use in test/debug builds
- There is ongoing work to reduce and better document the use of unsafe, and we welcome contributions in this space

## Building for WASM

@@ -101,16 +110,38 @@ cargo run --example read_csv

[arrow]: https://arrow.apache.org/

## Performance Tips

## Performance
Arrow aims to be as fast as possible out of the box, whilst not compromising on safety. However,
it relies heavily on LLVM auto-vectorisation to achieve this. Unfortunately the LLVM defaults,
particularly for x86_64, favour portability over performance, and LLVM will consequently avoid
using more recent instructions that would result in errors on older CPUs.

Most of the compute kernels benefit a lot from being optimized for a specific CPU target.
This is especially so on x86-64 since without specifying a target the compiler can only assume support for SSE2 vector instructions.
One of the following values as `-Ctarget-cpu=value` in `RUSTFLAGS` can therefore improve performance significantly:
To address this it is recommended that you override the LLVM defaults either
by setting the `RUSTFLAGS` environment variable, or by setting `rustflags` in your
[Cargo configuration](https://doc.rust-lang.org/cargo/reference/config.html).

- `native`: Target the exact features of the cpu that the build is running on.
This should give the best performance when building and running locally, but should be used carefully for example when building in a CI pipeline or when shipping pre-compiled software.
- `x86-64-v3`: Includes AVX2 support and is close to the intel `haswell` architecture released in 2013 and should be supported by any recent Intel or Amd cpu.
- `x86-64-v4`: Includes AVX512 support available on intel `skylake` server and `icelake`/`tigerlake`/`rocketlake` laptop and desktop processors.
Enable all features supported by the current CPU

These flags should be used in addition to the `simd` feature, since they will also affect the code generated by the simd library.
```ignore
RUSTFLAGS="-C target-cpu=native"
```

Enable all features supported by the current CPU, and enable full use of AVX512

```ignore
RUSTFLAGS="-C target-cpu=native -C target-feature=-prefer-256-bit"
```

Enable all features supported by CPUs more recent than haswell (2013)

```ignore
RUSTFLAGS="-C target-cpu=haswell"
```

For a full list of features and target CPUs use

```shell
$ rustc --print target-cpus
$ rustc --print target-features
```
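
The snippets above set `RUSTFLAGS` in the environment. A minimal sketch of the Cargo configuration alternative, assuming a project-local `.cargo/config.toml` file (the location is an assumption; Cargo also reads other configuration paths), might look like this:

```toml
# .cargo/config.toml (assumed to live in the project root)
[build]
# Intended to match RUSTFLAGS="-C target-cpu=native"
rustflags = ["-C", "target-cpu=native"]
```

If the `RUSTFLAGS` environment variable is also set, Cargo uses it in preference to `build.rustflags`, so the two mechanisms are best not mixed.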
6 changes: 3 additions & 3 deletions arrow/src/compute/README.md
@@ -33,16 +33,16 @@ We use the term "kernel" to refer to particular general operation that contains

Types of functions

* Scalar functions: elementwise functions that perform scalar operations in a
- Scalar functions: elementwise functions that perform scalar operations in a
vectorized manner. These functions are generally valid for SQL-like
context. These are called "scalar" in that the functions executed consider
each value in an array independently, and the output array or arrays have the
same length as the input arrays. The result for each array cell is generally
independent of its position in the array.
* Vector functions, which produce a result whose output is generally dependent
- Vector functions, which produce a result whose output is generally dependent
on the entire contents of the input arrays. These functions **are generally
not valid** for SQL-like processing because the output size may be different
than the input size, and the result may change based on the order of the
values in the array. This includes things like array subselection, sorting,
hashing, and more.
* Scalar aggregate functions of which can be used in a SQL-like context
- Scalar aggregate functions of which can be used in a SQL-like context
37 changes: 2 additions & 35 deletions arrow/src/lib.rs
@@ -18,41 +18,8 @@
//! A complete, safe, native Rust implementation of [Apache Arrow](https://arrow.apache.org), a cross-language
//! development platform for in-memory data.
//!
//! # Performance Tips
//!
//! Arrow aims to be as fast as possible out of the box, whilst not compromising on safety. However,
//! it relies heavily on LLVM auto-vectorisation to achieve this. Unfortunately the LLVM defaults,
//! particularly for x86_64, favour portability over performance, and LLVM will consequently avoid
//! using more recent instructions that would result in errors on older CPUs.
//!
//! To address this it is recommended that you override the LLVM defaults either
//! by setting the `RUSTFLAGS` environment variable, or by setting `rustflags` in your
//! [Cargo configuration](https://doc.rust-lang.org/cargo/reference/config.html).
//!
//! Enable all features supported by the current CPU
//!
//! ```ignore
//! RUSTFLAGS="-C target-cpu=native"
//! ```
//!
//! Enable all features supported by the current CPU, and enable full use of AVX512
//!
//! ```ignore
//! RUSTFLAGS="-C target-cpu=native -C target-feature=-prefer-256-bit"
//! ```
//!
//! Enable all features supported by CPUs more recent than haswell (2013)
//!
//! ```ignore
//! RUSTFLAGS="-C target-cpu=haswell"
//! ```
//!
//! For a full list of features and target CPUs use
//!
//! ```ignore
//! $ rustc --print target-cpus
//! $ rustc --print target-features
//! ```
//! Please see the [arrow crates.io](https://crates.io/crates/arrow)
//! page for feature flags and tips to improve performance.
//!
//! # Columnar Format
//!
27 changes: 24 additions & 3 deletions parquet/README.md
@@ -19,17 +19,38 @@

# Apache Parquet Official Native Rust Implementation

[![Crates.io](https://img.shields.io/crates/v/parquet.svg)](https://crates.io/crates/parquet)
[![crates.io](https://img.shields.io/crates/v/parquet.svg)](https://crates.io/crates/parquet)
[![docs.rs](https://img.shields.io/docsrs/parquet.svg)](https://docs.rs/parquet/latest/parquet/)

This crate contains the official Native Rust implementation of [Apache Parquet](https://parquet.apache.org/), which is part of the [Apache Arrow](https://arrow.apache.org/) project.

See [crate documentation](https://docs.rs/parquet/latest/parquet/) for examples and the full API.

## Rust Version Compatbility
## Rust Version Compatibility

This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler.

## Features
## Versioning / Releases

The `parquet` crate follows the [SemVer standard](https://doc.rust-lang.org/cargo/reference/semver.html) defined by Cargo and works well within the Rust crate ecosystem.

However, for historical reasons, this crate uses versions with major numbers greater than `0.x` (e.g. `19.0.0`), unlike many other crates in the Rust ecosystem which spend extended time releasing versions `0.x` to signal planned ongoing API changes. Minor parquet releases contain only compatible changes, while major releases may contain breaking API changes.

## Feature Flags

The `parquet` crate provides the following features which may be enabled in your `Cargo.toml`:

- `arrow` (default) - support for reading / writing [`arrow`](https://crates.io/crates/arrow) arrays to / from parquet
- `async` - support for `async` APIs for reading parquet
- `json` - support for reading / writing `json` data to / from parquet
- `brotli` (default) - support for parquet using `brotli` compression
- `flate2` (default) - support for parquet using `gzip` compression
- `lz4` (default) - support for parquet using `lz4` compression
- `zstd` (default) - support for parquet using `zstd` compression
- `cli` - parquet [CLI tools](https://github.com/apache/arrow-rs/tree/master/parquet/src/bin)
- `experimental` - Experimental APIs which may change, even between minor releases
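
As a sketch of how these flags might be combined, the following `Cargo.toml` fragment trims the default compression codecs and adds the non-default `async` APIs. The version is only the example version mentioned in the versioning section, and the particular selection is hypothetical:

```toml
[dependencies]
# Hypothetical selection: keep Arrow integration and zstd compression,
# drop the other default codecs, and add the `async` reader APIs.
parquet = { version = "19.0.0", default-features = false, features = ["arrow", "zstd", "async"] }
```

With a trimmed selection like this, files written with a codec that is not enabled would presumably fail to read, so the choice depends on the data you expect to handle.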

## Parquet Feature Status

- [x] All encodings supported
- [x] All compression codecs supported
3 changes: 3 additions & 0 deletions parquet/src/lib.rs
@@ -19,6 +19,9 @@
//! [Apache Parquet](https://parquet.apache.org/), part of
//! the [Apache Arrow](https://arrow.apache.org/) project.
//!
//! Please see the [parquet crates.io](https://crates.io/crates/parquet)
//! page for feature flags and tips to improve performance.
//!
//! # Getting Started
//! Start with some examples:
//!