Don't scan first column on empty projection #3214

Dandandan · 2022-08-21T09:35:19Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Depends on: #2603

When we run a query that does not need any column values, like SELECT COUNT(1) FROM table, the plan always reads the first column (whatever it is). This is inefficient: for formats like Parquet we can avoid scanning / reading the column and just produce the row counts. For non-columnar formats we can avoid unnecessary parsing (or implement a fast path, i.e. only counting lines).

Projection: Count(1)
  TableScan: test projection=[a]

Should become:

Projection: Count(1)
  TableScan: test projection=[]
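
To illustrate the "only counting lines" fast path for non-columnar formats, here is a minimal sketch using only the standard library. The function name `count_csv_rows` is hypothetical and not part of DataFusion or arrow-rs, and the sketch assumes `\n` line endings and no newlines embedded in quoted fields:

```rust
/// Count data rows in CSV bytes without parsing any field.
/// Assumption: '\n' line endings and no newlines inside quoted fields.
fn count_csv_rows(data: &[u8], has_header: bool) -> usize {
    let mut lines = data.iter().filter(|&&b| b == b'\n').count();
    // A final line without a trailing newline still counts as a row.
    if let Some(&last) = data.last() {
        if last != b'\n' {
            lines += 1;
        }
    }
    if has_header {
        // The header line is not a data row.
        lines.saturating_sub(1)
    } else {
        lines
    }
}
```

A real reader would also have to handle quoting and `\r\n`, so this only shows why skipping field parsing can be cheap.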

Describe the solution you'd like
We can push the responsibility of dealing with producing an array with a certain number of rows into the individual readers / other parts of the plans. They should produce RecordBatches with the number of rows.
We should remove the line projection.insert(0); from projection push down.
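
A rough sketch of what removing that line changes, written against a hypothetical stand-in for the column-collection step (the function `scan_projection` and its `force_one_column` flag are illustrative assumptions, not the actual optimizer code):

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the column-collection step of projection push down.
/// `required` holds the indices of the columns the query actually references.
fn scan_projection(required: &BTreeSet<usize>, force_one_column: bool) -> Vec<usize> {
    let mut projection = required.clone();
    if force_one_column && projection.is_empty() {
        // The workaround this issue proposes to remove:
        // always read column 0 when nothing else is needed.
        projection.insert(0);
    }
    projection.into_iter().collect()
}
```

With the flag set, a query that references no columns still yields `[0]` (the first column is scanned); without it, the projection stays empty and the reader only has to report row counts.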

Describe alternatives you've considered

Additional context
Some queries in the ClickBench benchmark show this performance issue (https://benchmark.clickhouse.com/ ):

| logical_plan  | Projection: #COUNT(UInt8(1))                                                                                                       |
|               |   Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]                                                                                |
|               |     TableScan: hits projection=[WatchID]                                                                                           |

alamb · 2022-08-21T10:55:48Z

👍 this is an important optimization as select count(*) type queries are so common

HaoYang670 · 2022-08-29T06:20:31Z

I find this comment: https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/projection_push_down.rs#L98-L100

It says "Ensure that we are reading at least one column from the table". Is there any reason or background for why we need to do this?

Dandandan · 2022-08-29T07:12:07Z

> I find this comment: https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/projection_push_down.rs#L98-L100
>
> It says "Ensure that we are reading at least one column from the table". Is there any reason or background for why we need to do this?

The reason is that several Arrow readers don't support empty projections. I added a PR for csv / json upstream apache/arrow-rs#2604

HaoYang670 · 2022-08-29T07:25:55Z

> The reason is that several Arrow readers don't support empty projections.

Thank you, @Dandandan. I could reproduce the error when reading CSV with an empty projection:

Arrow error: Invalid argument error: must either specify a row count or at least one column

If this depends on the support of arrow-rs, should we add a new label such as arrow-dependency for this issue?
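
The error message above reflects a general constraint: a batch with no columns has nothing to infer its length from, so the row count must be given explicitly. The following is a simplified, std-only model of that rule. `Batch` and `try_new_batch` are hypothetical names for illustration and are not the arrow-rs types or signatures:

```rust
/// Simplified stand-in for a record batch; not the arrow-rs type.
#[derive(Debug, PartialEq)]
struct Batch {
    columns: Vec<Vec<i64>>,
    num_rows: usize,
}

fn try_new_batch(columns: Vec<Vec<i64>>, row_count: Option<usize>) -> Result<Batch, String> {
    // The row count comes from the caller if given, otherwise from the first column.
    let num_rows = match (row_count, columns.first()) {
        (Some(n), _) => n,
        (None, Some(first)) => first.len(),
        (None, None) => {
            return Err("must either specify a row count or at least one column".to_string())
        }
    };
    // Every column must agree with the chosen row count.
    if columns.iter().any(|c| c.len() != num_rows) {
        return Err("all columns must have the same length as the row count".to_string());
    }
    Ok(Batch { columns, num_rows })
}
```

In this model, `try_new_batch(vec![], Some(5))` succeeds with five rows and zero columns, while `try_new_batch(vec![], None)` fails with the message quoted above.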

avantgardnerio · 2022-08-29T15:40:09Z

Might count(*) be as simple as a stats lookup in Parquet or DeltaLake? Reading a billion values just to count them seems sub-optimal, but that can definitely be addressed with a TODO and a future PR.

Dandandan · 2022-08-29T17:00:43Z

> Might count(*) be as simple as a stats lookup in Parquet or DeltaLake? Reading a billion values just to count them seems sub-optimal, but that can definitely be addressed with a TODO and a future PR.

You're right, for a schema provider that has statistics available, we can skip scanning.
AFAIK DataFusion already has support for using the statistics-provided count/min/max from the provider (e.g. delta lake).

You're right that we could also use the Parquet statistics for files instead of skipping reading the columns. I think we don't support this yet. At least for min/max statistics this avoids having to scan the entire column and compute the min/max.

alamb · 2022-08-31T13:39:59Z

I think @tustvold has been thinking of this in the context of the various parquet reader improvements

tustvold · 2022-08-31T14:44:50Z

I think there are two different optimisations being discussed here:

1. Skip interacting with the file based on catalog statistics if available
2. Remove projection "hack" and delegate to file readers

Parquet has supported the latter since apache/arrow-rs#1560, and CSV/JSON will support it once apache/arrow-rs#2604 is released. I think it should then be possible to remove the workaround, as it will no longer be necessary.

As to the former, I think it should be fairly straightforward to implement a physical optimiser pass that uses statistics to simplify counts into projections based on statistics if available. I had thought we had already implemented this tbh... 🤔 Edit: Yup AggregateStatistics
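
The statistics-based optimisation can be sketched roughly as follows. This is a simplified illustration, not the AggregateStatistics implementation: `TableStats`, `count_from_stats` and `count_rows` are hypothetical names, and the sketch assumes the provider reports whether its row count is exact:

```rust
/// Hypothetical, simplified table statistics.
struct TableStats {
    num_rows: Option<usize>,
    is_exact: bool,
}

/// Returns the COUNT(*) result if it can be answered from statistics alone.
fn count_from_stats(stats: &TableStats) -> Option<usize> {
    // Only exact statistics may replace a scan; estimates would give wrong results.
    if stats.is_exact {
        stats.num_rows
    } else {
        None
    }
}

/// Uses statistics when possible, otherwise falls back to `scan`.
fn count_rows(stats: &TableStats, scan: impl FnOnce() -> usize) -> usize {
    count_from_stats(stats).unwrap_or_else(scan)
}
```

The fallback path is where the empty projection matters: when statistics are missing or inexact, the scan still runs, and it should not have to read a column just to count rows.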

alamb · 2022-08-31T15:10:23Z

> Remove projection "hack" and delegate to file readers

Yes, this is what I was talking about. https://docs.rs/datafusion/latest/datafusion/physical_optimizer/aggregate_statistics/struct.AggregateStatistics.html is very cool 👍 (thanks @rdettai !)

Dandandan · 2022-09-07T06:12:39Z

Draft PR here:
#3382
It turns out it is a bit more complex than removing a line, as every Exec node should support producing records without columns / with an empty schema. I think the only thing we can do is hunt down every RecordBatch::try_new and adapt it for projections without columns 🤔

alamb · 2022-09-08T20:54:18Z

Maybe we can teach https://docs.rs/arrow/22.0.0/arrow/datatypes/struct.Schema.html#method.project and https://docs.rs/arrow/22.0.0/arrow/record_batch/struct.RecordBatch.html#method.project about empty projections?

Dandandan · 2022-09-09T05:18:11Z

> Maybe we can teach https://docs.rs/arrow/22.0.0/arrow/datatypes/struct.Schema.html#method.project and https://docs.rs/arrow/22.0.0/arrow/record_batch/struct.RecordBatch.html#method.project about empty projections?

Thanks, I did just that yesterday, for RecordBatch::project: apache/arrow-rs#2691. Schema::project already seems to handle empty projections just fine 🎉
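
As a rough illustration of what "handling empty projections" means for a batch projection, here is a std-only sketch. `SimpleBatch` and `project` are hypothetical stand-ins and do not mirror the arrow-rs implementation; the point is only that the row count has to be carried over explicitly rather than derived from the remaining columns:

```rust
/// Simplified stand-in for a record batch; not the arrow-rs type.
#[derive(Debug, Clone, PartialEq)]
struct SimpleBatch {
    columns: Vec<Vec<i64>>,
    num_rows: usize,
}

/// Keep only the columns at `indices`, carrying the row count along explicitly.
fn project(batch: &SimpleBatch, indices: &[usize]) -> Result<SimpleBatch, String> {
    let columns = indices
        .iter()
        .map(|&i| {
            batch
                .columns
                .get(i)
                .cloned()
                .ok_or_else(|| format!("column index {} out of bounds", i))
        })
        .collect::<Result<Vec<_>, _>>()?;
    // The row count is copied from the input, not derived from the output columns,
    // so an empty projection still reports the right number of rows.
    Ok(SimpleBatch { columns, num_rows: batch.num_rows })
}
```

Projecting a three-row batch with `&[]` yields a batch with no columns that still reports three rows, which is what a count aggregation above the scan needs.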

Dandandan · 2023-10-25T07:57:14Z

Closed by #7920

Dandandan added enhancement New feature or request performance labels Aug 21, 2022

Dandandan mentioned this issue Aug 29, 2022

Support empty projection in CSV, JSON readers apache/arrow-rs#2603

Closed

Dandandan added the waiting-on-upstream PR is waiting on an upstream dependency to be updated label Aug 29, 2022

Dandandan mentioned this issue Sep 6, 2022

Don't scan first column on empty projection #3382

Closed

Dandandan removed the waiting-on-upstream PR is waiting on an upstream dependency to be updated label Sep 7, 2022

Dandandan mentioned this issue Jul 27, 2023

Change empty projection to not add an extra column #7114

Closed

This was referenced Oct 11, 2023

Compare DataType based on memory size apache/arrow-rs#4919

Closed

Add small column on empty projection #7833

Merged

haohuaijin mentioned this issue Oct 24, 2023

support scan empty projection #7920

Merged

Dandandan closed this as completed Oct 25, 2023
