Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Investigate how to reduce size of DataType or arrays without making the API too awkward #439

Open
jorgecarleitao opened this issue Sep 22, 2021 · 8 comments · May be fixed by #1469
Open
Labels
help wanted Extra attention is needed investigation Issues or PRs that are investigations. Prs may or may not be merged.

Comments

@jorgecarleitao
Copy link
Owner

jorgecarleitao commented Sep 22, 2021

use arrow2::array::*;
use arrow2::datatypes::DataType;

fn main() {
    println!("{}", std::mem::size_of::<DataType>());
    println!("{}", std::mem::size_of::<Int32Array>());
}

yields

64
128

My initial attempt for DataType that boxes all String and Vec below (50% reduction to 32 bytes).

It makes it less friendly to use, but maybe the solution is to offer pub fn DataType::timestamp(TimeUnit, Option<String>) -> DataType that boxes the timezone, to make it easier to use (and equivalent to the other types)?

Another thing to consider is that cloning a DataType is currently expensive due to the String, Vec and Box<Field>. An alternative is to Arc them, like we do for arrays.

pub enum IntervalUnit {
    YearMonth,
    DayTime,
    MonthDayNano,
}

pub enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

pub enum DataType {
    Null,
    Boolean,
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Float16,
    Float32,
    Float64,
    Timestamp(TimeUnit, Option<Box<String>>),
    Date32,
    Date64,
    Time32(TimeUnit),
    Time64(TimeUnit),
    Duration(TimeUnit),
    Interval(IntervalUnit),
    Binary,
    FixedSizeBinary(i32),
    LargeBinary,
    Utf8,
    LargeUtf8,
    List(Box<i32>),
    FixedSizeList(Box<i32>, i32),
    LargeList(Box<i32>),
    Struct(Vec<i32>),
    Union(Box<Vec<i32>>, Option<Box<Vec<i32>>>, bool),
    Dictionary(Box<DataType>, Box<DataType>),
    Decimal(usize, usize),
    Extension(Box<String>, Box<DataType>, Option<Box<String>>),
}


fn main() {
    println!("{}", std::mem::size_of::<DataType>());
}
@jorgecarleitao jorgecarleitao added help wanted Extra attention is needed investigation Issues or PRs that are investigations. Prs may or may not be merged. labels Sep 22, 2021
@jorgecarleitao jorgecarleitao changed the title Investigate how to reduce size of DataType or arrays. Investigate how to reduce size of DataType or arrays without making the API too poor Sep 22, 2021
@jorgecarleitao jorgecarleitao changed the title Investigate how to reduce size of DataType or arrays without making the API too poor Investigate how to reduce size of DataType or arrays without making the API too awkward Sep 22, 2021
@houqp
Copy link
Collaborator

houqp commented Sep 23, 2021

Do we need to consider or measure the overhead for accessing stack v.s. heap data?

As for expensive clone, I wonder if it would work better if we let users put the reference counted wrapper over the whole DataType enum variable wherever they need to avoid the full clone.

@jorgecarleitao
Copy link
Owner Author

That is a good point. We could make our arrays have an Arc<DataType> instead of a DataType and most of the problems would go away.

@Dandandan
Copy link
Collaborator

One small other optimization is s Box<[T]> for Vec<T> and Box<str> instead of String.

Arc should be fine too. Idealy they would only be there for the expensive to clone arms?

@ritchie46
Copy link
Collaborator

Arc should be fine too. Idealy they would only be there for the expensive to clone arms?

I like that idea better. Then the rest of the arms are still copy.

@houqp
Copy link
Collaborator

houqp commented Sep 23, 2021

Good call, I agree adding Arc on individual arm would be more efficient than Arc<DataType>.

@jorgecarleitao
Copy link
Owner Author

what do you both mean with "individual arm"?

@ritchie46
Copy link
Collaborator

what do you both mean with "individual arm"?

Arc-ing only the expensive non copy types:

enum DataType {
    Int8,
    Int16
    ...
    Timestamp(Arc<(TimeUnit, Option<Box<String>>)>),
    Extension(Arc<(Box<String>, Box<DataType>, Option<Box<String>>)>),

}

@teh-cmc
Copy link

teh-cmc commented Apr 13, 2023

FYI I've started digging into this.

I'm still not entirely sure which route I'm going to take (wrapping the entire type, or individual arms, or defining a new wrapper type, or something else entirely...).
DataType is being matched on and destructured in a lot of places so I'll try and find the path of least resistance, pretty much.

Anyways, just a heads up just in case 🙃

@teh-cmc teh-cmc linked a pull request Apr 17, 2023 that will close this issue
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
help wanted Extra attention is needed investigation Issues or PRs that are investigations. Prs may or may not be merged.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants