Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame doesn't decode boolean arrays correctly from Arrow #7115

Open
johnnyg opened this issue Apr 3, 2024 · 6 comments
Open

DataFrame doesn't decode boolean arrays correctly from Arrow #7115

johnnyg opened this issue Apr 3, 2024 · 6 comments
Labels
untriaged New issue has not been triaged

Comments

@johnnyg
Copy link

johnnyg commented Apr 3, 2024

System Information (please complete the following information):

  • OS & Version: Windows 10
  • ML.NET Version: 0.21.1
  • .NET Version: .NET 5.0

Describe the bug
Creating a dataframe from an arrow record batch where a column is a boolean array produces incorrect results (and occasionally even throws exceptions).

To Reproduce
Run:

public void DataFrameDecodeBooleans()
{
    BooleanArray boolArray = new BooleanArray.Builder().AppendNull().Append(false).Append(true).Build();
    Field boolField = new Field("col", BooleanType.Default, true);
    Schema schema = new Schema(new[] { boolField }, null);
    RecordBatch batch = new RecordBatch(schema, new IArrowArray[] { boolArray }, boolArray.Length);
    DataFrame df = DataFrame.FromArrowRecordBatch(batch);
    DataFrameColumn boolCol = df["col"];
    Assert.Equal(boolArray.Length, boolCol.Length);
    for (int row = 0; row < boolArray.Length; row++)
    {
        Assert.Equal(boolArray.GetValue(row), boolCol[row]);
    }
}

Expected behavior
Above test passes

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged label Apr 3, 2024
@asmirnov82
Copy link
Contributor

@ericstj @luisquintanilla @michaelgsharp I investigated the defect and it appeared that it is related to the same issue as was discussed in #7094.
Since version 1.0 Apache Arrow uses bitmap for storing boolean values (https://issues.apache.org/jira/browse/ARROW-8788). In the DataFrame BooleanDataFrameColumn still inherits from PrimitiveDataFrameColumn and uses 1 byte to store each boolean.
During conversion DataFrame just copies memory. By default Apache Arrow allocates memory by 64 byte blocks, so for RecordBatches with less than 64 rows there is enough memory in the internal buffer and only incorrect values are fetched during access to converted column. But for RecordBatches with more than 64 rows - ArgumentOutOfRangeException is thrown on attempt to convert internal memroy buffer to the ReadOnlySpan with higher length.

So I see 2 possible solutions:

  1. Introduce different implementation for BooleanDataFrameColumn, that uses 1 bit for each boolean (this will require to change API and make PrimitiveDataFrameColumn class abstract for preventing users to create instances of PrimitiveDataFrameColumn)
  2. Use extra memory allocation and values convertion while DataFrame <-> Apache Arrow interop operations

I may take this task into development. Please suggest what approach is preferred?

@ericstj
Copy link
Member

ericstj commented Apr 4, 2024

It sounds like option 1 is preferable. Is the only side-effect a breaking change that makes PrimitiveDataFrameColumn abstract? Do we know if anyone would actually be trying to create instances of it? cc @eerhardt

@eerhardt
Copy link
Member

eerhardt commented Apr 4, 2024

Do we know if anyone would actually be trying to create instances of it?

Looks like here is one.
https://github.com/microsoft/sql-server-language-extensions/blob/e51909ed3748e5bccae9dcee31f8e4b4a903c7ff/language-extensions/dotnet-core-CSharp/src/managed/CSharpInputDataSet.cs#L152

We haven't officially "1.0"d this library yet right? Since BooleanDataFrameColumn is basically completely broken in this mode, what about just changing BooleanDataFrameColumn to no longer inherit from PrimitiveDataFrameColumn<bool> and manage its own storage, similar to how strings are handled?

Or introducing a new PrimitiveDataFrameColumnBase<T> class that both BooleanDataFrameColumn and PrimitiveDataFrameColumn<T> inherit from? The base type would have the operations defined on it, so people could still use it in a generic fashion, but it wouldn't force the storage to be in terms of T.

We had to do something similar in Arrow where

  • BooleanArray : Array
  • PrimitiveArray<T> : Array
  • Int32Array : PrimitiveArray<int>

@asmirnov82
Copy link
Contributor

asmirnov82 commented Apr 5, 2024

Is the only side-effect a breaking change that makes PrimitiveDataFrameColumn abstract?

Another side effect is performance. New implementation working with individual bits will be slower, what I think is fine taking into account 8 times benefit in memory usage.

Except cases, when Boolean column is used in API, for example in filtering and cloning, like in DataFrame public DataFrame Filter(PrimitiveDataFrameColumn<bool> filter) and in DataFrameColumn public DataFrameColumn Clone(DataFrameColumn mapIndices = null, bool invertMapIndices = false, long numberOfNullsToAppend = 0)

I think the next step can be rethinking of this API as it's already quite slow (due to BooleanDataFrameColumn uses bit operations when accessing validitybitmap) (see #6164) and memory consuming. Also current implementation is not straightforward and incorrectly works with null values (#6820).

For example, new API can be implemented as extension over IDataView: public static IDataView Filter<TSource>(this IDataView view, string columnName, Func<TSource, bool> func)

So, instead of

var boolFilter = df["timestamp"].ElementwiseGreaterThanOrEqual(unixStartTime);
var hourlydata = df.Filter(boolFilter);
var boolFilter2 = hourlydata["timestamp"].ElementwiseLessThan(unixEndTime);
hourlydata = hourlydata.Filter(boolFilter2);

more simple code can be used:

var hourlydata = df.Filter("timestamp", x => x >= unixStartTime && x < unixEndTime).ToDataFrame();

@JakeRadMSFT
Copy link
Contributor

JakeRadMSFT commented Apr 5, 2024

Isn't the way we handle nulls basically doing the 1bit for each Boolean thing?

Would we be able to use the new BooleanDataFrameColumn to be our implementation for our nulls?

@asmirnov82
Copy link
Contributor

asmirnov82 commented Apr 5, 2024

Jake, you are right, currently we store null information in validity buffer using 1 bit per value - that is the reason why using BooleanDataFrameColumn takes more time, that just using the list of boolean values (as we also have to deal with this validity buffer) - and, to my mind, should be avoided in filtering API.
New BooleanDataFrame column will have 2 bitmaps - first for validity and second for actual values, similar way as Apache Arrow BooleanArray.

Agree, that extracting some code from PrimitiveColumnContainer<T> to works with NullBitMapBuffers into separate Bitmap class instead of using BitUtility directly may help to reuse some logic and avoid unnecessary code duplication

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
untriaged New issue has not been triaged
Projects
None yet
Development

No branches or pull requests

5 participants