Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Apache.Arrow.Types.Decimal128Type #7094

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

piyushdubey
Copy link

Fixes #7082

@piyushdubey
Copy link
Author

piyushdubey commented Mar 21, 2024

@asmirnov82 Can you please help review this?
Additionally, I'd love your comments on what the difference between Decimal128 and Decimal256 Arrow type handling would be in a DataFrame?

Copy link

codecov bot commented Mar 21, 2024

Codecov Report

Attention: Patch coverage is 0% with 9 lines in your changes are missing coverage. Please review.

Project coverage is 68.47%. Comparing base (8b483f4) to head (b00af0e).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7094      +/-   ##
==========================================
- Coverage   68.82%   68.47%   -0.35%     
==========================================
  Files        1255     1262       +7     
  Lines      250358   254272    +3914     
  Branches    25550    26237     +687     
==========================================
+ Hits       172310   174117    +1807     
- Misses      71438    73473    +2035     
- Partials     6610     6682      +72     
Flag Coverage Δ
Debug 68.47% <0.00%> (-0.35%) ⬇️
production 62.87% <0.00%> (-0.39%) ⬇️
test 88.56% <ø> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/Microsoft.Data.Analysis/DataFrame.Arrow.cs 86.18% <0.00%> (-5.43%) ⬇️

... and 64 files with indirect coverage changes

@@ -124,6 +124,15 @@ private static void AppendDataFrameColumnFromArrowArray(Field field, IArrowArray
}
break;
case ArrowTypeId.Decimal128:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to include a test for this?

@asmirnov82
Copy link
Contributor

@asmirnov82 Can you please help review this? Additionally, I'd love your comments on what the difference between Decimal128 and Decimal256 Arrow type handling would be in a DataFrame?

As far as I understand initialy Data.Analysis.DataFrame was designed to support Apache.Arrow format and allow constracting columns from Arrow array without memory copying (and least for reading, on any attempt to edit data – Arrow buffers are copied anyway).

But at some point new columns were added to the DataFrame that didn’t follow this rule. For example StringDataFrameColumn, DateTimeDataFrameColumn and DecimalDataFrameColumn. These columns have different from Arrow format of underlying data.

For Arrow strings DataFrame provides alternative implementation - ArrowStringDataFrameColumn (but it’s immutable).
DateTimeDataFrameColumn converts Arrow data to .Net DateTime struct (the same approch that you use in this PR).
As on any attempt to edit data coping is happen anyway I don't see any drawback in this.

One possible issue that we have to pay attention is that Arrow Decimal128 uses 4 bytes for store data and user is able to configure precision and scale. As far as I know .Net Decimal has no concept of precision. It uses 3 DWORDs (12 bytes) to store the actual data, and therefore has a maximum precision of 28 (not 34). So converting Arrow Decimal128 to .Net decimal may happen with information loss

As according to Decimal256 implementation – I see 2 possibilities: using SqlDecimal from System.Data.SqlTypes namespace or waiting support of Decimal256 by .Net framework. There is a proposal for supporting new IEEE compliant decimal types - dotnet/runtime#81376

@luisquintanilla @JakeRadMSFT @ericstj could you please provide your thoughts?

decimal128DataFrameColumn[i] = arrowDecimal128Array.GetValue(i);
}
}
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code allows to convert from Arrow to DecimalDataFrameColumn. Could you please also add backward conversion from DecimalDataFrameColumn to Arrow? It's done in PrimitiveDataFrameColumn.ToArrowArray(long startIndex, int numberOfRows) method

@luisquintanilla
Copy link
Contributor

As according to Decimal256 implementation – I see 2 possibilities: using SqlDecimal from System.Data.SqlTypes namespace or waiting support of Decimal256 by .Net framework. There is a proposal for supporting new IEEE compliant decimal types - dotnet/runtime#81376

I don't think that proposal is something that's being actively worked on at this time, so probably not something that would happen in the .NET 9 timeframe.

@ericstj
Copy link
Member

ericstj commented Mar 22, 2024

But at some point new columns were added to the DataFrame that didn’t follow this rule. For example StringDataFrameColumn, DateTimeDataFrameColumn and DecimalDataFrameColumn. These columns have different from Arrow format of underlying data.
So converting Arrow Decimal128 to .Net decimal may happen with information loss

That's worthwhile to call out in the docs. We should update the API docs here:

/// Wraps a <see cref="DataFrame"/> around an Arrow <see cref="RecordBatch"/> without copying data

/// Returns an <see cref="IEnumerable{RecordBatch}"/> mostly without copying data

FWIW it does look like Arrow has precedent for this conversion, though also offers SqlDecimal. Neither can be done with zero-copy, but the latter wouldn't lose precision (I think): https://github.com/apache/arrow/blob/2babda0ba22740c092166b5c5d5d7aa9b4797953/csharp/src/Apache.Arrow/Arrays/Decimal128Array.cs#L42-L63

Another option would be to expose some types unique to this assembly that's meant to represent this data without copying. Not saying we should - but that's a possibility. Probably those types would be better to live in the Arrow project if at all.

@tannergooding and @eerhardt might have more thoughts on this.

@tannergooding
Copy link
Member

If there's need for the IEEE 754 Decimal128 type purely as an interchange type, that is itself feasible to add in .NET 9

It just isn't in scope to add the entirety of the approved API surface area, including operators.

@CurtHagenlocher
Copy link

FWIW it does look like Arrow has precedent for this conversion, though also offers SqlDecimal. Neither can be done with zero-copy, but the latter wouldn't lose precision (I think): https://github.com/apache/arrow/blob/2babda0ba22740c092166b5c5d5d7aa9b4797953/csharp/src/Apache.Arrow/Arrays/Decimal128Array.cs#L42-L63

Yes, SqlDecimal wouldn't lose precision for Arrow's decimal128, but that's at the cost of occupying 160 bits instead of 128 . The IEEE Decimal128 isn't wide enough to hold the 38 digits of decimal precision supported by Arrow, though it is (as noted) better than the CLR decimal. (I think there were some backwards-compatibility considerations in CLR decimal for OLEDB.)

There's an interesting tension between traditional database representations of decimal values and the requirements of general-purpose programming languages. In a database, you're always operating in the context of a particular column so you can factor out properties like precision and scale to optimize storage. In a programming language, individual values need to carry that context with them because most type systems aren't designed to support that kind of parameterization. So you either end up needing extra bytes (like SqlDecimal) or potentially losing precision (like IEEE Decimal128) when working with database decimals. And because most database decimal values don't actually need 38 digits of precision, the latter often ends up looking more attractive.

I like to think that if you squint a little, you can almost see the historical split between FORTRAN and COBOL (and their associated use cases) in this topic.

One nice thing about incorporating precision and scale into the type is that the values are automatically normalized. This is super important in a database representation because comparisons and simple computation (like sums) are going to be the majority of operations performed on these values.

Anyway, sorry for the digression. Representation of numbers in a computer is a fun topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for Apache.Arrow.Types.Decimal128Type
6 participants