Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DataFrameColumn Apply and DropNulls methods #7123

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

asmirnov82
Copy link
Contributor

@asmirnov82 asmirnov82 commented Apr 9, 2024

Fixes #7107 as was asked in #6144 (comment)

Additionaly:

  1. Fix incorrect IsNumeric method
  2. Fix error FillNulls crashes with NotImplemented exception on DataFrame with VBufferDataFrameColumn
  3. Improve speed and redesign API (legacy API marked as obsolete), internal methods are rewritten to use new API
  4. Add extra unit tests

Reasons for refactoring:

  1. Legacy implementation of ApplyElementwise is written not only to apply function on each element in the column, but also to be used for fast columns iteration, that's why it has rowIndex as one of it's arguments. This duplication of responsibilities is very confusing and absolutely not straightforward. More over, it is slow, as instead of only reading values, ApplyElementwise writes function results into into memory, also it converts read only columns to editable (and that again involves memory copy). So in some circumstances memory can be excessively copied even twice.
  2. Legacy method requires to provide function, that takes into account null values - this is not userfriendly. For working with Nulls there are already FillNulls and DropNulls methods. New method applies user function only to valuable values, this also leads to better performance

@asmirnov82
Copy link
Contributor Author

cc @JakeRadMSFT

Copy link

codecov bot commented Apr 9, 2024

Codecov Report

Attention: Patch coverage is 74.92355% with 82 lines in your changes are missing coverage. Please review.

Project coverage is 68.50%. Comparing base (07eb681) to head (e2f4d4e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7123      +/-   ##
==========================================
+ Coverage   68.48%   68.50%   +0.01%     
==========================================
  Files        1262     1263       +1     
  Lines      255113   255262     +149     
  Branches    26364    26398      +34     
==========================================
+ Hits       174722   174862     +140     
- Misses      73682    73696      +14     
+ Partials     6709     6704       -5     
Flag Coverage Δ
Debug 68.50% <74.92%> (+0.01%) ⬆️
production 62.88% <61.86%> (+<0.01%) ⬆️
test 88.62% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/Microsoft.Data.Analysis/DataFrame.cs 85.81% <100.00%> (+0.39%) ⬆️
...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs 99.89% <100.00%> (+<0.01%) ⬆️
...ta.Analysis.Tests/PrimitiveDataFrameColumnTests.cs 99.74% <100.00%> (+0.03%) ⬆️
....Data.Analysis.Tests/StringDataFrameColumnTests.cs 100.00% <100.00%> (ø)
...icrosoft.Data.Analysis/PrimitiveColumnContainer.cs 86.52% <96.96%> (+0.43%) ⬆️
...icrosoft.Data.Analysis/PrimitiveDataFrameColumn.cs 73.05% <97.29%> (+0.61%) ⬆️
src/Microsoft.Data.Analysis/DataFrameColumn.cs 62.96% <0.00%> (-0.22%) ⬇️
...nalysis/DataFrameColumns/VBufferDataFrameColumn.cs 44.74% <30.30%> (+0.14%) ⬆️
...sis/DataFrameColumns/ArrowStringDataFrameColumn.cs 59.16% <18.75%> (-4.01%) ⬇️
...Analysis/DataFrameColumns/StringDataFrameColumn.cs 69.23% <45.28%> (-1.95%) ⬇️

... and 4 files with indirect coverage changes

@asmirnov82 asmirnov82 changed the title Implement DataFrameColumn Apply method and DropNulls methods Implement DataFrameColumn Apply and DropNulls methods Apr 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make the Apply method available to StringDataFrame column
1 participant