Allow developers to supply their own function to infer column data types from data while loading CSVs #7142

Open
sevenzees wants to merge 7 commits into base: main

@sevenzees sevenzees commented Apr 26, 2024

Fixes #7141

Currently, when you call LoadCsv or LoadCsvFromString without supplying a data type for each column, the code guesses the column types from the data in the CSV file. That is useful, but the default type inference only considers bool, float, DateTime, and string. Sometimes a user needs another data type, such as int, long, or double (see issue 6347 for an example where the float type chosen by default caused a problem), but does not know the structure of the data ahead of time.

I would like to be able to pass in my own type inference logic to override the default GuessKind implementation that the library uses right now. If no custom type inference function is provided to LoadCsv or LoadCsvFromString, the code should behave exactly as it does today.
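
As an illustration of the kind of inference logic a caller might want to supply, here is a minimal sketch. The class name `CustomTypeInference`, the method name `GuessColumnType`, and the shape (a sample of raw string values in, a `Type` out) are assumptions made for this example; they are not necessarily the signature this PR adds. The sketch prefers the narrowest type that fits every non-empty sample value, and it adds the int, long, and double cases that the default inference does not consider.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper, not part of Microsoft.Data.Analysis.
public static class CustomTypeInference
{
    // Guesses a column type from a sample of raw CSV cell values.
    // Candidates are tried from narrowest to widest; string is the fallback.
    public static Type GuessColumnType(IEnumerable<string> sampleValues)
    {
        // Treat null, empty, and whitespace-only cells as missing values.
        List<string> values = sampleValues
            .Where(v => !string.IsNullOrWhiteSpace(v))
            .ToList();

        // With no usable data there is nothing to infer from.
        if (values.Count == 0)
        {
            return typeof(string);
        }

        // Assumption: values use invariant-culture formatting.
        CultureInfo culture = CultureInfo.InvariantCulture;

        if (values.All(v => bool.TryParse(v, out _)))
        {
            return typeof(bool);
        }
        if (values.All(v => int.TryParse(v, NumberStyles.Integer, culture, out _)))
        {
            return typeof(int);
        }
        if (values.All(v => long.TryParse(v, NumberStyles.Integer, culture, out _)))
        {
            return typeof(long);
        }
        if (values.All(v => double.TryParse(v, NumberStyles.Float, culture, out _)))
        {
            return typeof(double);
        }
        if (values.All(v => DateTime.TryParse(v, culture, DateTimeStyles.None, out _)))
        {
            return typeof(DateTime);
        }

        return typeof(string);
    }
}
```

A function of this shape could be wrapped in whatever delegate type the loader ends up accepting. Because only the sampled rows are inspected, a value that appears later in the file and does not fit the guessed type would still need to be handled by the loader.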

@sevenzees (Author)

@dotnet-policy-service agree

codecov bot commented Apr 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.57%. Comparing base (72cfdf6) to head (b6cd225).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7142      +/-   ##
==========================================
+ Coverage   68.55%   68.57%   +0.01%     
==========================================
  Files        1259     1259              
  Lines      255844   255969     +125     
  Branches    26434    26452      +18     
==========================================
+ Hits       175392   175518     +126     
- Misses      73717    73718       +1     
+ Partials     6735     6733       -2     
| Flag | Coverage Δ |
|---|---|
| Debug | 68.57% <100.00%> (+0.01%) ⬆️ |
| production | 62.90% <100.00%> (+<0.01%) ⬆️ |
| test | 88.72% <100.00%> (+0.02%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| src/Microsoft.Data.Analysis/DataFrame.IO.cs | 83.50% <100.00%> (+0.38%) ⬆️ |
| ...Microsoft.Data.Analysis.Tests/DataFrame.IOTests.cs | 99.13% <100.00%> (+0.09%) ⬆️ |

... and 4 files with indirect coverage changes

@sevenzees sevenzees marked this pull request as draft April 27, 2024 22:34
@sevenzees sevenzees marked this pull request as ready for review April 28, 2024 00:14
@michaelgsharp (Member)

@JakeRadMSFT @luisquintanilla can I get you 2 to take a look at this please?

Successfully merging this pull request may close these issues.

Allow developers to supply their own function to infer column data types from data while loading CSVs
2 participants