Allow developers to supply their own function to infer column data types from data while loading CSVs #7142

sevenzees · 2024-04-26T21:50:38Z

Currently, when you call LoadCsv or LoadCsvFromString without supplying a data type for each column, the code guesses the data types from the data in the CSV file. This is useful, but the default type inference only considers bool, float, DateTime, and string as column types. Sometimes the user needs another data type, such as int, long, or double (see issue #6347 for an example where the float type chosen by default caused a problem), but does not know the structure of the data ahead of time.

I would like to be able to pass in my own type inference logic to override the default GuessKind implementation currently in the library. If no custom type-guessing function is passed to LoadCsv or LoadCsvFromString, the code should behave exactly as it does today.
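To make the request concrete, here is a minimal, language-neutral sketch of the pattern, written in Python for brevity. It is not the library's code or this PR's implementation. Every name in it (`default_guess`, `guess_with_integers`, `infer_column_types`) is hypothetical, the type names are returned as plain strings, and the order of checks in `default_guess` is an assumption that only mirrors the four types listed above. The real hook would be a C# delegate whose exact signature is not stated in this thread.

```python
import re
from datetime import datetime
from typing import Callable, Iterable, List, Optional, Sequence

_INT_RE = re.compile(r"[+-]?[0-9]+")
INT32_RANGE = (-2**31, 2**31 - 1)
INT64_RANGE = (-2**63, 2**63 - 1)


def _non_empty(values: Iterable[Optional[str]]) -> List[str]:
    # Empty or missing cells carry no type information, so they are skipped.
    return [v.strip() for v in values if v and v.strip()]


def _is_bool(text: str) -> bool:
    return text.lower() in ("true", "false")


def _is_float(text: str) -> bool:
    try:
        float(text)
    except ValueError:
        return False
    return True


def _is_date(text: str) -> bool:
    try:
        datetime.fromisoformat(text)
    except ValueError:
        return False
    return True


def _fits(text: str, bounds) -> bool:
    # True when text is a plain integer literal inside the given range.
    low, high = bounds
    return bool(_INT_RE.fullmatch(text)) and low <= int(text) <= high


def default_guess(values: Iterable[Optional[str]]) -> str:
    # Stand-in for the default inference: bool, float, DateTime, string only.
    cells = _non_empty(values)
    if not cells:
        return "string"
    if all(_is_bool(c) for c in cells):
        return "bool"
    if all(_is_float(c) for c in cells):
        return "float"
    if all(_is_date(c) for c in cells):
        return "DateTime"
    return "string"


def guess_with_integers(values: Iterable[Optional[str]]) -> str:
    # Hypothetical custom inference: prefer int, then long, then double.
    cells = _non_empty(values)
    if not cells:
        return "string"
    if all(_fits(c, INT32_RANGE) for c in cells):
        return "int"
    if all(_fits(c, INT64_RANGE) for c in cells):
        return "long"
    if all(_is_float(c) for c in cells):
        return "double"
    return default_guess(cells)


def infer_column_types(
    rows: Sequence[Sequence[Optional[str]]],
    guess: Optional[Callable[[Iterable[Optional[str]]], str]] = None,
) -> List[str]:
    # With no custom function supplied, behaviour falls back to the default.
    # Assumes every row has the same number of cells.
    guess = guess or default_guess
    return [guess(column) for column in zip(*rows)]
```

Calling `infer_column_types(rows)` keeps the default behaviour, while `infer_column_types(rows, guess_with_integers)` swaps in the custom logic for every column. The caller does not need to know the column layout in advance.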

sevenzees · 2024-04-26T21:51:24Z

@dotnet-policy-service agree

codecov · 2024-04-26T23:17:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.57%. Comparing base (72cfdf6) to head (b6cd225).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7142      +/-   ##
==========================================
+ Coverage   68.55%   68.57%   +0.01%     
==========================================
  Files        1259     1259              
  Lines      255844   255969     +125     
  Branches    26434    26452      +18     
==========================================
+ Hits       175392   175518     +126     
- Misses      73717    73718       +1     
+ Partials     6735     6733       -2

Flag	Coverage Δ
Debug	`68.57% <100.00%> (+0.01%)`	⬆️
production	`62.90% <100.00%> (+<0.01%)`	⬆️
test	`88.72% <100.00%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown.

Files	Coverage Δ
src/Microsoft.Data.Analysis/DataFrame.IO.cs	`83.50% <100.00%> (+0.38%)`	⬆️
...Microsoft.Data.Analysis.Tests/DataFrame.IOTests.cs	`99.13% <100.00%> (+0.09%)`	⬆️

... and 4 files with indirect coverage changes

michaelgsharp · 2024-05-08T19:56:36Z

@JakeRadMSFT @luisquintanilla can I get you 2 to take a look at this please?

sevenzees added 2 commits April 26, 2024 14:26

Allow developers to supply their own GuessType function

5f40c68

Add a test for using a custom GuessType function.

7cb8452

dotnet-policy-service bot added the community-contribution label Apr 26, 2024

sevenzees added 4 commits April 27, 2024 15:04

Fix typo in string resource identifier

5a2ee30

Convert 0-based line number to 1-based line number in error message.

38545b7

Add test that FormatException is thrown when one row in a data frame has less columns than the others

f053926

Revert "Fix typo in string resource identifier"

338a72f

This reverts commit 5a2ee30. # Conflicts: # src/Microsoft.Data.Analysis/DataFrame.IO.cs

sevenzees marked this pull request as draft April 27, 2024 22:34

sevenzees marked this pull request as ready for review April 28, 2024 00:14

Add a column filled with null to the test data frame to improve test coverage.

b6cd225
