Add ability to merge using multiple columns in JOIN condition #5838

Merged
merged 27 commits into from Sep 28, 2021

asmirnov82

Fixes #5657

PR contains 3 main changes (these changes were done independently, step by step):

  1. Avoid code duplication in the Merge implementation (commit dc8daf0). To achieve that, I created a new static function that merges two dataframes (retained and supplementary). All merge implementations (left, right, inner and outer) use this function to perform the merge.

  2. Change the templated (generic) DataFrame Merge(...) function to a general (non-templated) DataFrame Merge(...) (commit a4735ce). To achieve this, I added a method that finds the occurrences of one column's elements in another column, and placed it inside the column (the column is already templated). The base column implementation provides two methods. The first is public, non-templated and abstract (used like an interface):

public abstract Dictionary<long, ICollection<long>> GetGroupedOccurrences(DataFrameColumn other, out HashSet<long> otherColumnNullIndices);

The second provides the real implementation and is templated:

protected Dictionary<long, ICollection<long>> GetGroupedOccurrences<TKey>(DataFrameColumn other, out HashSet<long> otherColumnNullIndices)

Each column implements the abstract method by calling the protected method and passing the correct type as the generic argument.

This step was necessary to allow grouping by multiple columns that may have different data types.
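The split between a public abstract method and a protected generic helper can be sketched roughly as follows. This is only an illustration: `SketchColumn` and `SketchColumn<T>` are hypothetical stand-ins, not the actual Microsoft.Data.Analysis column classes, and the body of the generic helper is an assumption about what "grouped occurrences" means (rows of the other column mapped to the matching rows of this column).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the base column type.
public abstract class SketchColumn
{
    public abstract long Length { get; }
    public abstract object GetValue(long rowIndex);

    // Non-generic entry point: callers do not need to know the element type.
    public abstract Dictionary<long, ICollection<long>> GetGroupedOccurrences(
        SketchColumn other, out HashSet<long> otherColumnNullIndices);

    // Generic worker (assumed behavior): maps each row index of `other`
    // to the row indices of this column that hold an equal value.
    protected Dictionary<long, ICollection<long>> GetGroupedOccurrences<TKey>(
        SketchColumn other, out HashSet<long> otherColumnNullIndices)
    {
        // Index this column: value -> rows holding that value.
        var rowsByValue = new Dictionary<TKey, List<long>>();
        for (long i = 0; i < Length; i++)
        {
            if (GetValue(i) is TKey key)
            {
                if (!rowsByValue.TryGetValue(key, out var rows))
                {
                    rows = new List<long>();
                    rowsByValue[key] = rows;
                }
                rows.Add(i);
            }
        }

        var result = new Dictionary<long, ICollection<long>>();
        otherColumnNullIndices = new HashSet<long>();
        for (long j = 0; j < other.Length; j++)
        {
            object value = other.GetValue(j);
            if (value is null)
                otherColumnNullIndices.Add(j);   // nulls are reported separately
            else if (value is TKey key && rowsByValue.TryGetValue(key, out var rows))
                result[j] = rows;
        }
        return result;
    }
}

// Hypothetical typed column: it only forwards its own element type.
public sealed class SketchColumn<T> : SketchColumn
{
    private readonly T[] _values;
    public SketchColumn(params T[] values) => _values = values;
    public override long Length => _values.Length;
    public override object GetValue(long rowIndex) => _values[rowIndex];

    public override Dictionary<long, ICollection<long>> GetGroupedOccurrences(
        SketchColumn other, out HashSet<long> otherColumnNullIndices)
        => GetGroupedOccurrences<T>(other, out otherColumnNullIndices);
}
```

With this shape, code that holds only base-class references can call `GetGroupedOccurrences` on columns of different element types, which is what a non-generic Merge needs.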

  3. Actual implementation of merging by multiple columns (all other commits): find the occurrences for each column in the JOIN list, then take the intersection of these occurrences.
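The intersection step can be illustrated with a minimal sketch. It assumes that each join column has already produced a dictionary from a row index in one dataframe to the matching row indices in the other; the class and method names (`MultiColumnJoinSketch`, `IntersectOccurrences`) are hypothetical and are not taken from the PR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MultiColumnJoinSketch
{
    // Each list element is the result for one join column:
    // row index in one frame -> matching row indices in the other frame.
    public static Dictionary<long, ICollection<long>> IntersectOccurrences(
        IReadOnlyList<Dictionary<long, ICollection<long>>> perColumnOccurrences)
    {
        if (perColumnOccurrences.Count == 0)
            return new Dictionary<long, ICollection<long>>();

        // Start from the first column's matches and shrink them column by column.
        // With a single column, the input dictionary is returned as is.
        var current = perColumnOccurrences[0];
        for (int c = 1; c < perColumnOccurrences.Count; c++)
        {
            var next = perColumnOccurrences[c];
            var shrunk = new Dictionary<long, ICollection<long>>();
            foreach (var kvp in current)
            {
                // A row survives only if it also has matches on this column.
                if (!next.TryGetValue(kvp.Key, out var candidates))
                    continue;

                var lookup = new HashSet<long>(candidates);
                long[] common = kvp.Value.Where(i => lookup.Contains(i)).ToArray();
                if (common.Length > 0)
                    shrunk[kvp.Key] = common;
            }
            current = shrunk;
        }
        return current;
    }
}
```

A row pair is kept only when it matches on every join column, which corresponds to an equality condition on all columns combined with AND.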

Alexey Smirnov added 25 commits April 27, 2021 01:12
Merge latest changes from dotnet/machinelearning
Back merge from base repository
Backmerge latest changes from dotnet/machinelearning
…://github.com/asmirnov82/machinelearning into feature/5657_dataframe_merge_multiple_columns

# Conflicts:
#	src/Microsoft.Data.Analysis/DataFrame.cs
#	src/Microsoft.Data.Analysis/GroupBy.cs
#	test/Microsoft.Data.Analysis.Tests/DataFrameGroupByTests.cs
…ltiple_columns

Feature/5657 dataframe merge multiple columns

codecov bot commented Jun 8, 2021

Codecov Report

Merging #5838 (2f22cd7) into main (3bf8cba) will increase coverage by 0.07%.
The diff coverage is 93.29%.

❗ Current head 2f22cd7 differs from pull request most recent head 3396554. Consider uploading reports for the commit 3396554 to get more accurate results

@@            Coverage Diff             @@
##             main    #5838      +/-   ##
==========================================
+ Coverage   68.19%   68.26%   +0.07%     
==========================================
  Files        1142     1142              
  Lines      242367   242534     +167     
  Branches    25355    25378      +23     
==========================================
+ Hits       165282   165574     +292     
+ Misses      70406    70279     -127     
- Partials     6679     6681       +2     
Flag Coverage Δ
Debug 68.26% <93.29%> (+0.07%) ⬆️
production 62.94% <84.78%> (+0.06%) ⬆️
test 88.61% <98.63%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...rosoft.Data.Analysis/ArrowStringDataFrameColumn.cs 63.54% <0.00%> (-0.47%) ⬇️
src/Microsoft.Data.Analysis/Strings.Designer.cs 43.26% <0.00%> (-0.95%) ⬇️
...osoft.Data.Analysis.Tests/DataFrameGroupByTests.cs 100.00% <ø> (ø)
src/Microsoft.Data.Analysis/DataFrameColumn.cs 63.01% <86.66%> (+1.41%) ⬆️
src/Microsoft.Data.Analysis/DataFrame.Join.cs 94.50% <88.28%> (-0.23%) ⬇️
...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs 99.21% <98.63%> (-0.22%) ⬇️
...icrosoft.Data.Analysis/PrimitiveDataFrameColumn.cs 79.15% <100.00%> (+0.11%) ⬆️
...c/Microsoft.Data.Analysis/StringDataFrameColumn.cs 70.88% <100.00%> (+0.22%) ⬆️
....ML.AutoML/PipelineSuggesters/PipelineSuggester.cs 78.74% <0.00%> (-6.30%) ⬇️
src/Microsoft.ML.AutoML/Sweepers/Parameters.cs 84.74% <0.00%> (-0.85%) ⬇️
... and 10 more

{
foreach (long nullIndex in smallerDataFrameColumnNullIndices)
var newValue = kvp.Value.Where(i => occurrences[kvp.Key].Contains(i)).ToArray();

Q for next round of review: Is LINQ necessary here?

@pgovind pgovind Sep 17, 2021

@asmirnov82 : I think you can get rid of the ToArray here right? newValue will then be an IEnumerable<long> and you can still call newValue.Any() in the next line.

Edit: Never mind, I see that shrinkedOccurences needs a value of ICollection
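The type constraint behind this exchange can be shown with a small sketch. `ToArraySketch.CommonRows` is a hypothetical helper, not code from the PR: `Where` returns a lazy `IEnumerable<long>`, which cannot be stored where an `ICollection<long>` is required unless it is materialized first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToArraySketch
{
    public static ICollection<long> CommonRows(ICollection<long> a, ICollection<long> b)
    {
        // Lazy query: nothing is evaluated yet.
        IEnumerable<long> lazy = a.Where(i => b.Contains(i));

        // `return lazy;` would not compile here, because there is no implicit
        // conversion from IEnumerable<long> to ICollection<long>.
        // ToArray() materializes the query; long[] implements ICollection<long>.
        return lazy.ToArray();
    }
}
```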

@pgovind pgovind left a comment

I left a few comments. They are mostly for myself right now. Please don't answer them yet. I will do another round of reviewing these changes tomorrow.

Overall this PR is looking quite good. I like the changes, great job! Apologies for the late review here. I was super busy with .NET 6 stuff that I didn't have time to focus on DataFrame for 2 months. I'm relatively free all of September though, so I can review changes as they come in for DataFrame.

@pgovind pgovind left a comment

Ok, done with the 2nd review. LGTM. Thanks for the work here @asmirnov82. This is a great addition to DataFrame!

pgovind commented Sep 17, 2021

I looked at the test failures and it looks unrelated. Do you mind rebasing on the latest main branch? That might fix the CI failures and let us merge this PR

Backmerge latest changes from original repo
@asmirnov82

I looked at the test failures and it looks unrelated. Do you mind rebasing on the latest main branch? That might fix the CI failures and let us merge this PR

I merged the latest main into my repo, but this didn't help. Several tests failed due to infrastructure issues, for example one that was unable to load a NuGet package because of a connection timeout (".dotnet\sdk\6.0.100-preview.3.21202.5\NuGet.targets(131,5): error : (NETCORE_ENGINEERING_TELEMETRY=Restore) Failed to download package 'runtime.native.System.4.3.0' from 'https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/45bacae2-5efb-47c8-91e5-8ec20c22b4f8/nuget/v3/flat2/runtime.native.system/4.3.0/runtime.native.system.4.3.0.nupkg'.
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (c78vsblobprodcus322.vsblob.vsassets.io:443)
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)

pgovind commented Sep 22, 2021

@michaelgsharp : I'm seeing this error in CI:

Starting test: Microsoft.ML.Tests.ImageTests.TestGreyscaleTransformImages
    Microsoft.ML.Tests.ImageTests.TestBackAndForthConversionWithoutAlphaNoInterleave [FAIL]
      System.ArgumentException : Parameter is not valid.
      Stack Trace:
           at System.Drawing.Bitmap.UnlockBits(BitmapData bitmapdata)
        /__w/1/s/src/Microsoft.ML.ImageAnalytics/VectorToImageTransform.cs(421,0): at Microsoft.ML.Transforms.Image.VectorToImageConvertingTransformer.Mapper.<>c__DisplayClass5_0`1.<GetterFromType>b__0(Bitmap& dst)
        /__w/1/s/test/Microsoft.ML.Tests/ImagesTests.cs(573,0): at Microsoft.ML.Tests.ImageTests.TestBackAndForthConversionWithoutAlphaNoInterleave()
      Output:
        Test TestBackAndForthConversionWithoutAlphaNoInterleave: aborted: passed
Finished test: Microsoft.ML.Tests.ImageTests.TestGreyscaleTransformImages with memory usage 172,474,368.00
Starting test: Microsoft.ML.Tests.ImageTests.TestBackAndForthConversionWithAlphaInterleave
Finished test: Microsoft.ML.Tests.ImageTests.TestBackAndForthConversionWithAlphaInterleave with memory usage 176,336,896.00
Starting test: Microsoft.ML.Tests.ImageTests.TestSaveImages
Finished test: Microsoft.ML.Tests.ImageTests.TestSaveImages with memory usage 177,635,328.00
  Finished:    Microsoft.ML.Tests

Any idea if this is being tracked somewhere?

@michaelgsharp

@pgovind it's not yet, but I did bring this up yesterday. There are some tests that are flaky, and this is one of them. It doesn't always happen, and I believe it's only with ARM stuff, right? It would be good to make an issue about flaky tests.

pgovind commented Sep 28, 2021

Unrelated CI failure. Merging

@pgovind pgovind merged commit d13045f into dotnet:main Sep 28, 2021
@dotnet dotnet locked as resolved and limited conversation to collaborators Mar 17, 2022
Successfully merging this pull request may close these issues.

DataFrame Merge Multiple Columns