Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add user/custom metadata in ListObjectsAsync api's return #1043

Closed

Conversation

ebozduman
Copy link
Collaborator

This PR adds user metadata info into the list of object data in the return value of ListObjectsAsync sdk command.t
This change also required and touched different areas of the code, hence the changes included some refactoring of the old code.

@ebozduman ebozduman changed the title WIP: Code refactoring and adding user metadata info when listing objects WIP: Code refactoring and user metadata info when listing objects Apr 3, 2024
@ebozduman ebozduman changed the title WIP: Code refactoring and user metadata info when listing objects WIP: Add user/custom metadata in ListObjectsAsync api's return Apr 8, 2024
Copy link
Contributor

@ramondeklein ramondeklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't complete the full review, because there are some serious issues. There are a few key-points that need attention (IMHO):

  1. Use IObservable<Item> for the ListObjectsAsync is error-prone. In modern .NET there is a much easier way to deal with this and that is to use IAsyncEnumerable. The use of IObservable<T> interface is highly opiniated, but I'm not a big fan. Because most other API methods are already task-based, using IAsyncEnumerable would be more logical. Of course, this would be a breaking change.
  2. The "test" is actually a large Run function that does all kind of things, but it would be better to use something link NUnit or XUnit to run the tests (just like in Go).
  3. There are some serious bugs that need to be addressed (like listing objects).

This PR also contains a lot of other changes, so it's a pretty big PR. I would be in favor of splitting it into multiple PRs. I could work on adding the meta-data as a first step and then gradually make the other changes.

If you prefer to have an (online) meeting to discuss some items, then let me know.

Minio.Functional.Tests/Program.cs Outdated Show resolved Hide resolved
Minio.Examples/Program.cs Outdated Show resolved Hide resolved
Minio.Functional.Tests/FunctionalTest.cs Outdated Show resolved Hide resolved
Minio.Functional.Tests/Minio.Functional.Tests.csproj Outdated Show resolved Hide resolved
Minio/ApiEndpoints/BucketOperations.cs Outdated Show resolved Hide resolved
Minio/DataModel/Item.cs Outdated Show resolved Hide resolved
@@ -54,4 +48,6 @@ public string ETag
return dt;
}
}

[XmlElement] public List<MetadataItem> UserMetadata { get; set; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will always add the UserMetadata XML tag and changes the scheme.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is the goal. The scheme is supposed to have it there.
Maybe you meant "don't include it if empty" or maybe something else.
Please clarify.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want explicitly annotate the property with an XML serialization attribute, but you don't do that for the other attributes. Either annotate all or none. What would happen if this attribute was removed?

Also it's a strange choice to use a List here, because metadata is key/value data, so a Dictionairy should be much more logical.

And if you want to annotate it, then breaking XML schemas may be an issue in an API. That's why I would ignore this item when the list/dictionary is empty. That wouldn't break existing code. But looking at this SDK, I don't think that this was ever an issue, so forget about it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data structure List<MetaDataItem> is chosen this way, by the author of the original PR, I think that is because Server side sends the data the same way, and I did not change it.

The existing lint check and compiler enforces a collection definition in a class to be an interface. However, this causes exception when serializing for XML as interfaces are not serializable. So, having [XElement] was the only way I can have a serializable List.
I also addressed your question about if UserMetadata is empty. Now, if it is empty, it will be skipped and it is not going to be part of the XML.

I'd like to push this PR, and file any other requests/comments as a separate issue, if it is OK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a dictionary would be more logical, because:

  1. You can use the index operator [] to fetch a specific key.
  2. No key can every be specified more than once.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, and I agree with you. It would be make it, less complicated. However, all other MinIO SDKs (golang, python, java et.al.) support listing objects with user/custom metadata when desired. This feature is not available in AWS, but (I searched a little) there were requests from customers for this feature to be added to MinIO.
We can discuss this in the meeting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no objection to provide the metadata while listing. I have two issues with the implementation:

  • Userdata should be returned as a dictionary and not a list. Keys can only be used once, so a dictionary would make more sense.
  • The implementation should be done the "MinIO" way using the metadata=true query parameter and not using the StatObject request that needs to be invoked for every object that is returned and kills performance.

We can discuss both in a meeting, but I don't see why. It's obvious for me to make this change. But maybe I'm missing something...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the StatObject api and started using ListObjectsAsync api command's metadata=true query parameter.

Minio/DataModel/Response/GetObjectsListResponse.cs Outdated Show resolved Hide resolved
.WithRecursive(recursive);
.WithRecursive(recursive)
.WithVersions(versions)
.WithUserMetadata(includeUserMetadata);
var observable = minio.ListObjectsAsync(listArgs);
var subscription = observable.Subscribe(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method returns directly after the Subscribe, so it will run in parallel with other tests and that will make it prone to errors. Better use something like this

            var tcs = new TaskCompletionSource();
            using var subscription = observable.Subscribe(
                item => Console.WriteLine($"Object: {item.Key}, content-type: {item.UserMetadata.ToList()[0].Value}"),
                ex => tcs.SetException(ex),
                () => tcs.SetResult());
            await tcs.Task.ConfigureAwait(false);

That will convert the subscription into a task and you can properly wait for it. This problem was already in the code before your PR, but it makes it hard to debug.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix can be in a separate PR as it is completely unrelated.
I'll create a separate issue/PR for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If listing objects in the SDK doesn't work correctly, then I don't think we should even release it.

Copy link
Collaborator Author

@ebozduman ebozduman May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course!
I mistakenly thought that this is about some debug message I forgot to remove and the proposed change is for improvement.
Adding await requires the method to be declared with async keyword, in return that requires some other changes.
Working on it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ramondeklein ,
Done.
Please review it again.

Minio.Examples/Cases/ListObjects.cs Show resolved Hide resolved
@ebozduman ebozduman force-pushed the my_add-user-metadata-ListObjAsync branch from df8e59d to 61fb89a Compare April 26, 2024 09:46
@ebozduman ebozduman changed the title WIP: Add user/custom metadata in ListObjectsAsync api's return Add user/custom metadata in ListObjectsAsync api's return Apr 28, 2024
@ebozduman
Copy link
Collaborator Author

This PR is based on PR#734 (add UserMetadata to ListObjectsAsync method) and replaces it.

@ebozduman
Copy link
Collaborator Author

I'll open up new issues for the following comments, if needed:

  • Clean up parameter default values in function calls, if this doesn't conflict with the lint check, dotnet regitlint.
  • Replace IObservable<Item> with IAsyncEnunmerable. A PR#1049 (Use IAsyncEnumerable for ListObjectsAsync instead of observables) has been already created
  • .NET8.0 support PR#1050 (Add .Net8 support) has been already merged into master.
  • Use TaskCompletion and wait for the task to complete its job to prevent early return immediately after subscribing the observable object.

@ramondeklein
Could you please review again...?

Copy link
Contributor

@ramondeklein ramondeklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix compilation errors:

0>WebIdentityProvider.cs(69,13): Error IDE0059 : Unnecessary assignment of a value to 'credentials' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0059)
0>SelectEventType.cs(27,20): Error IDE0044 : Make field readonly (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0044)
0>RequestExtensions.cs(77,47): Error IDE0060 : Remove unused parameter 'errorHandlers' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0060)
0>ListObjectsItemResponse.cs(28,54): Error IDE0060 : Remove unused parameter 'args' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0060)
0>RuleFilter.cs(45,13): Error IDE0059 : Unnecessary assignment of a value to 'tag' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0059)
0>WebIdentityProvider.cs(69,13): Error IDE0059 : Unnecessary assignment of a value to 'credentials' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0059)
0>RuleFilter.cs(45,13): Error IDE0059 : Unnecessary assignment of a value to 'tag' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0059)
0>SelectEventType.cs(27,20): Error IDE0044 : Make field readonly (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0044)
0>ListObjectsItemResponse.cs(28,54): Error IDE0060 : Remove unused parameter 'args' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0060)
0>RequestExtensions.cs(77,47): Error IDE0060 : Remove unused parameter 'errorHandlers' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0060)
0>WebIdentityProvider.cs(69,13): Error IDE0059 : Unnecessary assignment of a value to 'credentials' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0059)
0>SelectEventType.cs(27,20): Error IDE0044 : Make field readonly (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0044)
0>RequestExtensions.cs(77,47): Error IDE0060 : Remove unused parameter 'errorHandlers' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0060)
0>ListObjectsItemResponse.cs(28,54): Error IDE0060 : Remove unused parameter 'args' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0060)
0>RuleFilter.cs(45,13): Error IDE0059 : Unnecessary assignment of a value to 'tag' (https://learn.microsoft.com/dotnet/fundamentals/code-analysis/style-rules/ide0059)

I can't compile the code and/or run any tests.

Minio.Examples/Cases/ListObjects.cs Show resolved Hide resolved
Minio.Examples/Program.cs Outdated Show resolved Hide resolved
@ramondeklein
Copy link
Contributor

If you think the tests are running fine, them I'm fine with merging this PR. Regarding the NuGet statistics this SDK is used by some people. I think the code is too complex and has too many issues and doesn't reflect MinIO quality standards. I'm still in favor of retiring this SDK.

ramondeklein
ramondeklein previously approved these changes Apr 29, 2024
@harshavardhana
Copy link
Member

If you think the tests are running fine, them I'm fine with merging this PR. Regarding the NuGet statistics this SDK is used by some people. I think the code is too complex and has too many issues and doesn't reflect MinIO quality standards. I'm still in favor of retiring this SDK.

Let's talk about this with @abperiasamy and @kannappanr - to see what we can do about this.

// cc @ebozduman

ramondeklein
ramondeklein previously approved these changes May 2, 2024
@ebozduman
Copy link
Collaborator Author

@ramondeklein,

I think I've addressed the rest of the comments I missed in my previous attempts.
Please review it again and approve if the last commit resolves your comments to your satisfaction.

Copy link
Contributor

@ramondeklein ramondeklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this extended behaviour of including metadata in object listings, because it will generate 1+N behaviour and that's often a recipe to ruin performance. If you need such behaviour, then you should fix your architecture. If you really want it, then implement it yourself (it can be easily done by invoking the StatObject when you get the results).

If I was the maintainer of this repo, then I would probably have answered the initial author that we won't merge his PR.

@harshavardhana What do you think? I have serious issues with this SDK and I don't think this functional change (besides from my other remarks) will make the SDK better.

Comment on lines 115 to 119
// Console.WriteLine(
// $"Percentage: {progressReport.Percentage}% TotalBytesTransferred: {progressReport.TotalBytesTransferred} bytes");
// if (progressReport.Percentage != 100)
// Console.SetCursorPosition(0, Console.CursorTop - 1);
// else Console.WriteLine();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you meant to disable this.

Copy link
Collaborator Author

@ebozduman ebozduman May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ramondeklein,

I'm not a fan of this extended behaviour of including metadata in object listings, ...

MinIO SDKs support including object metadata when listing objects and it is a MinIO specific feature, meaning AWS S3 does not support, but we do.
So, I don't think we have a saying whether it needs to be included or not at this point, as this was a consciously made decision some time ago.

I am totally fine replacing the SDK completely by redesigning/re-architecting it if decided, however until that decision is made, we obviously have to support the current SDK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the Go SDK supports returning metadata if I look at this code:

https://github.com/minio/minio-go/blob/95db455478e33a69eb682dff9772494228695cb9/api-list.go#L540-L542

But that works quite different then the solution that is implemented here. It seems that you can add metadata=true as a query parameter when calling the MinIO API and it returns the metadata in the response (so no additional roundtrips are needed). The implementation that is provided in this PR works different. It will use the normal LIstObjects method and it will call StatObjectAsync for each returned object. The advantage is that it also works on AWS S3, but the drawback is that it is introducing the 1+N call problem that will result in terrible performance.

Sorry for being so critical in this PR, but I think this PR shouldn't be merged this way. If we implement it, then we should use the metadata=true query parameter and be able to deserialize the metadata from the ListObjects response instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you meant to disable this.

This was actually added and kept as an example, but I realize now that it causes confusion, so I'll remove them.
Done.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add metadata=true as a query parameter when calling the MinIO API and it returns the metadata in the response

Totally agree.
I'll start replacing the existing logic and implement the logic in golang way.

Minio.Functional.Tests/Program.cs Show resolved Hide resolved
Minio/ApiEndpoints/BucketOperations.cs Outdated Show resolved Hide resolved
var statObjectArgs = new StatObjectArgs()
.WithBucket(args.BucketName)
.WithObject(itm.Key);
var objStat = await StatObjectAsync(statObjectArgs, cancellationToken)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This results in an 1+n complexity and may result in a lot of API requests. I think returning user meta-data is a bad idea, but if you do want to support it then place an explicit warning that this can seriously degrade performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the previous comment that we should use the metadata=true solution that is also used in the Go SDK.

Copy link
Collaborator Author

@ebozduman ebozduman May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The support decision for this feature is made for all MinIO SDKs.
So, I don't think this decision will be changed IMHO, but there is no harm discussing it.
This will be discussed with @abperiasamy, @harshavardhana and @kannappanr anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is not the feature itself, it is the way it is implemented.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand.
Working on it.

@@ -78,6 +78,7 @@ public HttpRequestMessage Request
case "content-type":
try
{
val ??= "application/octet-stream";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not mandatory to specify the content-type, but if you specify it in the header, then it should be a valid one. I don't think we should assume this (although if we have to assume anything, then this is the best possible value).

Copy link
Collaborator Author

@ebozduman ebozduman May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I understand what needs to be done here.
Please clarify.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I look at this code, customers can set the Content-Type header, but if they set it to an empty value then it will be replaced with application/octet-stream. If you set the Content-Type header, then the user should set it to a valid value or the call is allowed to fail. If you don't want to set it, then just leave out the value.

This PR contains "fixes" that are well beyond the scope of the PR. So I would rollback this change, unless there is a good reason to do this.

@@ -210,8 +210,7 @@ internal static bool IsSupersetOf(IList<string> l1, IList<string> l2)
#if NET6_0_OR_GREATER
ParallelOptions parallelOptions = new()
{
MaxDegreeOfParallelism
= maxNoOfParallelProcesses
MaxDegreeOfParallelism = maxNoOfParallelProcesses, CancellationToken = CancellationToken.None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already the default value, so it doesn't need to be specified (default == CancellationToken.None). A better solution would be to pass the CancellationToken to ForEachAsync and use it here too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the code.
Please let me know if the change addresses your comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a bit of a better look at this method and I have found some additional issues... It's a very confusing function, becuse it can be called in different ways:

  1. Sometimes it is invoked using a list of tasks (i.e. RunCoreTests).
  2. Sometimes it is invoked using an enumerator of tasks (i.e. ReuseTcpTest).

When it's not passed an enumerator, then all tasks are already running. In contrast to some other languages (i.e. Rust), a task is started upon creation. So when you add 100 tasks to a list and then invoke ForEachAsync, then all 100 tasks are already running. The only thing that is done with N threads in parallel is waiting for them to be done (which is a blocking operation). It would be much more efficient to use await Task.WhenAll(...) in these situations instead of pretending to run only N tasks in parallel.

When an enumerator is used (like in the ReuseTcpTest code), then the task created lazy. Not because tasks are lazy, but enumerators are lazy. In that case, there are only up to 8 parallel calls to GetObjectLength at a given moment. So the use of this method here is perfectly valid (if it was meant to test with only 8 calls in parallel). But, it's actually the only case where ForEachAsync would be doing what it is supposed to do.

But actually... It's not doing anything in parallel at all, because the default value of runInParallel is set to false and it's never explicitly set different. So this method is always called sequentially and I don't think this is the intention.

I would change all invocations (except in ReuseTcpTest) to:

await Task.WhenAll(tasks).ConfigureAwait(false);

There is no need to call ForEachAsync for these tasks, unless you want to print the exception.

The call in ReuseTcpTest should be changed to:

var reuseTcpConnectionTasks =
  Enumerable.Range(0, 500).Select(_ => GetObjectLength(bucket, objectName));
await reuseTcpConnectionTasks.ForEachAsync(runInParallel: true, maxNoOfParallelProcesses: 8)
                             .ConfigureAwait(false);

You may even consider to drop the runInParallel completely and assume parallel execution if maxNoOfParallelProcesses is set higher than 1. That would make it less error-prone...

PS: If you want to know all about concurrency in C#, then I can recommend the book Concurrency in C# Cookbook by Stephen Cleary. He's the absolute authority on this topic and it's a great book. Highly recommended...

@@ -1040,7 +1039,7 @@ public static T DeserializeXml<T>(Stream stream) where T : class, new()
if (stream == null || stream.Length == 0) return default;

var ns = GetNamespace<T>();
if (!string.IsNullOrWhiteSpace(ns) && string.Equals(ns, "http://s3.amazonaws.com/doc/2006-03-01/",
if (string.Equals(ns, "http://s3.amazonaws.com/doc/2006-03-01/",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was already incorrect, but XML namespace prefixes and URIs are case sensitve.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -20,9 +20,10 @@ namespace Minio.Helper;

public class AmazonAwsS3XmlReader : XmlTextReader
{
public new string NamespaceURI = "http://s3.amazonaws.com/doc/2006-03-01/";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is new needed here? It's generally bad to override existing values.

Copy link
Collaborator Author

@ebozduman ebozduman May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new is enforced by the VSCode's inline suggestion tool.
I removed it and replaced it with override, which I guess, was the purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this file needs to be reverted. Using the new operator on the NamespaceURI property actual does what it is named. It creates a NEW property and doesn't override the exiting one. So if base classes refer to NamespaceURI, then they will still access the OLD one. Only in your class and later, it will use the NEW property. It's typically bad if you need the new keyword here, so please revert these changes.

This change is also bad, because you need to specify the same URL twice.

Copy link
Collaborator Author

@ebozduman ebozduman May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ramondeklein,
newis replaced with an override.

@@ -54,4 +48,6 @@ public string ETag
return dt;
}
}

[XmlElement] public List<MetadataItem> UserMetadata { get; set; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a dictionary would be more logical, because:

  1. You can use the index operator [] to fetch a specific key.
  2. No key can every be specified more than once.

@ebozduman ebozduman force-pushed the my_add-user-metadata-ListObjAsync branch from ece5d42 to 827f38f Compare May 22, 2024 05:58
@ebozduman
Copy link
Collaborator Author

Closing this PR.
@ramondeklein's new approach (minio-dotnet2) solution for LisstObjectsAsync returning the user defined metadata information when desired.

@ebozduman ebozduman closed this May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants