Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HtmlAttributeCollection: Hashitems and therefore the this[string name] indexer getting out of sync when removing one of multiple attributes with same name #464

Open
elgonzo opened this issue Jan 15, 2022 · 3 comments
Assignees

Comments

@elgonzo
Copy link
Contributor

elgonzo commented Jan 15, 2022

1. Description

The this[string name] indexer of the HtmlAttributeCollection does not operate on the internal items list, but operates on a separate internal dictionary Hashitems instead. For this[string name] to behave correctly, it is therefore crucial to keep Hashitems accurately in sync whenever and however the items collection is being modified.

Unfortunately, the Hashitems dictionary is not maintained properly when removing items from the HtmlAttributeCollection. I already filed an issue about one such problem scenario yesterday (#463). Later i found another problem scenario involving other methods of HtmlAttributeCollection that lead to Hashitems getting out of sync with items and therefore the this[string name] indexer operating incorrectly, and which i am going to describe and demonstrate below.

This problem scenario relies on an element/node having multiple attributes with the same name, like for example:

<div class="One" class="Two" class="Three">Hello World</div>

Removing just one of these class attributes will cause the this[string name] indexer to act as if there were no other class attributes existing in the <div> node, incorrectly returning null despite the HtmlAttributeCollection still possessing class attributes.

This problem scenario can be reproduced using either the Remove(HtmlAttribute) or the RemoveAt(int) method.

2. Fiddle or Project

Demo using Remove(HtmlAttribute): https://dotnetfiddle.net/5y7cH7

Click to expand the source code
using HtmlAgilityPack;
using System;

namespace HtmlAttributeCollection_HashItemsGettingOutOfSyncWithRemove
{
    class Program
    {
		static void Main(string[] args)
		{
			var html = @"<div class=""One"" class=""Two"" class=""Three"">Hello World</div>";

			var htmlDoc = new HtmlDocument();
			htmlDoc.LoadHtml(html);

			var divNode = htmlDoc.DocumentNode.SelectSingleNode("//div");

			var a1 = divNode.Attributes["class"];
			var str1 = (a1 is null) ? "<NULL>" : $"{a1.Name}={a1.Value}";
			Console.WriteLine($"divNode.Attributes[\"class\"] before Remove(HtmlAttribute): {str1}");

			var secondClassAttribute = divNode.Attributes[1];
			Console.WriteLine($"Second \"class\" attribute in collection: {secondClassAttribute.Name}={secondClassAttribute.Value}");

			Console.WriteLine();
			Console.WriteLine("Executing divNode.Attributes.Remove(secondClassAttribute)");
			Console.WriteLine();
			divNode.Attributes.Remove(secondClassAttribute);

			var a2 = divNode.Attributes["class"];
			var str2 = (a2 is null) ? "<NULL>" : $"{a2.Name}={a2.Value}";
			Console.WriteLine($"divNode.Attributes[\"class\"] after Remove(HtmlAttribute): {str2}");

			Console.WriteLine();
			Console.WriteLine("All attributes in divNode.Attributes after removal of the second \"class\" attribute:");
			foreach (var attr in divNode.Attributes)
				Console.WriteLine($"\t{attr.Name}={attr.Value}");
		}
	}
}

Demo using RemoveAt(int): https://dotnetfiddle.net/BJQQkT

Click to expand the source code
using HtmlAgilityPack;
using System;

namespace HtmlAttributeCollection_HashItemsGettingOutOfSyncWithRemoveAt
{
    class Program
    {
		static void Main(string[] args)
		{
			var html = @"<div class=""One"" class=""Two"" class=""Three"">Hello World</div>";

			var htmlDoc = new HtmlDocument();
			htmlDoc.LoadHtml(html);

			var divNode = htmlDoc.DocumentNode.SelectSingleNode("//div");

			var a1 = divNode.Attributes["class"];
			var str1 = (a1 is null) ? "<NULL>" : $"{a1.Name}={a1.Value}";
			Console.WriteLine($"divNode.Attributes[\"class\"] before Remove(HtmlAttribute): {str1}");

			var secondClassAttribute = divNode.Attributes[1];
			Console.WriteLine($"Second \"class\" attribute in collection: {secondClassAttribute.Name}={secondClassAttribute.Value}");

			Console.WriteLine();
			Console.WriteLine("Executing divNode.Attributes.RemoveAt(1)");
			Console.WriteLine();
			divNode.Attributes.RemoveAt(1);

			var a2 = divNode.Attributes["class"];
			var str2 = (a2 is null) ? "<NULL>" : $"{a2.Name}={a2.Value}";
			Console.WriteLine($"divNode.Attributes[\"class\"] after Remove(HtmlAttribute): {str2}");
			
			Console.WriteLine();
			Console.WriteLine("All attributes in divNode.Attributes after removal of the second \"class\" attribute:");
			foreach (var attr in divNode.Attributes)
				Console.WriteLine($"\t{attr.Name}={attr.Value}");
		}
	}
}

In both demos, divNode.Attributes["class"] yields the class="Three" attribute before the removal operation is attempted.

Then both demos proceed to remove the second class attribute class="Two".

After the removal of the second class attribute, divNode.Attributes["class"] incorrectly yields null, despite the <div> node possessing two class attributes at that point (class="One" and class="Three").

3. Any further technical details

  • HAP version: 1.11.40
  • NET 5.0

Side note: It seems to me the behavior of the this[string name] indexer is not completely and unambiguously specified when dealing with multiple attributes having the same name. What is its intended behavior here? Should it always return the last in the order of the attributes with the same name? The first? Should it return the last accessed attribute of the same name (but what if the last such access was removal of such an attribute, and what would be the initial state when there have been no attributes accessed yet)? Should it remain random or undefined, with you basically knowing only that you are getting some attribute with the specified name, but you are rolling the dice on which exactly you get, and hope and cross fingers that it is the one you actually wanted? Some other behavior my weekend-addled brain can't think of right now....? ;-)

@JonathanMagnan JonathanMagnan self-assigned this Jan 17, 2022
@JonathanMagnan
Copy link
Member

Hello @elgonzo ,

Thank you for reporting,

We will look at it.

Best Regards,

Jon


Sponsorship
Help us improve this library

Performance Libraries
context.BulkInsert(list, options => options.BatchSize = 1000);
Entity Framework ExtensionsBulk OperationsDapper Plus

Runtime Evaluation
Eval.Execute("x + y", new {x = 1, y = 2}); // return 3
C# Eval FunctionSQL Eval Function

@JonathanMagnan
Copy link
Member

Hello @elgonzo ,

Having multiple attributes with the same name is not really legal in HTML.

Even major browsers will only read the first attribute and don't show other ones.

So even if we currently have weird behavior, unless I miss something, I do not think that we really need to support this case as this is not a valid case. So getting weird behavior is expected here.

Let me know if that makes sense.

Best Regards,

Jon

@elgonzo
Copy link
Contributor Author

elgonzo commented Jan 20, 2022

I strongly disagree with keeping the current behavior. Here is why:

HtmlAttributeCollection is able and does maintain multiple attributes of the same name.
Iterating over the HtmlAttributeCollection will yield all of the attributes of the same name remaining in the collection, even after removing one of those attributes.
Accessing each of the attributes through the numeric indexer works just fine, even after removing one of the attributes having the same name.

The point is not whether multiple attributes of the same name are illegal in HTML, and thus weird behavior should be accepted. (If that were the point, why was #461 being addressed and fixed, even though the very same argument could have been made there?)

No, the point is that the behavior of the string indexer this[string name] is inconsistent with the other functions of HtmlAttributeCollection itself. That's the real issue here. Instead of "weird behavior" i would prefer to either see the this[string name] indexer fixed or disallow HtmlAttributeCollection from having multiple attributes with the same name. Keeping HtmlAttributeCollection APIs with unexpected, surprising, and most importantly inconsistent behavior is the worst choice to make in my opinion.

Even major browsers will only read the first attribute and don't show other ones.

If that's the case, i would suggest the this[string name] indexer always provide the left-most attribute with that name (i.e., the one with the lowest index in the internal items collection.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants