Does this library support customize function for xpath? #144

Open
zmjack opened this issue Feb 14, 2018 · 4 comments · May be fixed by #342
@zmjack
Contributor

zmjack commented Feb 14, 2018

Does this library support custom functions for XPath?

For example, given this HTML string:

<div id="info"></div>
<div id="category_1"></div>
<div id="category_2"></div>
<div id="output"></div>

I want to find all <div> elements whose id starts with category. In this case, they are category_1 and category_2.

To do that, I need to define a custom function named match in the namespace fn, so that I can use the following XPath expression:

//div[fn:match(@id, 'category_\d+')]

But this does not seem to be supported in the current version. If that is the case, I think I will need to create a pull request to add this feature.

Looking forward to your reply.

JonathanMagnan self-assigned this Feb 14, 2018
@zmjack
Contributor Author

zmjack commented Feb 22, 2018

I received your email and tried to use the new library.

It works very well. Before, I used my own extension method, which looked like this:

public static HtmlNodeCollection SelectNodes(this HtmlNode @this, XPathExpression xpath)
{
    // Reflection
    var _ownerdocument = (HtmlDocument)typeof(HtmlNode)
        .GetField("_ownerdocument", BindingFlags.Instance | BindingFlags.NonPublic)
        .GetValue(@this);

    var nav = (HtmlNodeNavigator)typeof(HtmlNodeNavigator)
        .GetConstructor(BindingFlags.Instance | BindingFlags.NonPublic, null, new Type[] { typeof(HtmlDocument), typeof(HtmlNode) }, null)
        .Invoke(new object[] { _ownerdocument, @this });
    // End

    HtmlNodeCollection list = new HtmlNodeCollection(null);

    XPathNodeIterator it = nav.Select(xpath);
    while (it.MoveNext())
    {
        HtmlNodeNavigator n = (HtmlNodeNavigator)it.Current;
        list.Add(n.CurrentNode);
    }

    if (list.Count == 0)
    {
        return null;
    }

    return list;
}

Now I no longer need my extension method, but I use another library called Dawnx.Xml to generate the XPathExpression. I created that library because the XsltContext in the standard library is difficult to use.

Of course, you can use System.Xml.Xsl.XsltContext to compile the XPathExpression instead. It doesn't matter which one you use; we just need an XPathExpression.
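For readers who want to try the standard-library route mentioned above, here is a minimal sketch using only System.Xml. The names RegexMatchFunction, RegexContext, and Demo.MatchingIds are hypothetical and chosen for illustration; they are not part of HtmlAgilityPack or Dawnx.Xml. The sketch evaluates the expression against an XPathDocument so that it stays self-contained. With HtmlAgilityPack, the compiled expression would presumably be passed to SelectNodes as shown elsewhere in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical extension function: fn:match(value, pattern) -> boolean.
sealed class RegexMatchFunction : IXsltContextFunction
{
    public XPathResultType[] ArgTypes => new[] { XPathResultType.Any, XPathResultType.String };
    public int Minargs => 2;
    public int Maxargs => 2;
    public XPathResultType ReturnType => XPathResultType.Boolean;

    public object Invoke(XsltContext xsltContext, object[] args, XPathNavigator docContext)
    {
        // Unanchored match, like new Regex(regex).Match(prop).Success in the thread.
        return Regex.IsMatch(AsString(args[0]), AsString(args[1]));
    }

    // Node-set arguments such as @id arrive as an iterator; use the first node's value.
    private static string AsString(object arg)
    {
        if (arg is XPathNodeIterator it)
            return it.MoveNext() ? it.Current.Value : string.Empty;
        return Convert.ToString(arg) ?? string.Empty;
    }
}

// Hypothetical context that resolves fn:match and nothing else.
sealed class RegexContext : XsltContext
{
    public const string Uri = "http://uri"; // placeholder namespace, as in the thread

    public RegexContext() { AddNamespace("fn", Uri); }

    public override bool Whitespace => true;
    public override bool PreserveWhitespace(XPathNavigator node) => true;
    public override int CompareDocument(string baseUri, string nextbaseUri) => 0;

    public override IXsltContextFunction ResolveFunction(
        string prefix, string name, XPathResultType[] argTypes)
    {
        // Returning null makes the XPath engine report an undefined function.
        return LookupNamespace(prefix) == Uri && name == "match"
            ? new RegexMatchFunction()
            : null;
    }

    public override IXsltContextVariable ResolveVariable(string prefix, string name) => null;
}

static class Demo
{
    // Compiles the XPath with the custom context and returns the id of each selected node.
    public static List<string> MatchingIds(string xml, string xpath)
    {
        XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
        XPathExpression expr = XPathExpression.Compile(xpath);
        expr.SetContext(new RegexContext());

        var ids = new List<string>();
        XPathNodeIterator it = nav.Select(expr);
        while (it.MoveNext())
            ids.Add(it.Current.GetAttribute("id", ""));
        return ids;
    }
}
```

Calling Demo.MatchingIds with the four <div> elements wrapped in a single root element and the expression //div[fn:match(@id, 'category_\d+')] should select category_1 and category_2. Most of the code is boilerplate for the abstract members of XsltContext, which is probably the difficulty referred to above.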

This is the code I used to solve the problem before using HtmlAgilityPack 1.7.0.

class MyContext : XPathContext
{
    public override string DefaultNamespace => "http://uri";

    [XPathFunction("match", XPathResultType.NodeSet, XPathResultType.String)]
    public bool RegexMatch(string prop, string regex, XPathNavigator docContext)
    {
        return new Regex(regex).Match(prop).Success;
    }
}

static void Main(string[] args)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(@"<div id=""info"">info</div>
<div id=""category_1"">category_1</div>
<div id=""category_2"">category_2</div>
<div id=""output"">output</div>");

    var context = new MyContext();
    context.AddNamespace("fn", context.DefaultNamespace);
    context.AddParam("regex", context.DefaultNamespace, @"category_\d+");

    var xpath = context.Compile(@"//div[fn:match(@id, $fn:regex)]");
    var nodes = doc.DocumentNode.SelectNodes(xpath);

    foreach (var node in nodes)
    {
        Console.WriteLine(HttpUtility.HtmlDecode(node.InnerHtml));
    }
}

The result is:
category_1
category_2

Thank you. Have a nice day!

@JonathanMagnan
Member

That's great ;)

I will try to look at your library once it gets released.

Best Regards,

Jonathan

@zmjack
Contributor Author

zmjack commented Nov 10, 2019

Hello!
I just created a pull request to solve this issue. Here is the test code:

class MyContext : XPathContext
{
    public MyContext() { }
    public MyContext(string prefix) : base(prefix) { }

    [XPathFunction("match")]
    public bool RegexMatch1(string content, string regex)
    {
        Console.WriteLine($"  * Invoke: {nameof(RegexMatch1)}");
        return new Regex(regex).Match(content).Success;
    }

    [XPathFunction("match")]
    public bool RegexMatch2(string regex, XPathNavigator docContext)
    {
        Console.WriteLine($"  * Invoke: {nameof(RegexMatch2)}");
        return new Regex(regex).Match(docContext.InnerXml).Success;
    }
}

class Program
{
    static void Main(string[] args)
    {
        var html = @"
<div id=""info"">hello</div>
<div class=""category1"">category_1</div>
<div class=""category2"">category_2</div>
<div id=""output"">bye</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var ctx = new MyContext("re");

        Console.WriteLine($"HTML:{html}");
        Console.WriteLine();

        // Samples
        var xpaths = new[]
        {
            @"//div[re:match(@class, 'category\d+')]",
            @"//div[re:match('(hello|bye)')]",
        };
        foreach (var xpath in xpaths)
        {
            Console.WriteLine($"XPath: {xpath}");
            var nodes = doc.DocumentNode.SelectNodes(ctx[xpath]);
            foreach (var node in nodes)
                Console.WriteLine(node.InnerHtml);
            Console.WriteLine();
        }
    }
}

Console output:

HTML:
<div id="info">hello</div>
<div class="category1">category_1</div>
<div class="category2">category_2</div>
<div id="output">bye</div>

XPath: //div[re:match(@class, 'category\d+')]
  * Invoke: RegexMatch1
  * Invoke: RegexMatch1
  * Invoke: RegexMatch1
  * Invoke: RegexMatch1
category_1
category_2

XPath: //div[re:match('(hello|bye)')]
  * Invoke: RegexMatch2
  * Invoke: RegexMatch2
  * Invoke: RegexMatch2
  * Invoke: RegexMatch2
hello
bye

@JonathanMagnan
Member

Thank you @zmjack ,

We will review it.

Best Regards,

Jonathan

