Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XMLNodeSet::xpath and XMLNodeSet::css are very slow with large number of nodes #1795

Closed
jvshahid opened this issue Sep 15, 2018 · 1 comment · Fixed by #1796
Closed

XMLNodeSet::xpath and XMLNodeSet::css are very slow with large number of nodes #1795

jvshahid opened this issue Sep 15, 2018 · 1 comment · Fixed by #1796
Milestone

Comments

@jvshahid
Copy link
Member

This is extracted from #1792 (comment)

Here is the test script I am using:

require 'nokogiri'

items = %Q(
<div class="test">
  <a href="something">derp</a>
</div>
) * ARGV[0].to_i

html = %Q(
<html>
  <body>
    #{items}
  </body>
</html>
)
doc = Nokogiri::HTML html
puts "pid is #{$$}"

start = Time.now
links = doc.css('html body .test a')
puts "Without nesting got #{links.length} links in #{Time.now - start}s" # This will return immediately

start = Time.now
links = doc.css('html body .test').css('a')
puts "With nesting got #{links.length} links in #{Time.now - start}s" # This will take a LONG time to finish

Here is the output with 5000 divs:

pid is 49017
Without nesting got 5000 links in 0.225391s
With nesting got 5000 links in 2.666568s

And here is the output with 50,000 divs:

pid is 49113
Without nesting got 50000 links in 0.7900900000000001s
With nesting got 50000 links in 296.062414s

So in both cases the nested version is several orders of magnitude slower than the unnested version. And it seems to get exponentially worse for larger numbers of matched elements.

@jvshahid
Copy link
Member Author

Based on the profiling I did, it looks like RubyArray#op_or is where we spend most of the time. To be more specific most of the time is spent calculating hashes of the objects here.

jvshahid added a commit that referenced this issue Sep 15, 2018
The current implementation suffers from having poor performance.  More
specifically, `op_or' is order of magnitude slower than the MRI version slows
down XMLNodeSet::xpath and XMLNodeSet::css significantly.  This
commit rewrites the class to manage its own internal array, similar to what
libxml does.

fixes #1795
@flavorjones flavorjones added this to the next milestone Dec 1, 2018
flavorjones pushed a commit that referenced this issue Dec 1, 2018
The current implementation suffers from having poor performance.  More
specifically, `op_or' is order of magnitude slower than the MRI version slows
down XMLNodeSet::xpath and XMLNodeSet::css significantly.  This
commit rewrites the class to manage its own internal array, similar to what
libxml does.

fixes #1795
flavorjones pushed a commit that referenced this issue Dec 1, 2018
The current implementation suffers from having poor performance.  More
specifically, `op_or' is order of magnitude slower than the MRI version slows
down XMLNodeSet::xpath and XMLNodeSet::css significantly.  This
commit rewrites the class to manage its own internal array, similar to what
libxml does.

fixes #1795
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants