add example to tutorials: how to grab an HTML section (between like headers) #28

Open
flavorjones opened this issue Aug 28, 2020 · 0 comments

I wrote this up to answer somebody's question sometime in the past year, but I can't remember who or where. I think it's a good example of moving from a short-but-specific solution to a longer-but-general solution, and it hopefully teaches folks about custom XPath handlers and XPath queries along the way.

#! /usr/bin/env ruby
#
#  TODO: put this in the nokogiri.org tutorials
#

require "nokogiri"

html = <<~EOF
  <html>
  <body>
    <h1>My Fakepedia Page</h1>
    <div id="bodyContent">
      <div id="mw-content-text">
        <div class="mw-parser-output">
          <div id="toc">...</div>

          <h2>Background</h2>
          <p>This is uninteresting content and you don't want to scrape it.</p>

          <h2>Good Stuff</h2>
          <p>This is the good stuff.</p>
          <p>You really want to scrape just this section.</p>

          <h2>Unrelated Stuff</h2>
          <p>This is where the author has gone off on a tangent.</p>

          <h2>References</h2>
          <p>Snoozapalooza.</p>
        </div>
      </div>
    </div>
  </body>
  </html>
EOF

doc = Nokogiri::HTML(html)

#
#  solution 1 - simple query, process results in Ruby
#
#  I think you will agree that this is ugly code, and it makes a lot of
#  implicit assumptions about the structure of the document.
#
#  don't do this. better alternatives are provided below.
#
node_set = doc.css("div.mw-parser-output").children

# look forward until we get to the h2 that we want
start_index = 0
while !(node_set[start_index].name == "h2" && node_set[start_index].content == "Good Stuff")
  start_index += 1
end
start_index += 1

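# look forward until we get to the next h2 (or run out of nodes, which
# happens when the section we want is the last one)
end_index = start_index
while node_set[end_index + 1] && node_set[end_index + 1].name != "h2"
  end_index += 1
end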

# slice the node set
puts node_set[start_index..end_index]
puts "-----"

#
#  solution 2 - using an XPath function to perform set intersection
#
#  this is much cleaner code, but it still assumes you know which
#  header follows the section you want ("Unrelated Stuff" here).
#
#  a better alternative is provided below
#
class XPathIntersection
  def self.intersection(set1, set2)
    set1 & set2 # in ruby, return the intersection of the NodeSets
  end
end

xpath_query = <<~EOX
  intersection(//h2[text()='Good Stuff']/following-sibling::*,
               //h2[text()='Unrelated Stuff']/preceding-sibling::*)
EOX

puts doc.xpath(xpath_query, XPathIntersection)
puts "-----"

#
#  solution 3 - write a method to introspect on the document and use
#  more XPath queries to find the section boundary and return only the
#  nodes within the section.
#
#  note that it:
#  - works for any header level (h1, h2, h3, et al)
#  - works even if no later header of the same level follows
#  - only requires knowing the text of the header you care about
#
#  it uses:
#  - Node#path which returns an XPath query that points just to this node
#  - Node#name which returns the tag of the node (e.g., "h2", "div")
#
class XPathHeaderSection
  def self.header_section(node_set)
    document = node_set.document
    header = node_set.first

    # grab siblings that follow the target header
    following_siblings_query = "#{header.path}/following-sibling::*"
    following_siblings = document.xpath(following_siblings_query)

    # check if there's a next header of the same type that's a sibling
    next_header_query = "#{header.path}/following-sibling::#{header.name}"
    next_header = document.at_xpath(next_header_query)

    if next_header
      preceding_siblings_query = "#{next_header.path}/preceding-sibling::*"
      preceding_siblings = document.xpath(preceding_siblings_query)

      following_siblings & preceding_siblings # NodeSet intersection, computed in Ruby
    else
      following_siblings
    end
  end
end

puts XPathHeaderSection.header_section(doc.xpath("//h2[text()='Good Stuff']"))

# note that you can also call this method as an XPath function
puts doc.xpath("header_section(//h2[text()='Good Stuff'])", XPathHeaderSection)
2 participants
@flavorjones and others