grubby

Fail-fast web scraping. grubby adds a layer of utility and error-checking atop the marvelous Mechanize gem. See the API listing below, or browse the full documentation.
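As a small illustration of the fail-fast approach, grubby's bang-variant query methods (search!, at!) behave like their Nokogiri counterparts, but raise a descriptive error when nothing matches instead of returning nil. A minimal sketch (the URL and selector are placeholders):

require "grubby"

grubby = Grubby.new
page = grubby.get("https://example.com")

page.at("h1.does-not-exist")   # plain Mechanize / Nokogiri: returns nil
page.at!("h1.does-not-exist")  # grubby: raises an exception immediately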

Examples

The following code scrapes stories from the Hacker News front page:

require "grubby"

class HackerNews < Grubby::PageScraper
  # `scrapes` defines an attribute and the logic that extracts it.  If
  # the block raises, or returns nil for a non-optional attribute, the
  # whole scrape fails immediately rather than yielding partial data.
  scrapes(:items) do
    page.search!(".athing").map{|element| Item.new(element) }
  end

  class Item < Grubby::Scraper
    scrapes(:story_link){ source.at!("a.storylink") }

    scrapes(:story_url){ expand_url(story_link["href"]) }

    scrapes(:title){ story_link.text }

    # Job postings have no comments link, so this attribute is optional.
    scrapes(:comments_link, optional: true) do
      source.next_sibling.search!(".subtext a").find do |link|
        link.text.match?(/comment|discuss/)
      end
    end

    # `if: :comments_link` skips this attribute (leaving it nil) when
    # comments_link was not found.
    scrapes(:comments_url, if: :comments_link) do
      expand_url(comments_link["href"])
    end

    scrapes(:comment_count, if: :comments_link) do
      comments_link.text.to_i
    end

    def expand_url(url)
      url.include?("://") ? url : source.document.uri.merge(url).to_s
    end
  end
end

# The following line will raise an exception if anything goes wrong
# during the scraping process.  For example, if the structure of the
# HTML does not match expectations due to a site change, the script will
# terminate immediately with a helpful error message.  This prevents bad
# data from propagating and causing hard-to-trace errors.
hn = HackerNews.scrape("https://news.ycombinator.com/news")

# Your processing logic goes here:
hn.items.take(10).each do |item|
  puts "* #{item.title}"
  puts "  #{item.story_url}"
  puts "  #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
  puts
end
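Scraped values are ordinary Ruby objects, so downstream processing is plain Ruby. For example, a sketch that exports the first ten stories as JSON lines (the field selection here is arbitrary):

require "grubby"
require "json"

hn = HackerNews.scrape("https://news.ycombinator.com/news")

hn.items.take(10).each do |item|
  # Each scraped attribute is available as a reader method on the scraper.
  puts JSON.generate({
    title:    item.title,
    url:      item.story_url,
    comments: item.comment_count,  # nil when the story has no comments link
  })
end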

Hacker News also offers a JSON API, which may be more robust for scraping purposes. grubby can scrape JSON just as well:

require "grubby"

class HackerNews < Grubby::JsonScraper
  scrapes(:items) do
    # API returns array of top 500 item IDs, so limit as necessary
    json.take(10).map do |item_id|
      Item.scrape("https://hacker-news.firebaseio.com/v0/item/#{item_id}.json")
    end
  end

  class Item < Grubby::JsonScraper
    # Some posts (e.g. Ask HN) have no external URL; fall back to the
    # story's own Hacker News page.
    scrapes(:story_url){ json["url"] || hn_url }

    scrapes(:title){ json["title"] }

    scrapes(:comments_url, optional: true) do
      hn_url if json["descendants"]
    end

    scrapes(:comment_count, optional: true) do
      json["descendants"]&.to_i
    end

    def hn_url
      "https://news.ycombinator.com/item?id=#{json["id"]}"
    end
  end
end

hn = HackerNews.scrape("https://hacker-news.firebaseio.com/v0/topstories.json")

# Your processing logic goes here:
hn.items.each do |item|
  puts "* #{item.title}"
  puts "  #{item.story_url}"
  puts "  #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
  puts
end
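grubby also provides conveniences for crawling. For example, Grubby#singleton executes its block only for URLs that have not been processed before, and a journal file makes that memory persist across runs. A sketch (the journal filename is arbitrary; see the full documentation for the exact semantics):

require "grubby"

# The journal file records which URLs have been processed, so repeated
# runs of this script skip work that is already done.
grubby = Grubby.new("hn.journal")

grubby.singleton("https://hacker-news.firebaseio.com/v0/topstories.json") do |source|
  # This block runs only if the URL has not been seen before.
  hn = HackerNews.new(source)
  puts "Scraped #{hn.items.length} items"
end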

Core API

See the full documentation for a complete listing of grubby's core classes and methods.

Auxiliary API

grubby loads several gems that extend Ruby objects with utility methods. See each gem's documentation for a complete API listing.

Installation

Install the grubby gem:
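$ gem install grubby

Or, in a Gemfile:

gem "grubby"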

Contributing

Run rake test to run the tests.

License

MIT License