Selenium + Headless Chrome scraper that calculates actual full web page sizes (including dynamic content).

jorgeorpinel/site-page-size-scraper

Written for the article "Webpages Are Getting Larger Every Year, and Here’s Why it Matters"
Author: Jorge Orpinel Perez
© 2018 Pingdom AB.

Website Page Size Scraper

Python script that uses Selenium and Headless Chrome to determine the average page size across a list of websites. The measured size includes transferSize as well as any other content loaded dynamically to display the home page of each site.
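To make the idea concrete, here is a minimal sketch of how one page could be measured. It is not the code in from_list.py: the function names, the use of performance.getEntries(), and the executor URL (chromedriver's default port 9515) are assumptions made for illustration. Note that browsers report a transferSize of 0 for cross-origin resources that do not send a Timing-Allow-Origin header, so a sum like this can undercount.

```python
def total_transfer_size(entries):
    """Sum transferSize (bytes) over performance entries; entries without one count as 0."""
    return sum(e.get("transferSize") or 0 for e in entries)


def measure_page(url, executor="http://127.0.0.1:9515"):
    """Hypothetical helper: load url in headless Chrome and return the bytes transferred.

    Assumes chromedriver is already running and listening at `executor`.
    """
    # Imported here so total_transfer_size() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Remote(command_executor=executor, options=options)
    try:
        driver.get(url)
        # Collect navigation and resource timing entries recorded by the browser.
        entries = driver.execute_script(
            "return performance.getEntries().map("
            "e => ({name: e.name, transferSize: e.transferSize || 0}));"
        )
        return total_transfer_size(entries)
    finally:
        driver.quit()
```

With chromedriver running, measure_page("https://example.com") would return a byte count for that page.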

Installation

This tool was developed and run with Python 3.6.5 on macOS 10.13.

Later versions should continue to work.

External dependencies

Required Python package

See requirements.txt

  • Python language bindings for Selenium WebDriver (selenium 3.14 used)

To install, we will use virtualenv:

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Virtualenv installs pip automatically.

Usage

Save a list of web page URIs (one per line) in a plain text file. Included in 2018-09-15-alexa-topsites-50-preview.txt is a sample list of 50 top sites published by Alexa (Sep 2018).
Make sure the script is executable by your user:

chmod u+x from_list.py

You may now run it:

chromedriver 2> /dev/null &  # Listens on port 9515 by default. Runs in background.
./from_list.py 2018-09-15-alexa-top-sites-50.txt

See the file docstring in from_list.py for further info.

Don't forget to stop chromedriver after running the Python script, e.g.:

fg  # Bring chromedriver to the foreground
^C  # Ctrl+C
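The input format and the final averaging step can be sketched as follows. This is only an illustration under assumptions: parse_url_list and average_size are hypothetical names, not functions from from_list.py, and blank lines in the list are assumed to be ignorable.

```python
def parse_url_list(text):
    """Return the non-empty, whitespace-stripped lines of text (one URI per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def average_size(sizes):
    """Return the mean of a list of page sizes in bytes, or 0.0 for an empty list."""
    return sum(sizes) / len(sizes) if sizes else 0.0


if __name__ == "__main__":
    sample = "https://a.example\n\n  https://b.example  \n"
    print(parse_url_list(sample))
    print(average_size([100, 300]))
```

In practice the sizes list would hold one measured value per URI in the file.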
