This repository has been archived by the owner on Nov 3, 2021. It is now read-only.

A dockerizable module that prerenders web pages and caches them to S3

tournamentmgr/Sitemap-Prerendering-S3

Sitemap Prerendering

This module runs as a prerender client that caches to S3. It renders web pages, either locally or in Docker, and then uploads the rendered static HTML to S3. The goal is to give bots a set of static HTML pages to scan.

Prereqs

Development

If you are developing the module, install the dependencies listed in requirements.txt:

pip install -r requirements.txt

Utilization

Docker

docker build -t prerender .

docker run -it -e AWS_ACCESS_KEY_ID=AWSKEY -e AWS_SECRET_ACCESS_KEY=AWSSECRET prerender python -c "from prerender.prerender import Prerender; Prerender(#Options).capture()"

Local Installation

Install the modules:

python scraper/setup.py install

python prerender/setup.py install


Create Python Code

from prerender.prerender import Prerender

pre = Prerender( # Options )

Options

Required | Variable | Info
True | robots_url | URL of your site's root robots.txt file, which must reference the sitemap(s) to crawl.
True | s3_bucket | Name of the S3 bucket used as the cache archive.
False | auth | Credentials for HTTP basic authentication, if the pages require it.
False | query_char_deliminator | (Recommended) Character that replaces the question mark in stored page names. When static pages are served from S3, a ? in the object name is read as the start of a query string, so the stored page cannot be served. Replacing it with another character avoids this. Example: with /subpage?id=1 and a query_char_deliminator of '#', the page is stored as /subpage#id=1.
False | allowed_domains | List of domains to allow. If specified, requests to all other domains are blocked during page capture.
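The two optional settings above are easier to follow with a small illustration. The sketch below is not the module's code: to_s3_key and is_allowed are hypothetical helpers, and exact host matching for allowed_domains is an assumption. It only shows the behaviour the table describes, using the standard library.

```python
from urllib.parse import urlsplit


def to_s3_key(url_path, query_char_deliminator="#"):
    """Hypothetical helper: swap the '?' that starts the query string
    for the configured character, as described for query_char_deliminator."""
    return url_path.replace("?", query_char_deliminator, 1)


def is_allowed(url, allowed_domains=None):
    """Hypothetical helper: with no allow list every URL passes;
    otherwise only hosts that exactly match an entry pass (assumed rule)."""
    if not allowed_domains:
        return True
    return urlsplit(url).hostname in allowed_domains


print(to_s3_key("/subpage?id=1"))  # prints /subpage#id=1
print(is_allowed("https://example.com/page", ["example.com"]))
print(is_allowed("https://ads.tracker.net/x.js", ["example.com"]))
```

Whether the real module matches subdomains or only exact hosts is not stated in this README, so check the source before relying on either behaviour.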

Module invocation

Invalidate/Clear bucket:

pre.invalidate()

Capture every page listed in the sitemaps referenced by robots.txt:

pre.capture()
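For readers unfamiliar with how page URLs can be discovered from robots.txt, here is a minimal sketch of the general technique. It is an assumption about the approach, not the module's actual implementation: the function names and sample data are made up, and it uses only the Python standard library (RobotFileParser.site_maps() requires Python 3.8 or later).

```python
import urllib.robotparser
import xml.etree.ElementTree as ET

# Sample inputs, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/subpage?id=1</loc></url>
</urlset>
"""


def sitemap_urls_from_robots(robots_txt):
    """Return the Sitemap: entries found in the text of a robots.txt file."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when empty


def page_urls_from_sitemap(sitemap_xml):
    """Return every <loc> value listed in a sitemap <urlset> document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns) if loc.text]


print(sitemap_urls_from_robots(ROBOTS_TXT))
print(page_urls_from_sitemap(SITEMAP_XML))
```

In a real run the robots.txt and sitemap bodies would be fetched over HTTP, and each resulting page URL would then be rendered and uploaded.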

Single Page Capture

To capture a single page rather than a full domain:

pre.capture_page_and_upload("https://example.com")
