Tracking how long it takes for content published by news organisations to be available in Google search

guardian/google-search-indexing-observatory

Google Search Indexing Observatory

How long does it take for content published by news organisations to be available in Google search?

This broadens Ophan's Google Search Index Checker to check for content published by many news organisations, not just the Guardian. We're trying to work out if the intermittent multi-hour delays we've seen for some Guardian articles to be available in Google Search are typical for other news organisations too, or if there's actually something particular to the Guardian that needs to be fixed.

It's an 'observatory' in the same way that the EFF SSL Observatory is - creating and collating observations of distant sites and processes that are visible to us but beyond our control.

Steps performed by the Observatory

  1. Fetch the Sitemap XML for a news site.
  2. Query the Google Custom Search Site Restricted JSON API to check whether the content listed in the sitemap is available in Google search. API consumption and cost 💰💰💰 for this can be monitored in the Google Cloud console.
  3. Store whether each article is available (or not) in an AWS DynamoDB table.
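
Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration, not the Observatory's actual code: the object and method names are hypothetical, the sitemap is assumed to use the standard sitemaps.org namespace, and the search is assumed to use the article URL itself as the query against the `siterestrict` endpoint with the usual `key`, `cx` and `q` parameters.

```scala
import java.io.ByteArrayInputStream
import java.net.{URI, URLEncoder}
import java.nio.charset.StandardCharsets.UTF_8
import javax.xml.parsers.DocumentBuilderFactory

// Hypothetical names throughout; a sketch under the assumptions stated above.
object ObservatorySketch {
  // Namespace used by standard sitemap <urlset>/<url>/<loc> elements.
  val SitemapNamespace = "http://www.sitemaps.org/schemas/sitemap/0.9"

  /** Step 1: extract the article URLs (<loc> elements) from sitemap XML text. */
  def sitemapUrls(xml: String): List[URI] = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    // Sitemaps come from third-party sites, so refuse DOCTYPE declarations.
    factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true)
    val doc = factory.newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes(UTF_8)))
    // Matching on the sitemap namespace skips extension tags such as <image:loc>.
    val locs = doc.getElementsByTagNameNS(SitemapNamespace, "loc")
    (0 until locs.getLength)
      .map(i => URI.create(locs.item(i).getTextContent.trim))
      .toList
  }

  /** Step 2: build the Site Restricted JSON API request for one article.
    * Assumption: the article URL is used verbatim as the search query. */
  def searchRequest(article: URI, apiKey: String, engineId: String): URI = {
    def enc(s: String): String = URLEncoder.encode(s, UTF_8)
    URI.create(
      "https://www.googleapis.com/customsearch/v1/siterestrict" +
        s"?key=${enc(apiKey)}&cx=${enc(engineId)}&q=${enc(article.toString)}"
    )
  }
}
```

Both functions are pure, so they can be exercised without network access or API credentials. Fetching the sitemap, sending the request and writing the result to DynamoDB (step 3) are left out of this sketch.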

Running the Checker locally

Pre-requisites

These mostly match the pre-requisites for running Ophan locally: Java 11 and sbt, and in particular ophan AWS credentials from Janus.

Running the Lambda locally

Execute this on the command line:

$ sbt run
