antonantsyferov/crwlr
CRWLR

Crawls ads.txt files from a given list of URLs, parses the data, and provides an HTTP endpoint to retrieve the collected dataset.

Run

Download Executable JAR

Run Instructions

Usage

  • http://localhost:<port>/publishers - get a list of supported publishers
    Example: GET http://localhost:8080/publishers
    [
        {
            "name": "www.cnn.com",
            "url": "http://www.cnn.com/ads.txt"
        },
        {
            "name": "www.gizmodo.com",
            "url": "http://www.gizmodo.com/ads.txt"
        },
        {
            "name": "www.nytimes.com",
            "url": "http://www.nytimes.com/ads.txt"
        },
        {
            "name": "www.bloomberg.com",
            "url": "https://www.bloomberg.com/ads.txt"
        },
        {
            "name": "wordpress.com",
            "url": "https://wordpress.com/ads.txt"
        }
    ]
    
  • http://localhost:<port>/publishers/<name> - get the parsed ads.txt dataset for the given publisher
    Example: GET http://localhost:8080/publishers/www.bloomberg.com
    [
        {
            "accountId": "8603",
            "domain": "advertising.com",
            "relationship": "RESELLER"
        },
        {
            "accountId": "8355",
            "domain": "appnexus.com",
            "relationship": "DIRECT"
        },
        {
            "accountId": "540158162",
            "authority": "6a698e2ec38604c6",
            "domain": "openx.com",
            "relationship": "DIRECT"
        }
    ]  
    

Key Capabilities & Known Issues

Features

  • Parses ads.txt according to the IAB specification, with a few extra conveniences:
    • Relationship values are matched case-insensitively ('direct'/'Direct').
    • Duplicate records are ignored.
  • Follows HTTP redirects when fetching content.
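To illustrate the normalization described above, here is a small shell sketch (not the project's actual Scala code) that upper-cases the relationship field and drops duplicate records; the file path and sample records are made up for the example, and the optional certification-authority field is omitted:

```shell
# Sample ads.txt records: domain, account id, relationship
cat > /tmp/sample-ads.txt <<'EOF'
appnexus.com, 8355, direct
appnexus.com, 8355, DIRECT
advertising.com, 8603, Reseller
EOF

# Trim fields, upper-case the relationship, keep the first copy of each record.
normalized=$(awk -F',' '
  /^#/ || NF < 3 { next }        # skip comments and malformed lines
  {
    for (i = 1; i <= 3; i++) gsub(/^ +| +$/, "", $i)
    $3 = toupper($3)             # "direct" == "Direct" == "DIRECT"
    rec = $1 "," $2 "," $3
    if (!(rec in seen)) { seen[rec] = 1; print rec }
  }' /tmp/sample-ads.txt)

echo "$normalized"
# prints:
# appnexus.com,8355,DIRECT
# advertising.com,8603,RESELLER
```

The two appnexus.com lines differ only in the case of the relationship, so after normalization they collapse into a single record.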

Unsupported

  • Cyclic redirect protection.
  • RFC 1123 domain validation.
  • Real-time dataset updates (restart the server to pick up the latest ads.txt changes).
  • Persistent storage (all data is kept in an in-memory H2 database).

Build

  • SBT is required.
  • Run sbt assembly from the project root.
  • Pick up the fat JAR at ./target/scala-2.12/ads-crawler-0.0.2.jar.