antonantsyferov/crwlr
CRWLR

Crawls ads.txt files from a given list of URLs, parses the data, and provides an HTTP endpoint to retrieve the collected dataset.

Run

Download Executable JAR

Run Instructions

Usage

  • http://localhost:<port>/publishers - get a list of supported publishers
    Example: GET http://localhost:8080/publishers
    [
        {
            "name": "www.cnn.com",
            "url": "http://www.cnn.com/ads.txt"
        },
        {
            "name": "www.gizmodo.com",
            "url": "http://www.gizmodo.com/ads.txt"
        },
        {
            "name": "www.nytimes.com",
            "url": "http://www.nytimes.com/ads.txt"
        },
        {
            "name": "www.bloomberg.com",
            "url": "https://www.bloomberg.com/ads.txt"
        },
        {
            "name": "wordpress.com",
            "url": "https://wordpress.com/ads.txt"
        }
    ]
    
  • http://localhost:<port>/publishers/<name> - get the parsed ads.txt dataset for the given publisher
    Example: GET http://localhost:8080/publishers/www.bloomberg.com
    [
        {
            "accountId": "8603",
            "domain": "advertising.com",
            "relationship": "RESELLER"
        },
        {
            "accountId": "8355",
            "domain": "appnexus.com",
            "relationship": "DIRECT"
        },
        {
            "accountId": "540158162",
            "authority": "6a698e2ec38604c6",
            "domain": "openx.com",
            "relationship": "DIRECT"
        }
    ]  
    

Key Capabilities & Known Issues

Features

  • Parses ads.txt according to the IAB specification, with a few extra conveniences:
    • Relationship values are matched case-insensitively ('direct'/'Direct').
    • Duplicate records are ignored.
  • Follows HTTP redirects when fetching content.
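To illustrate the normalization described above, here is a small shell sketch (not the project's actual Scala code) that upper-cases the relationship field and drops duplicate records; the file path and sample records are made up for the example, and the optional certification-authority field is omitted:

```shell
# Sample ads.txt records: domain, account id, relationship
cat > /tmp/sample-ads.txt <<'EOF'
appnexus.com, 8355, direct
appnexus.com, 8355, DIRECT
advertising.com, 8603, Reseller
EOF

# Trim fields, upper-case the relationship, keep the first copy of each record.
normalized=$(awk -F',' '
  /^#/ || NF < 3 { next }        # skip comments and malformed lines
  {
    for (i = 1; i <= 3; i++) gsub(/^ +| +$/, "", $i)
    $3 = toupper($3)             # "direct" == "Direct" == "DIRECT"
    rec = $1 "," $2 "," $3
    if (!(rec in seen)) { seen[rec] = 1; print rec }
  }' /tmp/sample-ads.txt)

echo "$normalized"
# prints:
# appnexus.com,8355,DIRECT
# advertising.com,8603,RESELLER
```

The two appnexus.com lines differ only in the case of the relationship, so after normalization they collapse into a single record.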

Unsupported

  • Cyclic redirect protection.
  • RFC 1123 domain validation.
  • Real-time dataset updates (restart the server to pick up the latest ads.txt changes).
  • Persistent storage (all data is kept in an in-memory H2 database).

Build

  • SBT is required.
  • Run sbt assembly from the project root.
  • Pick up the fat JAR at ./target/scala-2.12/ads-crawler-0.0.2.jar.