User Agent Configuration

The configuration of the user agent in StormCrawler serves two purposes:

  1. Identification of the crawler for webmasters
  2. Selection of rules from robots.txt

Crawler Identification

The politeness of a web crawler is not just a matter of how frequently it fetches pages from a site, but also of how it identifies itself to the sites it crawls. This is done by setting the HTTP User-Agent header, just as your web browser does.

The full user agent string is built by concatenating the following configuration elements:

  • http.agent.name: name of your crawler
  • http.agent.version: version of your crawler
  • http.agent.description: description of what it does
  • http.agent.url: a URL webmasters can visit to learn more about it
  • http.agent.email: an email address so that they can get in touch with you

StormCrawler used to provide default values for these, but since version 2.11 it no longer does: you are now required to provide the values yourself.

You can also specify the user agent string verbatim with the configuration http.agent, but you will still need to provide an http.agent.name for parsing robots.txt files.
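
As an illustration, a minimal sketch of these settings in your crawler's YAML configuration could look like the following; the crawler name, URL and email address below are placeholders, not defaults:

 http.agent.name: "mycrawler"
 http.agent.version: "1.0"
 http.agent.description: "research crawler for the example project"
 http.agent.url: "https://www.example.com/crawler"
 http.agent.email: "crawler@example.com"

 # alternatively, set the full user agent string verbatim
 # http.agent: "mycrawler/1.0 (https://www.example.com/crawler; crawler@example.com)"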

Robots Exclusion Protocol

Also known as the robots.txt protocol, it is formalised in RFC 9309. Part of what the robots directives do is define rules specifying which parts of a website (if any) are allowed to be crawled. The rules are grouped by User-Agent, with a * matching any agent not explicitly listed, e.g.

 User-Agent: *
 Disallow: *.gif$
 Disallow: /example/
 Allow: /publications/

In the example above the rules allow access to the URLs with the /publications/ path prefix, and restrict access to the URLs with the /example/ path prefix and to all URLs with a .gif suffix. The "*" character designates any character, including the otherwise-required forward slash, and the "$" anchors the pattern to the end of the URL.

The value of http.agent.name is what StormCrawler looks for in the robots.txt. It MUST contain only uppercase and lowercase letters ("a-z" and "A-Z"), underscores ("_"), and hyphens ("-").
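
For instance, assuming http.agent.name is set to the hypothetical value mycrawler, StormCrawler would apply the rules from a group addressed to that name (the /private/ path below is purely illustrative):

 User-Agent: mycrawler
 Disallow: /private/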

Unless you are running a well-known web crawler, it is unlikely that its agent name will be listed explicitly in the robots.txt (if it is, well, congratulations!). While you want the agent name value to reflect who your crawler is, you might want to follow the rules set for better-known crawlers. For instance, if you were a responsible AI company crawling the web to build a dataset to train LLMs, you would want to follow the rules set for Google-Extended (see list of Google crawlers) if any were found.

This is what the configuration http.robots.agents allows you to do. It takes a comma-separated string but can also take a list of values. By setting it alongside http.agent.name (which should also be its first value), you can broaden the matching of rules based on the purpose of your crawler as well as its identity.
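
For instance, still assuming the hypothetical agent name mycrawler, the setting could be expressed as a comma-separated string or, equivalently, as a list of values:

 http.robots.agents: "mycrawler,Google-Extended"

 # equivalently, as a list of values
 # http.robots.agents:
 #   - mycrawler
 #   - Google-Extended

If none of these names match a group in the robots.txt, the rules for * apply, as described above.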