
# Configuration


This document describes all configuration parameters that determine the behaviour of the crawler and all its components.


## Default configuration

The file crawler-default.yaml lists the configuration elements presented below and provides a default value for each of them. This file is loaded automatically by the subclasses of ConfigurableTopology and should not be modified. Instead, we recommend that you provide a custom configuration file when launching a topology (see below).

## Custom configuration

The custom configuration file is expected to be in YAML format and is passed as a command-line argument `-conf <path_to_config_file>` to the Java call of your Main class (which would normally be a subclass of ConfigurableTopology).

The values in the custom configuration file override the ones provided in crawler-default.yaml; the custom file does not need to contain all the keys.

You can use `-conf <path_to_config_file>` more than once on the command line, which allows you to split the configuration across several files, for instance between the generic configuration and the configuration of a specific resource.
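
As an illustration, a small custom configuration file could contain only a handful of overrides, with everything else falling back to crawler-default.yaml. The values below are arbitrary examples, not recommendations; the `config:` top-level key mirrors the structure of crawler-default.yaml:

```yaml
config:
  topology.workers: 1
  fetcher.threads.number: 50
  http.content.limit: 65536
```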

With Maven installed, you must first generate an uberjar:

```
mvn clean package
```

before submitting the topology using the storm command:

```
storm jar path/to/allmycode.jar org.me.MyCrawlTopology -conf my-crawler-conf.yaml -local
```

When deploying on a production Storm cluster, simply remove the -local parameter.

Passing a configuration file is mandatory. A sample configuration file can be found here.

See here for a more detailed explanation of the configuration of the user agent.
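
As a sketch, using the http.agent.* keys documented in the table below, the user agent could be configured as follows (the values are placeholders to adapt to your own organisation):

```yaml
config:
  http.agent.name: "MyCompanyBot"
  http.agent.version: "1.0"
  http.agent.description: "crawler built with StormCrawler"
  http.agent.url: "https://www.example.com/bot.html"
  http.agent.email: "crawler@example.com"
```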

## Configuration options

The following tables describe all available configuration options and their default values. If one of the keys is not present in your YAML file, the default value will be taken.

### Fetching and partitioning

Configuration for the Bolts handling the fetching and partitioning of data. Some keys also affect the Protocol implementations, although most of that configuration can be found in the Protocol section below. A short configuration sketch follows the table.

| key | default value | description |
| --- | --- | --- |
| fetcher.max.crawl.delay | 30 | The maximum value in seconds accepted for Crawl-delay directives in robots.txt files. If the crawl-delay exceeds this value, the behaviour depends on the value of fetcher.max.crawl.delay.force. |
| fetcher.max.crawl.delay.force | false | Configures the behaviour of the fetcher if the robots.txt crawl-delay exceeds fetcher.max.crawl.delay. If false, the tuple is emitted to the StatusStream as an ERROR; if true, the queue delay is set to fetcher.max.crawl.delay. |
| fetcher.max.queue.size | -1 | The maximum length of the queue used to store items to be fetched by the FetcherBolt. A setting of -1 sets the length to Integer.MAX_VALUE. |
| fetcher.max.throttle.sleep | -1 | The maximum amount of time to wait between fetches; if the time to wait exceeds this maximum, the item is sent to the back of the queue. Used in SimpleFetcherBolt. -1 disables it. |
| fetcher.max.urls.in.queues | -1 | Limits the number of URLs that can be stored in the fetch queues, including the URLs currently being fetched. -1 disables the limit. |
| fetcher.maxThreads.host/domain/ip | fetcher.threads.per.queue | Overrides the default value of fetcher.threads.per.queue for a specific host, domain or IP. This is very useful if you have domains/hosts/IPs that you want to crawl more intensively (e.g. because they account for a lot of the URLs emitted by your Spout). |
| fetcher.metrics.time.bucket.secs | 10 | Metrics events will be emitted to the system stream every this many seconds. These events can be read by registering a metrics consumer in the topology. |
| fetcher.queue.mode | byHost | Possible values are byHost, byDomain and byIP. This parameter determines how FetchQueues are grouped inside the FetcherBolt, which influences the overall thread count and things like crawl delays (see below). |
| fetcher.server.delay | 1 | The delay in seconds between fetches in the same queue if no Crawl-delay is defined for this URL in the page's robots.txt. Note: for multi-threaded queues neither this value nor the one from the robots.txt is honoured. See fetcher.server.min.delay. |
| fetcher.server.delay.force | false | Defines the behaviour of the fetcher when the crawl-delay in the robots.txt is smaller than the value configured in fetcher.server.delay. If false, the shorter crawl-delay from the robots.txt is used; if true, the longer configured delay is forced. |
| fetcher.server.min.delay | 0 | The delay between fetches in the same queue if the queue has more than one thread (fetcher.server.delay is used otherwise). The Crawl-delay declared in the robots.txt is ignored in this case and this value is taken. |
| fetcher.threads.number | 10 | The number of threads that fetch pages from all queues concurrently. These threads do the actual work of downloading pages. Increase this to get more throughput at the cost of higher network, CPU and memory utilisation. Tweak this value carefully while watching your system resources to find the value that works best for your hardware and network infrastructure. |
| fetcher.threads.per.queue | 1 | The default number of threads per queue. This can be overridden for specific hosts/domains/IPs, see fetcher.maxThreads.host/domain/ip. |
| fetcher.timeout.queue | -1 | The maximum time in seconds that an item can wait in the queue. -1 disables the timeout. |
| fetcherbolt.queue.debug.filepath | "" | The path to a debug log, e.g. /tmp/fetcher-dump-{port}. The content of the queues will be dumped to it. The port number must match the one used by the FetcherBolt instance. |
| http.agent.description | - | A description included in the User-Agent request header of requests issued by the crawler. |
| http.agent.email | - | An email address included in the User-Agent request header of requests issued by the crawler. |
| http.agent.name | - | A name included in the User-Agent request header of requests issued by the crawler. |
| http.agent.url | - | A URL included in the User-Agent request header of requests issued by the crawler (e.g. your company's homepage). |
| http.agent.version | - | A version included in the User-Agent request header of requests issued by the crawler. |
| http.basicauth.password | - | Password associated with http.basicauth.user for Basic Authentication. |
| http.basicauth.user | - | Username for the Basic Authentication implemented in the HTTPClient protocol. |
| http.content.limit | -1 | The maximum number of bytes for returned HTTP response bodies. By default no limit is applied. In the generated archetype a limit of 65536 is set. |
| http.protocol.implementation | com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol | The Protocol implementation for plain HTTP. |
| http.proxy.host | - | An HTTP proxy server to be used for all requests made by the crawler. |
| http.proxy.pass | - | Password for HTTP proxy basic authentication. |
| http.proxy.port | 8080 | The port of your HTTP proxy server. |
| http.proxy.user | - | Username for HTTP proxy basic authentication. |
| http.robots.403.allow | true | Defines what happens when a request for robots.txt is answered with HTTP 403 (Forbidden). If true, the crawler will crawl all pages of the domain; if false, the crawler will not fetch any pages of this domain. |
| http.robots.agents | '' | Comma-separated additional user-agent strings used for the interpretation of robots.txt. If left empty (default), robots.txt is interpreted with the value of http.agent.name. |
| http.robots.file.skip | false | 1.17 and later, replaces http.skip.robots. Ignore robots.txt rules (not recommended). |
| http.skip.robots | false | 1.16 and earlier, replaced by http.robots.file.skip. Ignore robots.txt rules (not recommended). |
| http.store.headers | false | |
| http.store.responsetime | true | Not yet implemented. Whether or not to store the response time in the Metadata. |
| http.timeout | 10000 | The connection timeout in milliseconds. Tuples that run into this timeout are emitted with the status ERROR on the StatusStream. |
| http.use.cookies | false | Use cookies from the response in requests sent to direct child links. |
| https.protocol.implementation | com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol | The Protocol implementation for HTTP over SSL. |
| partition.url.mode | byHost | Possible values are byHost, byDomain and byIP. Defines how URLs are partitioned and thereby routed to the FetcherBolt instances. For example, byIP causes all tuples whose URL is served by the same IP address to be fetched (for the lifetime of your topology) by the same Storm task. This partitioning is important because it makes things like caching the robots.txt file of a specific domain very efficient. The value specified here is used for Storm's field grouping. |
| protocols | http,https | The protocols to support. Each of them has a corresponding `<proto>.protocol.implementation` directive. Don't touch this unless you are implementing additional protocols. |
| redirections.allowed | true | Whether URL redirects are allowed. If set to true, the crawler emits the target URL on the StatusStream with the status DISCOVERED. |
| sitemap.discovery | false | Enable automatic discovery of sitemaps. |
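
As a sketch, a custom configuration overriding some of the fetch settings above could look like the following; the host name and values are purely illustrative:

```yaml
config:
  fetcher.queue.mode: "byHost"
  partition.url.mode: "byHost"
  fetcher.threads.number: 50
  fetcher.threads.per.queue: 1
  fetcher.server.delay: 2.0
  # hypothetical host we want to crawl more intensively
  fetcher.maxThreads.example.com: 5
```

Note that as soon as a queue has more than one thread, fetcher.server.min.delay applies instead of fetcher.server.delay.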

### Protocol

Configuration for the Protocol implementations. Note that some of the configuration for these modules is shared with the Fetching and partitioning section above. An example sketch follows the table.

| key | default value | description |
| --- | --- | --- |
| cacheConfigParamName | maximumSize=10000,expireAfterWrite=6h | CacheBuilder configuration for the robots cache in RobotRulesParser. |
| errorcacheConfigParamName | maximumSize=10000,expireAfterWrite=1h | CacheBuilder configuration for the error cache in RobotRulesParser. |
| file.encoding | UTF-8 | The encoding of files read by FileProtocol. |
| http.custom.headers | - | Custom HTTP headers. |
| http.accept | - | HTTP Accept header to send with connections using HttpProtocol. |
| http.accept.language | - | HTTP Accept-Language header to send with connections using HttpProtocol. |
| http.content.partial.as.trimmed | false | If true, tells OkHttp to accept partially fetched content and mark it as trimmed content. Sets TrimmedContentReason to DISCONNECT. |
| http.trust.everything | true | If true, OkHttp trusts all SSL/TLS connections. |
| navigationfilters.config.file | - | A JSON configuration pointing to a class which extends NavigationFilter. See the dynamic content blog post for details. |
| selenium.addresses | - | A list of addresses of WebDriver servers. |
| selenium.capabilities | - | A map containing the desired WebDriver capabilities. JavaScript is always enabled. |
| selenium.delegated.protocol | - | A string pointing to an implementation of com.digitalpebble.stormcrawler.protocol. It is called by DelegatorRemoteDriverProtocol if the incoming URL does not have protocol.use.selenium in its metadata. This allows Selenium to be used for only a subset of the crawl. |
| selenium.implicitlyWait | 0 | The WebDriver timeout for the element location strategy when attempting to find elements. |
| selenium.instances.num | 1 | The number of instances to create per WebDriver connection (each item in selenium.addresses). |
| selenium.pageLoadTimeout | 0 | The WebDriver timeout for attempting a page navigation load. |
| selenium.setScriptTimeout | 0 | The WebDriver timeout for executing a WebDriver script. If NULL, there is no limit. |
| topology.message.timeout.secs | -1 | The number of seconds OkHttp will wait for a page to be fetched. |
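
For instance, a minimal sketch switching both protocols to the OkHttp-based implementation and setting a couple of request headers might look like this. It assumes the OkHttp implementation is available as com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol; adjust the class names and header values to whatever you actually use:

```yaml
config:
  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
  http.accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  http.accept.language: "en-us,en-gb,en;q=0.7,*;q=0.3"
  http.timeout: 20000
```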

### Indexing

The values below are used by subclasses of AbstractIndexerBolt (for example the StdOut and Elasticsearch indexers). These classes persist the outcome of your crawl and receive tuples enriched with Metadata containing all the information gathered by the previous Bolts. A configuration sketch follows the table.

| key | default value | description |
| --- | --- | --- |
| indexer.md.filter | - | A YAML list of key=value strings that lets you filter which records should be indexed based on the Metadata of a tuple. If specified, only tuples that match the given filter are indexed. This is used by the helper method AbstractIndexerBolt.filterDocument(Metadata); using this method is the responsibility of the implementing class. [Here is an example](https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/indexing/StdOutIndexer.java#L56). |
| indexer.md.mapping | - | A YAML list of key=value strings that lets you define a mapping between fields in the Metadata of a tuple and field names in your persistence layer. AbstractIndexerBolt provides a method named filterMetadata(Metadata) that subclasses should use inside their execute() method in order to apply this mapping to the Metadata object. Here is an example. |
| indexer.text.fieldname | - | The field name used to index the content of the HTML body. Its usage is again the responsibility of the class that extends AbstractIndexerBolt; the value can be accessed using the protected method fieldNameForText(). Here is an example. |
| indexer.url.fieldname | - | Same as indexer.text.fieldname, just for the URL field. |
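
A hedged sketch of an indexing configuration; the metadata keys, field names and filter value are purely illustrative:

```yaml
config:
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  # only index documents whose metadata contains this key/value pair (illustrative)
  indexer.md.filter: "collection=news"
  # map metadata keys to index field names (illustrative keys)
  indexer.md.mapping:
    - parse.title=title
    - parse.keywords=keywords
```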

### Status persistence

This refers to persisting the status of a URL (e.g. ERROR, DISCOVERED) along with information such as a nextFetchDate calculated by a Scheduler. A configuration sketch follows the table.

| key | default value | description |
| --- | --- | --- |
| fetchInterval.default | 1440 | In minutes, how to schedule re-visits of pages; one day by default. This is used by the DefaultScheduler. If you need customised scheduling logic, just implement your own Scheduler. Note: the Scheduler class is not yet configurable. See or update [this issue](https://github.com/DigitalPebble/storm-crawler/issues/104) if you need this behaviour; it should be quite easy to make the implementation class configurable. |
| fetchInterval.error | 44640 | In minutes, how often to re-visit pages with an error (HTTP 4XX or 5XX); every month by default. Identified by tuples on the [StatusStream](https://github.com/DigitalPebble/storm-crawler/wiki/StatusStream) with the status ERROR. |
| fetchInterval.fetch.error | 120 | In minutes, how often to re-visit pages with a fetch error; every two hours by default. Identified by tuples on the [StatusStream](https://github.com/DigitalPebble/storm-crawler/wiki/StatusStream) with the status FETCH_ERROR. |
| status.updater.cache.spec | maximumSize=10000, expireAfterAccess=1h | A cache specification string that defines the size and behaviour of the cache used by status.updater.use.cache. |
| status.updater.use.cache | true | Using this cache helps to avoid persisting the same URLs over and over again. The store() method of your implementation of AbstractStatusUpdaterBolt (example) is only called if a URL is not already in the cache. This is a simple but efficient improvement to avoid re-persisting e.g. the same internal links over and over again. |
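
A sketch of the scheduling and cache settings; the intervals below are arbitrary examples, not recommendations:

```yaml
config:
  # re-visit successfully fetched pages weekly
  fetchInterval.default: 10080
  # retry fetch errors after six hours
  fetchInterval.fetch.error: 360
  # re-visit pages in error once a month
  fetchInterval.error: 44640
  status.updater.use.cache: true
  status.updater.cache.spec: "maximumSize=250000,expireAfterAccess=2h"
```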

### Parsing

Configures the parsing of fetched content and the handling of discovered URLs. A configuration sketch follows the table.

| key | default value | description |
| --- | --- | --- |
| collections.file | collections.json | The name of the configuration file for the CollectionTagger. |
| collections.key | collections | Tags will be stored in the metadata with this key. If there is no collections key in the JSON configuration, the CollectionTagger reads the key from the main configuration. |
| feed.filter.hours.since.published | -1 | When a link is found by FeedParserBolt, discard it if its published time is older than this many hours. |
| feed.sniffContent | false | If the metadata doesn't already indicate that the page is a feed, tells FeedParserBolt to sniff the content type metadata and the first part of the file to see if it can detect a feed. This will only work if the server returns rss+xml or has `<rss` in the first few bytes of the content. |
| parsefilters.config.file | parsefilters.json | The JSON configuration file that defines your ParseFilters; its contents are described on the ParseFilters wiki page. Here is the default one. This influences the behaviour of JSoupParserBolt and SiteMapParserBolt. Note: if you want to specify your own file, give it a different name than parsefilters.json. For more information see here. |
| parser.emitOutlinks | true | Whether or not to emit outgoing links found in the parsed HTML document to the StatusStream as DISCOVERED. Your URL filters are applied to outgoing links before they are emitted. This option must be true if you are building a recursive crawler. |
| parser.emitOutlinks.max.per.page | -1 | Limits the number of links emitted per page. |
| textextractor.exclude.tags | "" | A list of HTML tags that should be ignored when the TextExtractor is searching for text. |
| textextractor.include.pattern | "" | A list of patterns for the TextExtractor to match text against. Only text matching the patterns will be returned. |
| textextractor.no.text | false | Enable to stop the TextExtractor from extracting any text at all. |
| track.anchors | true | Whether or not to add the anchor text (there can be more than one) of (filtered) outgoing links to the Metadata of a tuple under the key anchors. |
| urlfilters.config.file | urlfilters.json | The JSON configuration file that defines the URL filtering strategy. Here is the default implementation. Please also refer to URLFilters. Note: if you want to specify your own file, give it a different name than urlfilters.json. For more information see here. |
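
A short sketch of parsing-related settings; the include patterns and excluded tags below are illustrative values, not defaults:

```yaml
config:
  parser.emitOutlinks: true
  parser.emitOutlinks.max.per.page: 100
  track.anchors: true
  # illustrative patterns restricting where text is extracted from
  textextractor.include.pattern:
    - DIV[id="maincontent"]
    - ARTICLE
  # illustrative tags to skip during text extraction
  textextractor.exclude.tags:
    - STYLE
    - SCRIPT
```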

### Metadata

Options for how StormCrawler handles metadata tracking and minimises metadata clashes. A configuration sketch follows the table.

| key | default value | description |
| --- | --- | --- |
| metadata.persist | - | Which metadata to persist for a given document but not transfer to outlinks. The value is either a vector or a single-valued String. fetch.error.count is always added. |
| metadata.track.depth | true | Whether or not to track the depth of a crawled URL. This is a simple counter kept in the Metadata of outgoing links and incremented by 1 for every page that was crawled to reach a given link. It can be useful to let your Spout decide/sort which URLs to emit based on their depth, e.g. to influence a recursive crawl by preferring pages with a low depth count. |
| metadata.track.path | true | Whether or not to track the URL path of outgoing links (all URLs that the crawler followed to find this link) in the Metadata. The Metadata field name for this is url.path. It is a list of URLs representing the crawl path (how the crawler found this page). |
| metadata.transfer | - | Which metadata to transfer to the outlinks and persist for a given document. The value is either a vector or a single-valued String. |
| metadata.transfer.class | com.digitalpebble.stormcrawler.util.MetadataTransfer | The class to use for transferring metadata to outlinks. Must extend MetadataTransfer. |
| protocol.md.prefix | - | Prefix applied to all metadata received from the remote server so that internal metadata with the same name (such as the remote IP address) is not overwritten. Further discussion in issue 776. |
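
To close, a hedged sketch of the metadata options; the metadata keys and prefix are hypothetical and only illustrate the shape of the values:

```yaml
config:
  # hypothetical metadata keys, for illustration only
  metadata.transfer:
    - seed.category
  metadata.persist:
    - fetch.statusCode
  metadata.track.path: true
  metadata.track.depth: true
  # hypothetical prefix for metadata coming from the server
  protocol.md.prefix: "http."
```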