FetcherBolt(s)

Paul Armstrong edited this page Mar 17, 2021 · 2 revisions

There are two different bolts for fetching the content of URLs: FetcherBolt and SimpleFetcherBolt.

Both declare the same output:

    declarer.declare(new Fields("url", "content", "metadata"));
    declarer.declareStream(
            com.digitalpebble.storm.crawler.Constants.StatusStreamName,
            new Fields("url", "metadata", "status"));

The status stream is used for handling redirections, restrictions by robots directives and fetch errors, whereas the default stream passes the binary content returned by the server, along with the metadata, to the following components (typically a parsing bolt).

Both use the same protocol implementations, and both use URLFilters to control redirections.

The FetcherBolt has an internal set of queues in which incoming URLs are placed based on their hostname, domain or IP (see config fetcher.queue.mode), and a number of FetchingThreads (config fetcher.threads.number, 10 by default) which pull the URLs to fetch from the FetchQueues. When doing so, they make sure that a minimum amount of time (set with fetcher.server.delay, 1 sec by default) has passed since the previous URL was fetched from the same queue. This mechanism controls the rate at which requests are sent to the servers. A FetchQueue can also be used by more than one FetchingThread at a time, depending on the value of fetcher.threads.per.queue, in which case fetcher.server.min.delay is used.
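The per-queue delay described above can be illustrated with a minimal sketch. It is not the FetcherBolt's actual code: the class PolitenessSketch and its methods are hypothetical, it only groups URLs by hostname, and it measures the delay from the start of the previous fetch, which is a simplification.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of per-queue politeness; not StormCrawler's own classes. */
class PolitenessSketch {
    // Earliest time (in ms) at which each queue may be fetched from again.
    private final Map<String, Long> nextFetchTime = new HashMap<>();
    private final long delayMs;

    PolitenessSketch(long delayMs) {
        this.delayMs = delayMs;
    }

    /** Queue key for a URL. Assumption: this sketch only groups by hostname. */
    static String queueKey(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    /**
     * Returns the time at which the URL may be fetched and reserves the
     * queue until delayMs after that time.
     */
    long schedule(String url, long nowMs) {
        String key = queueKey(url);
        long earliest = nextFetchTime.getOrDefault(key, 0L);
        long start = Math.max(nowMs, earliest);
        nextFetchTime.put(key, start + delayMs);
        return start;
    }

    public static void main(String[] args) {
        PolitenessSketch queues = new PolitenessSketch(1000);
        System.out.println(queues.schedule("http://example.com/a", 0));   // 0
        System.out.println(queues.schedule("http://example.com/b", 100)); // 1000
        System.out.println(queues.schedule("http://example.org/", 100));  // 100
    }
}
```

The second URL on example.com has to wait for the delay to elapse, while the URL on example.org sits in a different queue and can be fetched straight away.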

Incoming tuples spend very little time in the execute method of the FetcherBolt, since they are simply placed in the FetchQueues; this is why the value of Execute latency in the Storm UI is low. The tuples get acked later on, after they have been fetched, so the metric to watch in the Storm UI is Process latency.
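The difference between the two latencies can be seen in a small sketch of the "enqueue in execute, ack after the fetch" pattern. Everything here is hypothetical: DeferredAckSketch stands in for the bolt, a thread pool stands in for the FetchingThreads, a short sleep stands in for the network call, and adding to a list stands in for the ack.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: execute() only queues work, the ack comes later. */
class DeferredAckSketch {
    private final ExecutorService fetchThreads = Executors.newFixedThreadPool(2);
    final List<String> acked = new CopyOnWriteArrayList<>();

    /** Returns almost immediately: the URL is only handed to the fetch threads. */
    void execute(String url) {
        fetchThreads.submit(() -> {
            simulateFetch();
            acked.add(url); // stand-in for the ack, done after the fetch
        });
    }

    private void simulateFetch() {
        try {
            Thread.sleep(50); // stand-in for network time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Waits for the queued fetches; returns true if they all completed. */
    boolean shutdownAndWait() {
        fetchThreads.shutdown();
        try {
            return fetchThreads.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        DeferredAckSketch bolt = new DeferredAckSketch();
        long start = System.nanoTime();
        bolt.execute("http://example.com/a");
        long executeMicros = (System.nanoTime() - start) / 1_000;
        bolt.shutdownAndWait();
        long processMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("execute: ~" + executeMicros + " us, acked after ~" + processMillis + " ms");
    }
}
```

The time spent in execute corresponds to what the Storm UI reports as Execute latency, while the time until the ack corresponds to Process latency.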

The SimpleFetcherBolt does not do any of this, hence its name: it has no internal queues or fetching threads. It just fetches incoming tuples in its execute method and does not do multi-threading. It still enforces politeness by checking when a URL can be fetched and waiting until that is the case. It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of SimpleFetcherBolt, often with the help of the URLPartitionerBolt. The fetching can also be throttled at the Spout level (config topology.max.spout.pending).
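To see why the distribution of URLs matters, here is a rough sketch of host-based partitioning. It is not how URLPartitionerBolt is implemented, and PartitionSketch and instanceFor are hypothetical names; it only illustrates the idea that every URL from a given host should reach the same fetcher instance, so that this instance alone can enforce politeness for that host.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical sketch of routing URLs to fetcher instances by hostname. */
class PartitionSketch {

    /** Index of the fetcher instance that should receive this URL. */
    static int instanceFor(String url, int numInstances) {
        String host = URI.create(url).getHost();
        // Fall back to the whole URL when no hostname can be extracted.
        String key = host == null ? url : host.toLowerCase(Locale.ROOT);
        // floorMod keeps the result in [0, numInstances) even for negative hashes.
        return Math.floorMod(key.hashCode(), numInstances);
    }

    public static void main(String[] args) {
        int instances = 4;
        String[] urls = {
            "http://example.com/a",
            "http://example.com/b?x=1",
            "http://example.org/"
        };
        for (String url : urls) {
            System.out.println(url + " -> instance " + instanceFor(url, instances));
        }
    }
}
```

Both example.com URLs map to the same instance regardless of their path or query string. In an actual topology this routing would normally be left to Storm's stream groupings rather than written by hand.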