
Proxies

Sam Ulrich edited this page Jul 9, 2021 · 3 revisions

StormCrawler's proxy system is built on the SCProxy class and the ProxyManager interface. Every proxy used in the system is represented as an SCProxy. ProxyManager implementations handle the management and delegation of their internal proxies. Each call to HTTPProtocol.getProtocolOutput() invokes ProxyManager.getProxy() to retrieve a proxy for that individual request. The ProxyManager interface can be implemented in a custom class to provide custom logic for proxy management and load balancing.
The default ProxyManager implementation is SingleProxyManager, which ensures backwards compatibility with prior StormCrawler releases. To use MultiProxyManager or a custom implementation, pass its fully qualified class name via the config parameter http.proxy.manager:

http.proxy.manager: "com.digitalpebble.stormcrawler.proxy.MultiProxyManager"


StormCrawler ships with two ProxyManager implementations:

SingleProxyManager manages a single proxy, configured through the backwards-compatible proxy fields in the configuration:

http.proxy.host
http.proxy.port
http.proxy.type
http.proxy.user (optional)
http.proxy.pass (optional)

MultiProxyManager manages multiple proxies read from a TXT file. The file should contain one connection string per proxy, including the protocol and authentication (if needed). The file supports comment lines (starting with // or #) and empty lines. The file path is passed via the config field below. The TXT file must be available to all nodes participating in the topology.

http.proxy.file
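
The file format described above can be sketched as follows. This is a minimal illustration, not StormCrawler's own parser: it assumes connection strings of the form protocol://user:pass@host:port with the credentials optional, and it leans on java.net.URI to split them. The names ProxyFileSketch and ProxyEntry, and the example hosts, are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not StormCrawler's SCProxy or its parser.
class ProxyFileSketch {

    // Minimal holder for the parts of one connection string.
    static final class ProxyEntry {
        final String protocol;
        final String host;
        final int port;
        final String user; // null when the line carries no credentials
        final String pass; // null when the line carries no password

        ProxyEntry(String protocol, String host, int port, String user, String pass) {
            this.protocol = protocol;
            this.host = host;
            this.port = port;
            this.user = user;
            this.pass = pass;
        }
    }

    // Parses the lines of a proxy file, skipping empty lines and
    // comment lines that start with "#" or "//".
    static List<ProxyEntry> parse(List<String> lines) {
        List<ProxyEntry> proxies = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("//")) {
                continue;
            }
            // Assumed format: protocol://[user:pass@]host:port
            URI uri = URI.create(line);
            String user = null;
            String pass = null;
            String userInfo = uri.getUserInfo();
            if (userInfo != null) {
                int sep = userInfo.indexOf(':');
                user = sep < 0 ? userInfo : userInfo.substring(0, sep);
                pass = sep < 0 ? null : userInfo.substring(sep + 1);
            }
            proxies.add(new ProxyEntry(uri.getScheme(), uri.getHost(), uri.getPort(), user, pass));
        }
        return proxies;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# unauthenticated HTTP proxy",
                "http://proxy1.example.com:8080",
                "",
                "// authenticated SOCKS proxy",
                "socks5://alice:secret@proxy2.example.com:1080");
        for (ProxyEntry p : parse(lines)) {
            System.out.println(p.protocol + " " + p.host + ":" + p.port
                    + " authenticated=" + (p.user != null));
        }
    }
}
```

Which protocol names the real MultiProxyManager accepts is not stated here, so the socks5 scheme in the example is a guess and should be checked against the StormCrawler release in use.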

The MultiProxyManager load balances across proxies using one of the following schemes. The scheme is set in the config with http.proxy.rotation; the default value is ROUND_ROBIN.

  • ROUND_ROBIN

Cycles through the proxies in order, distributing load evenly across all of them.

  • RANDOM

Randomly selects proxies using the native Java random number generator. The RNG is seeded with the nanosecond time at instantiation.

  • LEAST_USED

Selects the proxy with the lowest usage count. The selection is performed lazily for speed, so it does not account for usage changes that occur while the selection is in progress. Without custom implementations, this should in theory behave the same as ROUND_ROBIN.
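
The three schemes can be compared in a small self-contained sketch. It is a hedged illustration of the selection logic only, not the MultiProxyManager source: RotationSketch, TrackedProxy, and the Rotation enum are hypothetical stand-ins, and the usage counter is assumed to increase by one each time a proxy is handed out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the rotation schemes; not StormCrawler's MultiProxyManager.
class RotationSketch {

    enum Rotation { ROUND_ROBIN, RANDOM, LEAST_USED }

    // Stand-in for a proxy that counts how often it has been handed out.
    static final class TrackedProxy {
        final String address;
        final AtomicInteger usage = new AtomicInteger();

        TrackedProxy(String address) {
            this.address = address;
        }
    }

    private final List<TrackedProxy> proxies;
    private final Rotation rotation;
    private final AtomicInteger cursor = new AtomicInteger();
    // Seeded with the nanosecond time at instantiation, as described for RANDOM.
    private final Random random = new Random(System.nanoTime());

    RotationSketch(List<TrackedProxy> proxies, Rotation rotation) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = new ArrayList<>(proxies);
        this.rotation = rotation;
    }

    TrackedProxy next() {
        TrackedProxy chosen;
        switch (rotation) {
            case RANDOM:
                chosen = proxies.get(random.nextInt(proxies.size()));
                break;
            case LEAST_USED:
                // Plain scan without locking: counters may change while scanning,
                // which mirrors the "lazy" behaviour described above.
                chosen = proxies.get(0);
                for (TrackedProxy p : proxies) {
                    if (p.usage.get() < chosen.usage.get()) {
                        chosen = p;
                    }
                }
                break;
            case ROUND_ROBIN:
            default:
                // floorMod keeps the index valid even if the counter overflows.
                chosen = proxies.get(Math.floorMod(cursor.getAndIncrement(), proxies.size()));
                break;
        }
        chosen.usage.incrementAndGet();
        return chosen;
    }

    public static void main(String[] args) {
        List<TrackedProxy> pool = Arrays.asList(
                new TrackedProxy("proxy-a"), new TrackedProxy("proxy-b"), new TrackedProxy("proxy-c"));
        RotationSketch manager = new RotationSketch(pool, Rotation.ROUND_ROBIN);
        for (int i = 0; i < 4; i++) {
            System.out.println(manager.next().address);
        }
    }
}
```

In this sketch LEAST_USED breaks ties by list order, so when every counter starts at zero and nothing else changes the counters, it visits the proxies in the same order as ROUND_ROBIN.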

The SCProxy class contains all of the information associated with a proxy connection. In addition, it tracks the total usage of the proxy and optionally tracks the location of the proxy IP. Usage information is used by the LEAST_USED load balancing scheme. The location information is currently unused, but is retained so that custom implementations can select proxies by location.
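
As a rough idea of what location-aware selection in a custom manager might look like, the sketch below picks the least-used proxy in a requested country and falls back to the least-used proxy overall. Everything in it is hypothetical: GeoProxy is a stand-in rather than SCProxy, the country field and the select method are invented for illustration, and a real implementation would be written against the actual ProxyManager interface.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; GeoProxy is a stand-in and does not mirror SCProxy's fields.
class LocationSketch {

    static final class GeoProxy {
        final String address;
        final String country; // may be null when the location is unknown
        int usage;            // not thread-safe; kept simple for illustration

        GeoProxy(String address, String country) {
            this.address = address;
            this.country = country;
        }
    }

    // Returns the least-used proxy in the requested country, or the least-used
    // proxy overall when no proxy matches. Returns null for an empty list.
    static GeoProxy select(List<GeoProxy> proxies, String country) {
        GeoProxy best = null;
        for (GeoProxy p : proxies) {
            boolean matches = country.equalsIgnoreCase(p.country);
            if (matches && (best == null || p.usage < best.usage)) {
                best = p;
            }
        }
        if (best == null) {
            for (GeoProxy p : proxies) {
                if (best == null || p.usage < best.usage) {
                    best = p;
                }
            }
        }
        if (best != null) {
            best.usage++;
        }
        return best;
    }

    public static void main(String[] args) {
        List<GeoProxy> pool = Arrays.asList(
                new GeoProxy("proxy-us-1", "US"),
                new GeoProxy("proxy-de-1", "DE"),
                new GeoProxy("proxy-de-2", "DE"));
        System.out.println(select(pool, "DE").address);
        System.out.println(select(pool, "DE").address);
        System.out.println(select(pool, "FR").address);
    }
}
```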