Using proxies

Using a proxy service is essential when scraping sites. Datacenter IPs are often blocked, so you're best off with a residential proxy service. See here for a list of proxy services.

Since proxy services require authentication, this is not as straightforward as passing a startup flag to Chrome. There are some helper libraries out there, but all of them have some kind of stealth issue (headers being removed or wrongly capitalized, DNS leaks, etc.). The best approach for stealth is to use 3proxy instead, as a layer between the browser and the proxy service.

Download the latest 3proxy from https://3proxy.ru/download/stable/ and use the following config file as a starter:

# Run in the background and record the PID for later signaling
daemon
pidfile /tmp/3proxy.pid
maxconn 2048
# Log requests so you can verify traffic goes through the parent proxy
log /tmp/3proxy.log
logformat "L%O %I %T"
# No local username/password; access is restricted by source IP instead
auth iponly
# Resolve hostnames at the parent proxy to avoid local DNS leaks
fakeresolve
allow * 127.0.0.1 * *
# Forward all traffic to the upstream proxy service
parent 1000 http IP PORT USER PASS
# Listen on localhost:23001; -a makes the proxy anonymous (no forwarding headers)
proxy -p23001 -i127.0.0.1 -a

Replace IP with the IP address of the proxy server, e.g. resolved with dig +short proxy.server.com (the lookup is done once, before writing the config file, to avoid DNS leaks at runtime). PORT is the proxy server port, and USER and PASS are the username and password. Start 3proxy with 3proxy /path/to/config-file.cfg; you can then launch your Puppeteer browser with the flag --proxy-server=localhost:23001.
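
Putting it together, here is a minimal Node sketch of the whole flow. It assumes 3proxy is on your PATH, and proxy.server.com, PORT, USER and PASS are placeholders for your proxy service's details, just as in the config above:

const dns = require('dns').promises;
const fs = require('fs');
const { spawn } = require('child_process');
const puppeteer = require('puppeteer');

(async () => {
  // Resolve the proxy hostname up front so the config file contains a raw IP
  // and no lookup for it happens at request time (proxy.server.com is a placeholder).
  const [ip] = await dns.resolve4('proxy.server.com');

  // Write the starter config from above with the resolved IP substituted in.
  // PORT, USER and PASS remain placeholders for your proxy service's credentials.
  const config = [
    'daemon',
    'pidfile /tmp/3proxy.pid',
    'maxconn 2048',
    'log /tmp/3proxy.log',
    'logformat "L%O %I %T"',
    'auth iponly',
    'fakeresolve',
    'allow * 127.0.0.1 * *',
    `parent 1000 http ${ip} PORT USER PASS`,
    'proxy -p23001 -i127.0.0.1 -a',
  ].join('\n');
  fs.writeFileSync('/tmp/3proxy.cfg', config);

  // Start 3proxy (the daemon directive makes it fork into the background).
  spawn('3proxy', ['/tmp/3proxy.cfg'], { stdio: 'ignore' });
  await new Promise((resolve) => setTimeout(resolve, 500)); // give it a moment to bind the port

  // Point Chrome at the local 3proxy listener.
  const browser = await puppeteer.launch({
    args: ['--proxy-server=localhost:23001'],
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();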

To watch your logfile, run tail -f /tmp/3proxy.log. To change your proxy settings (e.g. change your IP by changing a session identifier in the username), edit the config file, then send the SIGUSR1 signal to the 3proxy PID stored in /tmp/3proxy.pid; 3proxy reloads its configuration when it receives that signal.
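
If you want to script that rotation, here is a short sketch. It assumes the session is encoded in the proxy username as USER-session-<id>, which is a common but provider-specific convention; check your proxy service's documentation for the exact format:

const fs = require('fs');

// Swap in a fresh session identifier. The USER-session-<id> format is an
// assumption here and varies per proxy provider.
const sessionId = Math.random().toString(36).slice(2, 10);
const config = fs
  .readFileSync('/tmp/3proxy.cfg', 'utf8')
  .replace(/USER-session-\w+/, `USER-session-${sessionId}`);
fs.writeFileSync('/tmp/3proxy.cfg', config);

// Signal the running daemon; 3proxy reloads its config on SIGUSR1.
const pid = parseInt(fs.readFileSync('/tmp/3proxy.pid', 'utf8').trim(), 10);
process.kill(pid, 'SIGUSR1');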