Bypass bot detectors #166

Open
LVerneyPEReN opened this issue Oct 9, 2020 · 7 comments

LVerneyPEReN commented Oct 9, 2020

Hi,

Rakuten and Leboncoin have very strong bot detectors, which prevent us from automatically fetching their CGUs (at least from a regular OVH machine). See https://fr.shopping.rakuten.com/newhelp/conditions-generales/ or https://www.leboncoin.fr/dc/cgu. It is possible that #138 and enabling JS will help here, but I think this won't be enough.

Best,

EDIT: Same for RueDuCommerce (see https://www.rueducommerce.fr/info/mentions-legales/cgv) and FNAC (https://www.fnac.com/Help/cgv-fnac#bl=footer). They all use the same system, powered by DataDome.

THouriezPEReN added a commit to THouriezPEReN/CGUs that referenced this issue Oct 13, 2020
None of the three work (403)
(known issue OpenTermsArchive#166)

Ndpnt commented Oct 15, 2020

Hi,

I hope using a headless browser will fix this, so I suggest waiting for #138 to be implemented and then checking whether this issue persists.
Unless you have an idea that would be quicker to implement?

@LVerneyPEReN

Using a headless browser is not enough to fix this. You have to disguise it (with https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth, for instance), and even then you are still identified by your IP address if you connect from server infrastructure rather than a residential connection (DataDome, used on Leboncoin, does this for instance).


MattiSG commented Oct 16, 2020

As discussed with @LucasVerneyDGE and @TomHouriezDGE, this option will be needed for some sources, even after #138 is fixed. However, it also raises legal questions. @LucasVerneyDGE will investigate which entities might have the legal authority to bypass access control systems, and we will design the most appropriate software architecture (opt-in, opt-out, plugin) based on the legal assessment 🙂

@martinratinaud

Hi all,

Jumping back on this matter, as we encounter it more and more often.

One of the common issues we face is getting a 403 caused by a Web Application Firewall (WAF).

We already encountered 3 of them with

@LVerneyPEReN do you have any news?
I contacted Imperva and Cloudflare to get our bot whitelisted and am waiting for their answers.


MattiSG commented Apr 25, 2022

Legal analysis by PEReN was still pending on 08/03/2022.

Imperva and Cloudflare answers are still pending.

To help with prioritisation, issues are no longer listed in this repository. They are now labeled in each affected instance with dedicated tags (403, timeout…).
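
As an illustration only, the tagging described above could be sketched as follows. The function name `failure_label` and the `other` fallback are hypothetical and are not taken from the project's codebase; only the `403` and `timeout` tag names come from this thread.

```python
from typing import Optional


def failure_label(status: Optional[int], timed_out: bool = False) -> Optional[str]:
    """Map the outcome of a fetch attempt to a triage tag.

    `status` is the HTTP status code, or None if no response was received.
    Returns None when the fetch succeeded and no tag is needed.
    """
    if timed_out:
        return "timeout"  # no response within the allotted time
    if status == 403:
        return "403"  # typically a WAF or bot detector refusing the request
    if status is not None and 200 <= status < 300:
        return None  # success, nothing to tag
    return "other"  # hypothetical catch-all for any other failure


# Example: three fetch outcomes and the tag each one would receive.
outcomes = [(403, False), (None, True), (200, False)]
labels = [failure_label(status, timed_out) for status, timed_out in outcomes]
```

Keeping the classification in a pure function, separate from the fetching itself, would make it straightforward to count how many documents fall under each tag when prioritising.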


MattiSG commented Apr 24, 2023

@LVerneyPEReN did the PEReN finish its legal analysis? 🙂

On our side, I believe we never got a reply from either Imperva or Cloudflare (please correct me if I'm wrong @martinratinaud).

@martinratinaud

Indeed, we did not 😔
