
Fix expl_duck_duck_go.py once and for all #325

Merged
merged 2 commits into MechanicalSoup:master from duckduckgo on May 3, 2020

Conversation

hemberger
Contributor

  • Revert the incorrect selector change in Update expl_duck_duck_go.py #315
  • Add a custom user agent so that, when we submit the form, we get a results page instead of an error.

This reverts commit c3d11f0.

The original selector was correct.
With the default user agent, Duck Duck Go will not return search
results, and instead raise an error (maybe some anti-bot code?).
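As a sketch of what the fix does, here is a minimal stdlib-only equivalent of submitting the search form with a custom user agent (using `urllib` rather than MechanicalSoup's own API, and a hypothetical form payload; no network request is actually sent here):

```python
import urllib.request

# Build the form-submission request with a custom User-Agent header,
# mirroring the fix (the header value is the one added in this PR).
req = urllib.request.Request(
    "https://duckduckgo.com/html/",
    data=b"q=MechanicalSoup",  # hypothetical form payload
    headers={"User-Agent": "MechanicalSoup"},
)

# urllib normalizes header names to "Capitalized" form.
print(req.get_header("User-agent"))  # MechanicalSoup
print(req.get_method())              # POST (because data is set)
```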

Thanks to @dillonko for the tip!
@hemberger hemberger merged commit 424efe3 into MechanicalSoup:master May 3, 2020
@hemberger hemberger deleted the duckduckgo branch May 3, 2020 04:12
@dillonko
Contributor

dillonko commented May 3, 2020

@hemberger I am not sure your example runs; it will have to be updated some more.

@moy
Collaborator

moy commented May 3, 2020

I have mixed feelings about this. The user-agent detection seems rather clearly an anti-bot feature, and this is confirmed by the robots.txt: https://duckduckgo.com/robots.txt

Publishing an example showing how to work around this (admittedly weak) anti-bot protection won't actively harm duckduckgo, but to me this is giving a bad example for potential users.

@hemberger
Contributor Author

The only thing I see robots.txt disallowing here is the Internet Archiver robot. Is there something I'm missing? Would appreciate some more detail if you have it!

My two cents:

A person running MechanicalSoup is not necessarily a "robot". You could be running a script interactively, or working in ipython; these examples are not extraordinarily different from using a browser as a human. In this case, I don't think we're encouraging anti-bot exploitation, because we're not lying about what the user agent is. If we mimicked a Chrome user agent, then I would agree that we might be setting a bad example (though even that can be legitimate if we are, for example, testing a resource that we own).

All that said, I have no problem whatsoever with removing this example unconditionally. I just didn't want it to be wrong or broken.

@moy
Collaborator

moy commented May 3, 2020

robots.txt says:

```
User-agent: *
Disallow: /lite
Disallow: /html
```

and searching from the home page without JavaScript enabled redirects to /html (the page contains `<form action="/html"`).
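The quoted rules can be checked mechanically with the standard library's robots.txt parser. A small sketch, with the rules pasted in rather than fetched over the network:

```python
from urllib.robotparser import RobotFileParser

# The rules from https://duckduckgo.com/robots.txt quoted above.
rules = [
    "User-agent: *",
    "Disallow: /lite",
    "Disallow: /html",
]

parser = RobotFileParser()
parser.parse(rules)

# The "*" rules apply to any user agent, so /html is disallowed
# for MechanicalSoup too, while the home page itself is allowed.
print(parser.can_fetch("MechanicalSoup", "https://duckduckgo.com/html"))  # False
print(parser.can_fetch("MechanicalSoup", "https://duckduckgo.com/"))      # True
```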

Unfortunately, I didn't find a page with terms & conditions, or anything meant for humans, that would explicitly say what duckduckgo allows/disallows.

@hemberger
Contributor Author

Ah, yes, I see that now.

As a side note, I think it's interesting that DuckDuckGo now fully supports pure HTML browsing again. Back when we were trying to fix this script, we were getting an error trying to view the DuckDuckGo site with javascript disabled. My guess is they are reusing the implementation that was the default when you first wrote this example.

What confuses me now is why we still get the javascript-enabled landing page with the default Requests user agent:

'User-agent': 'python-requests/2.23.0 (MechanicalSoup/1.0.0-dev)'

vs.

'User-agent': 'MechanicalSoup'

All other request headers are identical.
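One way to confirm that the user agent is the only difference is to diff the two header sets directly. A sketch, where the headers other than `User-agent` are illustrative rather than copied from the actual requests:

```python
# Hypothetical header sets for the two requests; only User-agent changes.
default_headers = {
    "User-agent": "python-requests/2.23.0 (MechanicalSoup/1.0.0-dev)",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "*/*",
    "Connection": "keep-alive",
}
custom_headers = {**default_headers, "User-agent": "MechanicalSoup"}

# Keys whose values differ between the two requests.
changed = {k for k in default_headers if default_headers[k] != custom_headers[k]}
print(changed)  # {'User-agent'}
```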
