
Fix expl_duck_duck_go.py once and for all #325

Merged
merged 2 commits into MechanicalSoup:master from duckduckgo on May 3, 2020

Conversation

hemberger
Contributor

  • Revert the incorrect selector change in Update expl_duck_duck_go.py #315
  • Add a custom user agent so that, when we submit the form, we get a results page instead of an error.

This reverts commit c3d11f0.

The original selector was correct.
With the default user agent, Duck Duck Go will not return search
results, and instead raise an error (maybe some anti-bot code?).
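As a sketch of what the fix does, here is a minimal stdlib-only equivalent of submitting the search form with a custom user agent (using `urllib` rather than MechanicalSoup's own API, and a hypothetical form payload; no network request is actually sent here):

```python
import urllib.request

# Build the form-submission request with a custom User-Agent header,
# mirroring the fix (the header value is the one added in this PR).
req = urllib.request.Request(
    "https://duckduckgo.com/html/",
    data=b"q=MechanicalSoup",  # hypothetical form payload
    headers={"User-Agent": "MechanicalSoup"},
)

# urllib normalizes header names to "Capitalized" form.
print(req.get_header("User-agent"))  # MechanicalSoup
print(req.get_method())              # POST (because data is set)
```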

Thanks to @dillonko for the tip!
@hemberger hemberger merged commit 424efe3 into MechanicalSoup:master May 3, 2020
@hemberger hemberger deleted the duckduckgo branch May 3, 2020 04:12
@dillonko
Contributor

dillonko commented May 3, 2020

@hemberger I am not sure your example runs; it will have to be updated some more.

@moy
Collaborator

moy commented May 3, 2020

I have mixed feelings about this. The user-agent detection seems rather clearly an anti-bot feature, and this is confirmed by the robots.txt: https://duckduckgo.com/robots.txt

Publishing an example showing how to work around this (admittedly weak) anti-bot protection won't actively harm duckduckgo, but to me this is giving a bad example for potential users.

@hemberger
Contributor Author

The only thing I see robots.txt disallowing here is the Internet Archiver robot. Is there something I'm missing? Would appreciate some more detail if you have it!

My two cents:

A person running MechanicalSoup is not necessarily a "robot". You could be running a script interactively, or working in ipython; these examples are not extraordinarily different from using a browser as a human. In this case, I don't think we're encouraging anti-bot exploitation, because we're not lying about what the user agent is. If we mimicked a Chrome user agent, then I would agree that we might be setting a bad example (though even that can be legitimate if we are, for example, testing a resource that we own).

All that said, I have no problem whatsoever with removing this example unconditionally. I just didn't want it to be wrong or broken.

@moy
Collaborator

moy commented May 3, 2020

robots.txt says:

```
User-agent: *
Disallow: /lite
Disallow: /html
```

and searching from the home page without JavaScript enabled redirects to /html (the page contains `<form action="/html"`).
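The quoted rules can be checked mechanically with the standard library's robots.txt parser. A small sketch, with the rules pasted in rather than fetched over the network:

```python
from urllib.robotparser import RobotFileParser

# The rules from https://duckduckgo.com/robots.txt quoted above.
rules = [
    "User-agent: *",
    "Disallow: /lite",
    "Disallow: /html",
]

parser = RobotFileParser()
parser.parse(rules)

# The "*" rules apply to any user agent, so /html is disallowed
# for MechanicalSoup too, while the home page itself is allowed.
print(parser.can_fetch("MechanicalSoup", "https://duckduckgo.com/html"))  # False
print(parser.can_fetch("MechanicalSoup", "https://duckduckgo.com/"))      # True
```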

Unfortunately, I didn't find a page with terms & conditions, or anything meant for humans, that would explicitly say what duckduckgo allows/disallows.

@hemberger
Contributor Author

Ah, yes, I see that now.

As a side note, I think it's interesting that DuckDuckGo now fully supports pure HTML browsing again. Back when we were trying to fix this script, we were getting an error trying to view the DuckDuckGo site with javascript disabled. My guess is they are reusing the implementation that was the default when you first wrote this example.

What confuses me now is why we still get the javascript-enabled landing page with the default Requests user agent:

'User-agent': 'python-requests/2.23.0 (MechanicalSoup/1.0.0-dev)'

vs.

'User-agent': 'MechanicalSoup'

All other request headers are identical.
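One way to confirm that the user agent is the only difference is to diff the two header sets directly. A sketch, where the headers other than `User-agent` are illustrative rather than copied from the actual requests:

```python
# Hypothetical header sets for the two requests; only User-agent changes.
default_headers = {
    "User-agent": "python-requests/2.23.0 (MechanicalSoup/1.0.0-dev)",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "*/*",
    "Connection": "keep-alive",
}
custom_headers = {**default_headers, "User-agent": "MechanicalSoup"}

# Keys whose values differ between the two requests.
changed = {k for k in default_headers if default_headers[k] != custom_headers[k]}
print(changed)  # {'User-agent'}
```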
