
brotli compression not supported? #112

Open
cypherlou opened this issue May 17, 2019 · 5 comments

cypherlou commented May 17, 2019

Robots.fetch() silently fails when the Accept-Encoding header contains a compression algorithm (such as br) that the underlying HTTP stack does not support.
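
A minimal reproduction sketch, assuming reppy forwards extra keyword arguments from Robots.fetch() through to requests.get() (the URL and agent name are placeholders):

```python
from reppy.robots import Robots

# Hypothetical URL; assumes the server serves robots.txt brotli-encoded
# when 'br' is advertised. Extra kwargs to fetch() are assumed to be
# passed through to requests.get(), overriding the Accept-Encoding header.
robots = Robots.fetch(
    'https://example.com/robots.txt',
    headers={'Accept-Encoding': 'gzip, deflate, br'},
)

# Without brotli support in the HTTP stack, the compressed bytes are
# parsed as if they were plain text, so no rules are found and the site
# appears to have no robots.txt at all.
print(robots.allowed('https://example.com/private/', 'my-agent'))
```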


dlecocq commented May 21, 2019

Under the hood, it uses requests. Does any of this help? https://github.com/kennethreitz/requests/issues/4525

It seems like with brotli installed and the newest urllib3 and requests, it should work.

Still, I agree that it should not fail silently. I have the bandwidth to review PRs, but lack the power to merge them :-/
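
A quick check along those lines (a sketch, not reppy code): recent urllib3 adds 'br' to its content decoders when a brotli package is importable, so you can verify whether the linked fix applies in your environment:

```python
# With `pip install brotli` (or brotlipy) and a recent urllib3/requests,
# 'br' should appear here and brotli responses get decoded transparently.
import urllib3

print(urllib3.HTTPResponse.CONTENT_DECODERS)
# e.g. ['gzip', 'deflate', 'br'] with brotli installed,
#      ['gzip', 'deflate'] without it
```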

@quiddihub

@dlecocq, I will give that a go - many thanks for the suggestion.

I worked around the problem by removing br as an accepted encoding, but thought I would mention it because this behaviour meant our site tests suggested there was no robots.txt on servers that used that encoding, while servers that elected to use something else returned successfully.
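
For reference, the workaround amounts to not advertising br at all, assuming as above that Robots.fetch() forwards keyword arguments to requests.get():

```python
from reppy.robots import Robots

# Hypothetical URL; only advertise encodings that requests can decode
# without brotli installed, so the server never responds with 'br'.
robots = Robots.fetch(
    'https://example.com/robots.txt',
    headers={'Accept-Encoding': 'gzip, deflate'},
)
```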

Apart from repairing the problem, which may well manifest itself elsewhere and so is worth doing, I think it would help in this and other contexts to be able to provide robots.txt content as a text string or a file (ideally both), and to access that content once the module has it, regardless of which method was used to get it there (string, file or URL).

Are you saying that if I did a PR it wouldn't be accepted, or that you can't do it yourself? I ask because I have development resource that might be able to help, but I don't want to apply it if the PR is going to sit languishing.


dlecocq commented May 23, 2019

Yeah, it's a pretty terrible failure mode since it effectively ignores the presence of such robots.txt files. It looks like urllib3.HTTPResponse.CONTENT_DECODERS enumerates the supported decoders for us (https://github.com/urllib3/urllib3/blob/64e413f1b2fef86a150ae747f00aab0e2be8e59c/src/urllib3/response.py#L184), so we could reach down in there to always get the supported encodings.
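
Roughly this sketch, where fetch_robots() is a hypothetical wrapper rather than existing reppy API, and which again assumes fetch() forwards kwargs to requests.get():

```python
import urllib3
from reppy.robots import Robots

def fetch_robots(url):
    # Advertise exactly what urllib3 can decode; 'br' is only included
    # when brotli support is actually installed.
    supported = urllib3.HTTPResponse.CONTENT_DECODERS
    return Robots.fetch(url, headers={'Accept-Encoding': ', '.join(supported)})
```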

Re: robots.txt content as a string (or file), that seems reasonable to me. One mode of operation that's important to users of this library is keeping a compact representation of the parsed rules for a number of sites, dropping the raw content and the rules for other bots as soon as parsing is done. As long as that remains possible, I don't think you'll find any pushback from Moz.
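
If I remember the API right, parsing from a string is already there via Robots.parse (the file case would just be reading the file into a string first); a sketch, including the compact per-agent mode I mean:

```python
from reppy.robots import Robots

content = '''
User-agent: *
Disallow: /private
'''

# Parse from a string; the URL just identifies where the content
# notionally came from.
robots = Robots.parse('https://example.com/robots.txt', content)

# The compact mode: keep just one agent's rules and drop the full
# Robots object (and the raw content) once parsing is done.
agent = robots.agent('my-agent')
print(agent.allowed('/private'))  # False
```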

I was just saying that I can help out with the reviewing, but I'm no longer at Moz so can't press the 'merge' button for you. @lindseyreno is who'd probably end up doing that.

@lindseyreno

If @dlecocq reviews it, I'll merge it :)

@quiddihub

Thanks @dlecocq and @lindseyreno, really great to get such speedy feedback. I'll talk to one of the lead devs and see if we can get this looked at. They might not want to prioritise it, as there's a workaround for the br encoding failure, but we'll try our best.
