Reduce external search indexing of request list pages #8132

garethrees · 2024-02-15T09:56:34Z

The main things we want indexed are record pages themselves (info request pages, user pages, authority pages, etc).

Snippets of request content often appear on list pages, and create a whack-a-mole situation when unhappy users find that external search engines have indexed a list page (e.g. /body/foo?page=12) that contains a cached snippet of PII that we've removed from the request page itself.

We should stop indexing of:

Request list pages (/list/all, /list/successful, etc) with a page= query param
Similar requests page (/request/:url_title/similar)
Body pages with a page= query param (/body/:url_name?page=N)
User pages with a page= query param (/user/:url_name?page=N)
User "wall" page (/user/:url_name/wall)

We might be able to do this via robots.txt, or could set via the X-Robots-Tag header depending on the page number:

before_action :set_no_crawl_headers, if: -> { params[:page].to_i > 1 }
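
To make the page-number condition concrete, here is a minimal plain-Ruby sketch of the decision. The helper name `robots_tag_for` and the header value `noindex, nofollow` are assumptions for illustration, not code from Alaveteli.

```ruby
# Hypothetical helper, not taken from the Alaveteli codebase.
# Returns the X-Robots-Tag value for a request, or nil if the page may be indexed.
def robots_tag_for(params)
  # nil.to_i is 0, so a request without a page param counts as the first page.
  params[:page].to_i > 1 ? 'noindex, nofollow' : nil
end

# In a Rails controller this could be wired up roughly as follows
# (set_no_crawl_headers is the name used in the snippet above; its body is assumed):
#
#   before_action :set_no_crawl_headers, if: -> { params[:page].to_i > 1 }
#
#   def set_no_crawl_headers
#     response.headers['X-Robots-Tag'] = 'noindex, nofollow'
#   end

puts robots_tag_for({ page: '12' }).inspect
puts robots_tag_for({}).inspect
```

Note that with `> 1`, a URL carrying `page=1` would still be indexable, which is slightly narrower than "any page= query param".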


garethrees · 2024-02-19T15:33:43Z

🤔 Similar requests should already be disallowed for indexing https://github.com/mysociety/alaveteli/blob/0.44.0.0/public/robots.txt#L19

HelenWDTK · 2024-02-19T16:51:02Z

It's not /similar/, it's /similar?page=4&utm_campaign=alaveteli-experiments-87&utm_content=sidebar_similar_requests&utm_medium=link&utm_source=whatdotheyknow

garethrees · 2024-02-19T16:53:20Z

The * should include anything after */similar/* – I can see the issue though; should be */similar*

HelenWDTK · 2024-02-19T16:54:48Z

Only if there is a / after the similar. See Google's documentation (search for /fish/): https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
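
The difference between the two patterns mentioned in this thread can be checked with a small sketch of robots.txt wildcard matching as Google documents it: a rule is a prefix match, `*` matches any run of characters, and a trailing `$` anchors the end. The function name `robots_pattern_match?` and the slug `foo` are hypothetical, and real crawlers may differ in edge cases.

```ruby
# Hypothetical matcher approximating Google-style robots.txt path rules.
def robots_pattern_match?(pattern, path)
  anchored = pattern.end_with?('$')
  body = anchored ? pattern[0...-1] : pattern

  # Escape the literal pieces and let each "*" match any run of characters.
  source = body.split('*', -1).map { |part| Regexp.escape(part) }.join('.*')

  # Rules match from the start of the path; "$" also pins the end.
  Regexp.new("\\A#{source}#{anchored ? '\\z' : ''}").match?(path)
end

url = '/request/foo/similar?page=4'
puts robots_pattern_match?('*/similar/*', url) # no "/" after "similar", so no match
puts robots_pattern_match?('*/similar*', url)  # matches the query-string form too
```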

garethrees · 2024-04-30T16:57:42Z

Might as well cover the actions noted in #8216 as part of this since it seems pretty easy to do:

The annotation page (/request/SLUG/annotate)
The similar requests page (/request/SLUG/similar)
Any links that always require a sign-in (reply, report, status update, request ZIP download)

garethrees added the labels x:uk, improvement (Improves existing functionality: UI tweaks, refactoring, performance, etc) and reduce-admin (Reduce issues coming to us in the first place) Feb 15, 2024

garethrees added this to the Reduce Admin Burden milestone Feb 15, 2024

HelenWDTK mentioned this issue Apr 24, 2024

Stop search engines indexing contentless pages #8216

Closed

gbp linked a pull request Apr 30, 2024 that will close this issue

[#8132] Update actions and pages which set "noindex", "nofollow" crawler directives #8223

Open

garethrees assigned gbp May 17, 2024
