Reduce external search indexing of request list pages #8132

Open
5 tasks done
garethrees opened this issue Feb 15, 2024 · 5 comments · May be fixed by #8223
Labels
improvement Improves existing functionality (UI tweaks, refactoring, performance, etc) reduce-admin Reduce issues coming to us in the first place x:uk

Comments

@garethrees
Member

garethrees commented Feb 15, 2024

The main things we want indexed are record pages themselves (info request pages, user pages, authority pages, etc).

Snippets of request content often appear on list pages, and create a whack-a-mole situation when unhappy users find that external search engines have indexed a list page (e.g. /body/foo?page=12) that contains a cached snippet of PII that we've removed from the request page itself.

We should stop indexing of:

  • Request list pages (/list/all, /list/successful, etc) with a page= query param
  • Similar requests page (/request/:url_title/similar)
  • Body pages with a page= query param (/body/:url_name?page=N)
  • User pages with a page= query param (/user/:url_name?page=N)
  • User "wall" page (/user/:url_name/wall)

We might be able to do this via robots.txt, or we could set the X-Robots-Tag header depending on the page number:

before_action :set_no_crawl_headers, if: -> { params[:page].to_i > 1 }
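
As a rough illustration of what the callback above might decide, here is a minimal plain-Ruby sketch. The helper name `no_crawl_headers` and the `noindex` value are assumptions for illustration, not existing Alaveteli code; in a controller the same logic would presumably set `response.headers` instead of returning a hash.

```ruby
# Hypothetical helper (not part of Alaveteli): returns the extra response
# headers for a paginated list page.
# Assumption: 'noindex' is the directive we want; the issue does not name one.
def no_crawl_headers(params)
  # nil.to_i and non-numeric strings both give 0, so a missing or junk
  # page param is treated like the first page and stays indexable.
  page = params[:page].to_i
  page > 1 ? { 'X-Robots-Tag' => 'noindex' } : {}
end

no_crawl_headers({ page: '12' }) # header present: page 12 should not be indexed
no_crawl_headers({})             # empty hash: first page stays indexable
```

Keeping the first page indexable matches the `params[:page].to_i > 1` condition above, so the canonical list URL can still be crawled.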
@garethrees garethrees added x:uk improvement Improves existing functionality (UI tweaks, refactoring, performance, etc) reduce-admin Reduce issues coming to us in the first place labels Feb 15, 2024
@garethrees garethrees added this to the Reduce Admin Burden milestone Feb 15, 2024
@garethrees
Member Author

🤔 Similar requests should already be disallowed for indexing https://github.com/mysociety/alaveteli/blob/0.44.0.0/public/robots.txt#L19

@HelenWDTK
Contributor

It's not /similar/, it's /similar?page=4&utm_campaign=alaveteli-experiments-87&utm_content=sidebar_similar_requests&utm_medium=link&utm_source=whatdotheyknow

@garethrees
Member Author

The trailing * should match anything after */similar/*. I can see the issue though: the rule should be */similar*

@HelenWDTK
Contributor

Only if there is a / after "similar". See Google's robots.txt documentation (search for /fish/ in https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt)
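
To make the difference between the two patterns concrete, here is a small plain-Ruby sketch of Google-style robots.txt path matching: rules are prefix matches, `*` matches any run of characters, and a trailing `$` anchors the end of the URL. The method name `robots_pattern_match?` and the path `/request/foo/similar?page=4` are made up for illustration, and the sketch ignores rule precedence and other crawlers' behaviour.

```ruby
# Hypothetical matcher approximating Google's documented robots.txt rules.
# It only answers "does this single pattern match this path?".
def robots_pattern_match?(pattern, path)
  anchored = pattern.end_with?('$')
  body = anchored ? pattern[0...-1] : pattern

  # Escape regex metacharacters in the literal parts, then join them with
  # '.*' so each robots.txt '*' matches any sequence of characters.
  source = body.split('*', -1).map { |part| Regexp.escape(part) }.join('.*')

  # '\A' makes this a prefix match; '\z' is added only for a trailing '$'.
  suffix = anchored ? '\z' : ''
  Regexp.new('\A' + source + suffix).match?(path)
end

path = '/request/foo/similar?page=4'
robots_pattern_match?('*/similar/*', path) # => false (no "/" after "similar")
robots_pattern_match?('*/similar*', path)  # => true
```

Under these assumptions, the rule with the trailing slash misses the paginated URL, while `*/similar*` covers both `/similar/` and `/similar?page=N`.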

@garethrees
Member Author

garethrees commented Apr 30, 2024

Might as well cover the actions noted in #8216 as part of this since it seems pretty easy to do:

  • The annotation page (/request/SLUG/annotate)
  • The similar requests page (/request/SLUG/similar)
  • Any links that always require a sign-in (reply, report, status update, request ZIP download)
