New reasons prevent pages in a sitemap from being indexed by Google on site https://era.library.ualberta.ca/ #3289

Open
pgwillia opened this issue Nov 8, 2023 · 7 comments
@pgwillia
Member

pgwillia commented Nov 8, 2023

https://search.google.com/search-console/index?resource_id=https://era.library.ualberta.ca/&utm_source=wnc_20237597&utm_medium=gamma&utm_campaign=wnc_20237597&utm_content=msg_110624660&hl=en-CA

Image

Let @pgwillia know if you don't have access.

There was a major incident Sept 2nd that may be related TicketID=67376.

[Jeff] 2023-10-19

@pgwillia pgwillia changed the title New reasons prevent pages in a sitemap from being indexed on site https://era.library.ualberta.ca/ New reasons prevent pages in a sitemap from being indexed by Google on site https://era.library.ualberta.ca/ Nov 8, 2023
@jefferya
Contributor

jefferya commented Dec 15, 2023

Not found (404) Google Search Console error

Assuming these ERA items were deleted while their DOIs remain, I suspect this is related to the Oct 26, 2023 e-mail report of DOIs not being generated due to a Sidekiq issue. I'll check the timeline in the logs to determine whether recent deletes are removing DOIs as expected. I'll create a GitHub issue to track this.

Also, the download link /items/1b17b01c-4eda-4453-95c9-27c764c5b69d/download/aef85f2d-5623-4b6b-bb78-a98b8f02d1f7 continues to work even though ERA returns a 404 for the object link. Is the download link connected to Sidekiq, or is Sidekiq a spurious correlation?

Todo in 2024:

  • continue checking proxy & ERA logs (a basic grep for the ID didn't find any indication of a delete). Does this mean a deletion didn't occur and the 404 is the result of another issue?

2024-01-10: The suspicion above seems to be wrong: log analysis shows the 404 problem predates the Oct 2023 Sidekiq problem. For example:

I, [2022-01-11T13:45:40.973784 #3626]  INFO -- : [5899fcde-0f0a-493e-be18-e3d969061af0] Started GET "/items/1b17b01c-4eda-4453-95c9-27c764c5b69d" for xx.xx.xx.xx at 2022-01-11 13:45:40 -0700
I, [2022-01-11T13:45:40.982214 #3626]  INFO -- : [5899fcde-0f0a-493e-be18-e3d969061af0] Processing by ItemsController#show as HTML
I, [2022-01-11T13:45:40.982287 #3626]  INFO -- : [5899fcde-0f0a-493e-be18-e3d969061af0]   Parameters: {"subdomain"=>"era", "id"=>"1b17b01c-4eda-4453-95c9-27c764c5b69d"}
I, [2022-01-11T13:45:41.130287 #3626]  INFO -- : [5899fcde-0f0a-493e-be18-e3d969061af0]   Rendered items/show.html.erb within layouts/application (Duration: 132.7ms | Allocations: 10802)
I, [2022-01-11T13:45:41.130391 #3626]  INFO -- : [5899fcde-0f0a-493e-be18-e3d969061af0]   Rendered layout layouts/application.html.erb (Duration: 132.8ms | Allocations: 10860)
I, [2022-01-11T13:45:41.131464 #3626]  INFO -- : [5899fcde-0f0a-493e-be18-e3d969061af0]   Rendered public/404.html (Duration: 0.1ms | Allocations: 7)
I, [2022-01-11T13:45:41.131705 #3626]  INFO -- : [5899fcde-0f0a-493e-be18-e3d969061af0] Completed 404 Not Found in 149ms (Views: 0.5ms | ActiveRecord: 109.6ms | Allocations: 13685)
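
Rather than grepping for one ID at a time, the same log format can be scanned for every request that completed with a 404. The following is a minimal sketch, assuming tagged Rails log lines shaped like the excerpt above; the constant and method names (`TAGGED_LINE`, `completed_requests`, `not_found_paths`) are made up for illustration and are not part of Jupiter.

```ruby
# Request-ID tag (a UUID in square brackets) followed by the log message.
TAGGED_LINE = /\[(?<request_id>[0-9a-f-]{36})\]\s+(?<message>.*)/
# 'Started GET "/items/..."' opens a request.
STARTED = /\AStarted (?<verb>[A-Z]+) "(?<path>[^"]+)"/
# 'Completed 404 Not Found in ...' closes a request.
COMPLETED = /\ACompleted (?<status>\d{3})/

# Returns one { path:, status: } hash per completed request,
# pairing each "Completed" line with its "Started" line by request ID.
def completed_requests(lines)
  paths = {}
  results = []
  lines.each do |line|
    tagged = TAGGED_LINE.match(line)
    next unless tagged

    id = tagged[:request_id]
    message = tagged[:message]
    if (started = STARTED.match(message))
      paths[id] = started[:path]
    elsif (completed = COMPLETED.match(message))
      results << { path: paths.delete(id), status: completed[:status].to_i }
    end
  end
  results
end

# Paths of all requests that ended in a 404.
def not_found_paths(lines)
  completed_requests(lines).select { |r| r[:status] == 404 }.map { |r| r[:path] }
end
```

Something like `not_found_paths(File.foreach('production.log'))` would then list the affected paths; the file name here is a placeholder, not the actual log location.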

Summary:

From the bundle exec rails console (stg):

  • The model is returned via Item.find('cdec7295-46f1-4715-a924-09ca7cf7529f')
  • however validation fails: Item.find('cdec7295-46f1-4715-a924-09ca7cf7529f').valid? => false
  • Item.find('cdec7295-46f1-4715-a924-09ca7cf7529f').validate!
    /var/www/sites/jupiter/vendor/ruby/3.1.0/gems/activerecord-6.1.7.6/lib/active_record/validations.rb:80:in `raise_validation_error': Validation failed: Member of paths is missing collection with ID "19e31dff-2f25-47bd-aa47-2a22017a1ade" (ActiveRecord::RecordInvalid)
  • the collection doesn't exist: it is not findable via the Rails console, and the ERA logs show an HTTP 404 error from the beginning of the logs stored on logger.library.

No other Items appear invalid (run on staging and prod 2024-01-12)

# Print the ID of every Item that fails validation
Item.find_each do |item|
  puts item.id unless item.valid?
end

1b17b01c-4eda-4453-95c9-27c764c5b69d
cdec7295-46f1-4715-a924-09ca7cf7529f

2024-01-30: fixed

irb(main):006:0> i = Item.find('1b17b01c-4eda-4453-95c9-27c764c5b69d')
irb(main):007:0> i.member_of_paths
=> ["b9bce94a-c976-43b0-853d-58b48797b3d1/19e31dff-2f25-47bd-aa47-2a22017a1ade"]
irb(main):008:0> i.member_of_paths = ['b9bce94a-c976-43b0-853d-58b48797b3d1/45ef8830-3f0f-445f-b07a-4a15ce4ea2bb']
=> ["b9bce94a-c976-43b0-853d-58b48797b3d1/45ef8830-3f0f-445f-b07a-4a15ce4ea2bb"]
irb(main):009:0> i.validate!
=> true
irb(main):010:0> i.save!
=> true
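
The validation failure above comes down to a path that points at a collection that no longer exists. As a rough sketch of that check outside Rails, assuming each `member_of_paths` entry has the form `<community_id>/<collection_id>` (as the console output suggests) and that a list of existing collection IDs is available from elsewhere; `missing_collection_ids` is a hypothetical helper, not a Jupiter method:

```ruby
require 'set'

# Returns the collection IDs referenced by member_of_paths that are
# absent from existing_collection_ids.
# Assumption: each path looks like "<community_id>/<collection_id>".
def missing_collection_ids(member_of_paths, existing_collection_ids)
  known = existing_collection_ids.to_set
  member_of_paths
    .map { |path| path.split('/').last }
    .reject { |collection_id| known.include?(collection_id) }
end
```

An empty result means every referenced collection was found in the supplied list.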

@pgwillia
Member Author

Both those items still appear in search but link to a 404...

image
image

@jefferya
Contributor

jefferya commented Jan 11, 2024

Google Search Console "Duplicate without user-selected canonical" category analysis.

This category doesn't seem to impact end users; however, the cause and a possible improvement are described below.

One cause is that the "view" and "download" links lead to the same file, but no rel="canonical" is present to indicate which link Google Search should prefer when crawling. This triggers the Google Search Console issues in this category.

Image

Image
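
For non-HTML responses such as file views and downloads, Google accepts the canonical hint as an HTTP `Link` header instead of a `<link>` element. A minimal sketch of building that header value follows; which of the two URLs should be canonical is not decided in this thread, and `canonical_link_header` is a hypothetical helper name.

```ruby
# Builds the value of an HTTP Link header that declares canonical_url
# as the canonical location of the response body.
def canonical_link_header(canonical_url)
  %(<#{canonical_url}>; rel="canonical")
end
```

In a Rails controller this could be applied with something like `response.headers['Link'] = canonical_link_header(url)` on both the view and download actions, pointing at the same chosen URL.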

@jefferya
Contributor

jefferya commented Jan 31, 2024

When the sitemap filter is applied to the Google Search Console "Duplicate without user-selected canonical" category, three items appear where Google thinks the content is similar to another item in the sitemap. In the Google Search Console URL inspection, the "User-declared canonical" and "Google-selected canonical" for each appear very similar. E-mail sent to the erahelp team for advice.

Image

  1. Quantification of soil property and map unit variability (the latter one is missing a file attachment)
    https://era.library.ualberta.ca/items/5aa538f5-92f7-4461-8732-7fe773e4d4e4
    https://era.library.ualberta.ca/items/68266568-b7fa-434a-aade-a4f2cde50870

  2. Moscow Goes Hollywood: The Russian Television Industry in the Global Age (the latter seems to have an invalid UUID; missing the fourth '-')
    https://era.library.ualberta.ca/items/76a108e9-a4d8-41f5-ab84-5c62215f3676
    https://era.library.ualberta.ca/items/76a108e9-a4d8-41f5-ab845c62215f3676

  3. Costumes of the Pavley-Oukrainsky Ballet: A Material History Analysis
    https://era.library.ualberta.ca/items/95a56a75-21f9-49f6-b710-47d67f6737f2
    https://era.library.ualberta.ca/items/4768c522-4aa9-40b9-93fa-b16b87c83d4e
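
The malformed ID in item 2 can be caught mechanically. A small sketch, assuming item IDs are expected to follow the usual 8-4-4-4-12 hexadecimal UUID layout; `well_formed_uuid?` is a name chosen here for illustration:

```ruby
# Five hyphen-separated groups of hex digits: 8-4-4-4-12.
UUID_PATTERN = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/

# True when id has the standard UUID layout.
def well_formed_uuid?(id)
  UUID_PATTERN.match?(id)
end
```

Applied to the two URLs in item 2, the first ID passes and the second, which lacks the fourth hyphen, does not.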

Todo: is there a better way to find duplicates from the Jupiter/ERA backend?
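
One possible starting point, sketched here under the assumption that item IDs and titles can be exported as plain records, is to group items by a normalized title. The record shape (`{ id:, title: }`) and the method names are hypothetical, and matching titles only flags candidates for manual review; it does not prove the items are duplicates.

```ruby
# Lowercases the title and collapses punctuation and whitespace so that
# trivial differences do not hide a match. Non-ASCII letters are dropped
# by this simple rule, which is a known limitation of the sketch.
def normalize_title(title)
  title.downcase.gsub(/[^a-z0-9]+/, ' ').strip
end

# records: array of { id:, title: } hashes.
# Returns { normalized_title => [ids] } for titles shared by 2+ records.
def duplicate_titles(records)
  records
    .group_by { |record| normalize_title(record[:title]) }
    .select { |_title, group| group.size > 1 }
    .transform_values { |group| group.map { |record| record[:id] } }
end
```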

@jefferya
Contributor

jefferya commented Feb 1, 2024

Discovered - currently not indexed: "The page was found by Google, but not crawled yet. Typically, Google wanted to crawl the URL but this was expected to overload the site; therefore Google rescheduled the crawl. This is why the last crawl date is empty on the report."
https://support.google.com/webmasters/answer/7440203?hl=en&ref_topic=9456557&sjid=17221925575259517683-NC

Perhaps related to:

With file views and downloads:
Image

Without file views and downloads:
Image

@jefferya
Contributor

Soft 404: https://support.google.com/webmasters/answer/7440203#soft_404

Possible fix:

@jefferya
Contributor

jefferya commented Mar 6, 2024

Todo list 2024-03-06 (first round of changes to address the bulk of the errors; further scheduled rounds required)
Errors when filtered by sitemap

Error when no filter selected
