[UI] UI workers occasionally stops serving requests #724

Closed
7 tasks done
mreyescdl opened this issue Jul 19, 2021 · 6 comments

Comments

@mreyescdl
Contributor

mreyescdl commented Jul 19, 2021

Root cause of the unresponsiveness has not been found. Logs show activity prior to the problem, but nothing excessive. Librato does not indicate that the host was overloaded.

A simple restart of Puma fixes the issue.

Occurrences

  • June 16th at 08:38 [UI03]
  • July 14th at 20:48 [UI03]
  • August 20th at 15:27 [UI04]
  • August 26th at 10:42 [UI03]
  • September 21 at 10:30 [UI03] - ETD Harvest in progress
  • September 25 at 6:53 [UI04]
  • September 25 at 6:41 [UI03]
  • September 27 at 10:30 [UI04] - ETD Harvest in progress
  • Oct 9 at 11:01 [UI03]
  • Oct 15 at 02:18:50 [UI03]
  • 2021-10-19T10:29:56 [UI03] - ETD Harvest in progress
  • 2021-10-19T10:29:56 [UI04] - ETD Harvest in progress
  • 2021-10-21T10:18:25 [UI03] - ETD Harvest, UCB Harvest of a new collection
  • 2021-11-06T12:12:17 UI03, no crawls in progress
  • 2021-11-27T16:35 UI04
  • 2021-11-25T03:42 UI05

Possible fixes

No effect

  • Try to crash ui05 by querying open context while a crawl is in progress
  • Ask eschol to crawl ui05 to see if we can reproduce

Next Steps

  • Attempt puma/gem update
    • See puma pr 2613
  • Modify robots.txt to discourage crawling
  • Ask eschol team to harvest only one collection at a time
  • Confirm complete
  • /tmp space on ui boxes. Can we re-configure how puma is using tmp space?
    • Ashley will implement this change on 10/21
    • Systemd file: TMPDIR=...

Future Ideas

@mreyescdl mreyescdl added the UI label Jul 19, 2021
@mreyescdl mreyescdl self-assigned this Jul 19, 2021
@mreyescdl mreyescdl changed the title [UI] UI03 occasionally stops serving requests [UI] UI workers occasionally stops serving requests Aug 20, 2021
@terrywbrady
Contributor

UI04 on 8/20

I, [2021-08-20T15:27:15.408052 #17669]  INFO -- : [ae554367-238b-4fc4-b8db-9ae1d350122c] Completed 200 OK in 85ms (ActiveRecord: 3.6ms)
I, [2021-08-20T16:04:53.607467 #10262]  INFO -- : [afea795e-3a9a-4dfd-b3e5-b28f6fea0c68] Started GET "/" for 172.30.28.241 at 2021-08-20 16:04:53 -0700

No interesting journal data during this time. Restarted puma after a Nagios alert.
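If the two log lines above are the last entry before the stall and the first entry after the restart, they bracket roughly 38 minutes of silence. A minimal sketch for spotting similar windows in a Rails-style log follows. The function name `find_log_gaps` and the 600-second default threshold are assumptions for illustration, not something already in use here. Timestamps are treated as naive local time, so a DST change would skew a gap that spans it.

```shell
# find_log_gaps LOGFILE [MIN_GAP_SECONDS]
# Prints one line per silence of at least MIN_GAP_SECONDS (default 600,
# an arbitrary choice) between consecutive timestamped log entries.
find_log_gaps() {
  awk -v min_gap="${2:-600}" '
    # Day number of a civil date, counted from 1970-01-01.
    function day_number(y, m, d) {
      if (m <= 2) { y -= 1; m += 12 }
      return 365 * y + int(y / 4) - int(y / 100) + int(y / 400) + int((153 * (m - 3) + 2) / 5) + d - 719469
    }
    # Match the "[YYYY-MM-DDTHH:MM:SS" prefix of the logger timestamp.
    match($0, /\[[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/) {
      ts = substr($0, RSTART + 1, 19)
      days = day_number(substr(ts, 1, 4) + 0, substr(ts, 6, 2) + 0, substr(ts, 9, 2) + 0)
      secs = days * 86400 + substr(ts, 12, 2) * 3600 + substr(ts, 15, 2) * 60 + substr(ts, 18, 2)
      # Report the gap if it is at least the threshold.
      if (prev_ts != "" && secs - prev_secs >= min_gap)
        printf "%ds gap: %s -> %s\n", secs - prev_secs, prev_ts, ts
      prev_ts = ts
      prev_secs = secs
    }
  ' "$1"
}
```

Running it against the production log on each UI host after an alert would give a quick list of stall windows to compare with the Nagios timestamps.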

@terrywbrady terrywbrady added this to To be discussed in Weekly Operations Review Aug 20, 2021
@elopatin-uc3
Contributor

elopatin-uc3 commented Aug 23, 2021

Another alert appeared on 8/22 over the weekend. However, this time the issue cleared up on its own. Nagios noted a recovery roughly 20 minutes after the first alert.

Additional notes:

  • We're a few versions behind on Puma (5.3.2 currently; there is a version 5.4.0 available); no Dependabot alerts, but perhaps we should look into updating.
  • If/when this occurs again, we should look into the amount of traffic on the site. Was Dryad impacted at all?

@mreyescdl
Contributor Author

Single mode vs Clustered mode.
We are running in single mode with the default 5 threads max.
A lot to take in with this subject, but here is a nice synopsis:

But what does it all mean?
So, if you’ve been paying attention so far, you’ve realized that a scalable Ruby web application needs slow client protection in the form of request buffering, and slow response protection in the form of some kind of concurrency - either multithreading or multiprocess/forking (preferably both). That only leaves Puma in clustered mode and Phusion Passenger 5 as scalable solutions for Ruby applications on Heroku running MRI/C Ruby. If you’re running your own setup, Unicorn with nginx becomes a viable option.

Source: https://www.speedshop.co/2015/07/29/scaling-ruby-apps-to-1000-rpm.html
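For reference, Puma switches from single to clustered mode when it is started with a worker count above zero. The sketch below only assembles the relevant command-line flags. The worker count of 2 and the helper name `puma_cluster_args` are assumptions for illustration; the thread maximum of 5 mirrors our current default.

```shell
# Hypothetical values; tune them to the host's CPU count and memory.
PUMA_WORKERS=${PUMA_WORKERS:-2}
PUMA_MAX_THREADS=${PUMA_MAX_THREADS:-5}

# Print the flags that put Puma in clustered mode:
#   -w N        fork N worker processes (N > 0 enables clustered mode)
#   -t MIN:MAX  thread pool bounds for each worker
puma_cluster_args() {
  printf -- '-w %s -t 1:%s\n' "$PUMA_WORKERS" "$PUMA_MAX_THREADS"
}

# Usage (not run here): bundle exec puma $(puma_cluster_args)
```

With these values each host would run two processes of up to five threads each, so one wedged process would not necessarily take the whole UI offline.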

@mreyescdl
Contributor Author

mreyescdl commented Nov 12, 2021

Please run the following on the worker before restarting the service:

$ netstat -an | grep -E 'LISTEN|ESTABLISHED'

$ ps -efa

Check how many Puma threads there are and what state they are in:
$ htop -u dpr2 -p $(cat /dpr2/apps/ui/current/pid/puma.pid)
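Since `htop` is interactive and easy to skip during an outage, the same information could be saved to files before the restart. The following is a rough sketch, assuming a Linux host with procps `ps`. The function name `capture_puma_state` and the output file naming are hypothetical, and `ps -L` stands in for the `htop` thread view.

```shell
# capture_puma_state OUT_DIR PID_FILE
# Saves socket, process, and Puma thread state to timestamped files
# and prints the common file prefix.
capture_puma_state() {
  out_dir=$1
  pid_file=$2
  mkdir -p "$out_dir" || return 1
  prefix="$out_dir/puma-$(date +%Y%m%dT%H%M%S)"

  # Listening and established sockets (-a includes listening sockets).
  if command -v netstat >/dev/null 2>&1; then
    netstat -an 2>/dev/null | grep -E 'LISTEN|ESTABLISHED' > "$prefix.netstat.txt"
  fi

  if command -v ps >/dev/null 2>&1; then
    # Full process list.
    ps -efa > "$prefix.ps.txt" 2>/dev/null
    # One row per Puma thread, with its state and kernel wait channel.
    if [ -r "$pid_file" ]; then
      ps -L -o pid,lwp,stat,wchan:32,comm -p "$(cat "$pid_file")" > "$prefix.threads.txt" 2>/dev/null
    fi
  fi

  echo "$prefix"
}
```

For example, `capture_puma_state /tmp/puma-debug /dpr2/apps/ui/current/pid/puma.pid` would write the files under `/tmp/puma-debug`; that output directory is only a placeholder.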

@elopatin-uc3
Contributor

After discussion with Scott and Ryan, Dryad will be adding a retry for presigned URL requests:
datadryad/dryad-product-roadmap#1530
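As an illustration of the retry idea only (this is not Dryad's implementation, which is tracked in the linked issue), a retry with doubling delay can be sketched in shell. The function name `retry` and its argument order are assumptions.

```shell
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after each failed attempt.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Give up once the attempt budget is spent.
    [ "$attempt" -ge "$max" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

A call such as `retry 3 1 curl --fail --silent --max-time 10 "$PRESIGN_URL"` would try up to three times, waiting 1 and then 2 seconds between attempts; `PRESIGN_URL` is a placeholder variable.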

We will also request ALB logging, and Ryan will continue to log timeouts to a spreadsheet. We'll match those against any new logging that is enabled.

@terrywbrady
Contributor

@elopatin-uc3, I recommend that we split the problem Dryad is seeing into its own issue. It may have the same root cause, but the symptoms are different.


3 participants