[UI] UI workers occasionally stops serving requests #724

mreyescdl · 2021-07-19T21:45:45Z

The root cause of the unresponsiveness has not been found. Logs show activity prior to the problem, but nothing excessive. Librato does not show the host being over-resourced.

A simple restart of Puma fixes the issue.

Occurrences

June 16th at 08:38 [UI03]
July 14th at 20:48 [UI03]
August 20th at 15:27 [UI04]
August 26th at 10:42 [UI03]
September 21 at 10:30 [UI03]- ETD Harvest in progress
September 25 at 6:53 [UI04]
September 25 at 6:41 [UI03]
September 27 at 10:30 [UI04]- ETD Harvest in progress
Oct 9 at 11:01 [UI03]
Oct 15 at 02:18:50 [UI03]
2021-10-19T10:29:56 [UI03]- ETD Harvest in progress
2021-10-19T10:29:56 [UI04]- ETD Harvest in progress
2021-10-21T10:18:25 [UI03] - ETD Harvest, UCB Harvest of a new collection
2021-11-06T12:12:17 UI03, no crawls in progress
2021-11-27T16:35 UI04
2021-11-25T03:42 UI05

Possible fixes

No effect

Try to crash ui05 by querying open context while a crawl is in progress
Ask eschol to crawl ui05 to see if we can reproduce

Next Steps

Attempt puma/gem update
- See puma pr 2613
Modify robots.txt to discourage crawling
Ask eschol team to harvest only one collection at a time
Confirm complete
/tmp space on ui boxes. Can we re-configure how puma is using tmp space?
- Ashley will implement this change on 10/21
- Systemd file: TMPDIR=...

Future Ideas

Explore cluster mode for puma
Implement Rack::Attack to throttle crawls
- Add RackAttack for the Merritt UI #802
- Rack Attack starting implementation mrt-dashboard#107
Explore parameterized page sizes for the atom feed (also requested by UCB). Could this improve efficiency of a crawl?
Unpack prior log files to see if an eschol harvest or open context or other paginated command precedes the crash
More aggressive Nagios monitoring and restarts
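
As a rough illustration of the "monitoring and restarts" idea, the sketch below retries a health probe a few times and only signals a restart when every attempt fails. The function name `needs_restart`, the retry count, the delay, the URL, and the systemd unit name are all hypothetical placeholders, not values taken from the Merritt configuration.

```shell
# Return 0 (restart needed) when the probe command fails MAX_FAILS times in a row.
# Usage: needs_restart MAX_FAILS DELAY_SECONDS PROBE_COMMAND [ARGS...]
needs_restart() {
  max_fails="$1"
  delay="$2"
  shift 2
  fails=0
  while [ "$fails" -lt "$max_fails" ]; do
    if "$@"; then
      return 1                      # probe succeeded: the service is responding
    fi
    fails=$((fails + 1))
    # Pause between attempts, but not after the final failure.
    if [ "$fails" -lt "$max_fails" ]; then
      sleep "$delay"
    fi
  done
  return 0                          # every attempt failed
}

# Example wiring (URL and unit name are assumptions, adjust to the real host):
# needs_restart 3 10 curl -fsS --max-time 10 -o /dev/null "http://localhost:PORT/" \
#   && sudo systemctl restart puma.service
```

Requiring several consecutive failures is meant to avoid restarting Puma on a single slow response, such as one caused by a harvest in progress.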


terrywbrady · 2021-08-20T23:16:44Z

UI04 on 8/20

I, [2021-08-20T15:27:15.408052 #17669]  INFO -- : [ae554367-238b-4fc4-b8db-9ae1d350122c] Completed 200 OK in 85ms (ActiveRecord: 3.6ms)
I, [2021-08-20T16:04:53.607467 #10262]  INFO -- : [afea795e-3a9a-4dfd-b3e5-b28f6fea0c68] Started GET "/" for 172.30.28.241 at 2021-08-20 16:04:53 -0700

No interesting journal data during this time. Restarted puma after a Nagios alert.
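
The silence between the two log lines above is the visible signature of a hang, so older logs could be scanned for similar gaps. The following is a minimal sketch, assuming Rails-style lines that start with `I, [YYYY-MM-DDTHH:MM:SS.ffffff #pid]` and GNU `date` (for `date -d`). The function name `find_log_gaps` and the 600-second default threshold are illustrative choices, not part of the Merritt tooling.

```shell
# Print every pair of consecutive log timestamps that are more than
# THRESHOLD seconds apart. Reads log lines on stdin.
# Usage: find_log_gaps [THRESHOLD_SECONDS] < production.log
find_log_gaps() {
  threshold="${1:-600}"             # hypothetical default: 10 minutes
  prev_ts=""
  prev_epoch=""
  # Keep only lines with a leading timestamp and print "YYYY-MM-DD HH:MM:SS".
  sed -n 's/^[A-Z], \[\([0-9-]\{10\}\)T\([0-9:]\{8\}\).*/\1 \2/p' |
  while IFS= read -r ts; do
    # Convert to epoch seconds; TZ=UTC keeps the arithmetic free of DST jumps.
    epoch="$(TZ=UTC date -d "$ts" +%s)" || continue
    if [ -n "$prev_epoch" ] && [ $((epoch - prev_epoch)) -gt "$threshold" ]; then
      echo "gap of $((epoch - prev_epoch))s between $prev_ts and $ts"
    fi
    prev_ts="$ts"
    prev_epoch="$epoch"
  done
}
```

Fed the two UI04 lines above, this should report `gap of 2258s between 2021-08-20 15:27:15 and 2021-08-20 16:04:53`. A quiet period overnight would also show up as a gap, so hits still need to be checked against the Nagios alert times.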

elopatin-uc3 · 2021-08-23T18:20:23Z

Another alert appeared on 8/22 over the weekend. However, this time the issue cleared up on its own. Nagios noted a recovery roughly 20 minutes after the first alert.

Additional notes:

We're a few versions behind on Puma (5.3.2 currently; there is a version 5.4.0 available); no Dependabot alerts, but perhaps we should look into updating.
If/when this occurs again, we should look into the amount of traffic on the site. Was Dryad impacted at all?

mreyescdl · 2021-11-09T19:23:59Z

Single mode vs Clustered mode.
We are running in single mode with the default 5 threads max.
A lot to take in with this subject, but here is a nice synopsis:

But what does it all mean?
So, if you’ve been paying attention so far, you’ve realized that a scalable Ruby web application needs slow client protection in the form of request buffering, and slow response protection in the form of some kind of concurrency - either multithreading or multiprocess/forking (preferably both). That only leaves Puma in clustered mode and Phusion Passenger 5 as scalable solutions for Ruby applications on Heroku running MRI/C Ruby. If you’re running your own setup, Unicorn with nginx becomes a viable option.

Source: https://www.speedshop.co/2015/07/29/scaling-ruby-apps-to-1000-rpm.html

mreyescdl · 2021-11-12T22:41:33Z

Please run the following on the worker before restarting the service:

$ netstat -a | egrep -e 'LISTEN|ESTABLISHED'

$ ps -efa

Check how many Puma threads there are and what state they are in:
$ htop -u dpr2 -p $(cat /dpr2/apps/ui/current/pid/puma.pid)
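
Since these commands have to be run under time pressure, one option is to wrap them in a function that saves the output before the restart. This is only a sketch: `capture_puma_state` and the output directory layout are hypothetical, `netstat -an` is used because plain `netstat` omits listening sockets, and `ps -L` (Linux procps) stands in for `htop`, which is interactive and cannot easily be redirected to a file.

```shell
# Save socket, process, and thread state for a hung Puma into a new directory.
# Usage: capture_puma_state PIDFILE [OUTPUT_ROOT]
capture_puma_state() {
  pidfile="$1"
  outroot="${2:-/tmp}"              # hypothetical default location
  outdir="$outroot/puma-hang-$(date +%Y%m%dT%H%M%S)"
  mkdir -p "$outdir" || return 1

  # Listening and established sockets.
  { netstat -an 2>/dev/null | grep -E 'LISTEN|ESTABLISHED'; } > "$outdir/netstat.txt" || true

  # Full process list.
  ps -efa > "$outdir/ps.txt" 2>&1 || true

  # One row per thread of the Puma process, including its state column.
  if [ -r "$pidfile" ]; then
    ps -L -o pid,lwp,stat,pcpu,comm -p "$(cat "$pidfile")" > "$outdir/threads.txt" 2>&1 || true
  else
    echo "pid file not readable: $pidfile" > "$outdir/threads.txt"
  fi

  echo "$outdir"                    # report where the snapshot was written
}
```

On the worker this might be called as `capture_puma_state /dpr2/apps/ui/current/pid/puma.pid`, after which the printed directory can be attached to this issue.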

elopatin-uc3 · 2021-11-15T21:22:52Z

After discussion with Scott and Ryan, Dryad will be adding a retry for presigned URL requests:
datadryad/dryad-product-roadmap#1530

We will also request ALB logging, and Ryan will continue to log timeouts to a spreadsheet. We'll match those against any new logging that is enabled.

terrywbrady · 2021-11-15T22:36:31Z

@elopatin-uc3, I recommend that we split the issue Dryad is seeing out from this one. It may have the same root cause, but the symptoms are different.

mreyescdl added the UI label Jul 19, 2021

mreyescdl self-assigned this Jul 19, 2021

mreyescdl changed the title ~~[UI] UI03 occasionally stops serving requests~~ [UI] UI workers occasionally stops serving requests Aug 20, 2021

terrywbrady added this to To be discussed in Weekly Operations Review Aug 20, 2021

elopatin-uc3 moved this from To be discussed to Discussed in meeting in Weekly Operations Review Aug 23, 2021

terrywbrady mentioned this issue Aug 26, 2021

Merritt Access Server Performance #774

Closed

terrywbrady mentioned this issue Sep 27, 2021

disable file download in Merritt UI CDLUC3/mrt-dashboard#104

Merged

terrywbrady pinned this issue Oct 19, 2021

This was referenced Oct 21, 2021

update puma, add robots.txt CDLUC3/mrt-dashboard#106

Merged

[RELEASE] UI Release - Puma Update - 10/21/2021 #831

Closed

elopatin-uc3 removed this from Discussed in meeting in Weekly Operations Review Oct 21, 2021

elopatin-uc3 added this to In Progress: Support in Merritt Project Board Oct 21, 2021

elopatin-uc3 added this to the UI Bugs and Improvements milestone Oct 21, 2021

elopatin-uc3 mentioned this issue Nov 16, 2021

Dryad seeing timeouts when requesting presigned URLs from Merritt #857

Closed

terrywbrady mentioned this issue Nov 19, 2021

[RELEASE] Merritt UI: Clustered Mode 12/1 at 4pm #867

Closed

elopatin-uc3 moved this from In Progress: Support to Deployed/Done in Merritt Project Board Dec 8, 2021

elopatin-uc3 added Sprint 62 Sprint 63 labels Dec 9, 2021

elopatin-uc3 closed this as completed Dec 9, 2021

elopatin-uc3 removed this from Deployed/Done in Merritt Project Board Dec 9, 2021

terrywbrady unpinned this issue Dec 15, 2021
