feat: Engine.HealthCheck tests network connectivity #431

Merged: 2 commits, Mar 22, 2021

@losvedir losvedir commented Mar 18, 2021

Summary of changes

Asana Ticket: 💳 Have realtime_signs restart if HTTP requests are failing

This adds a network connectivity check to the once-per-minute Engine.Health run, to mitigate the issue we saw last week where the app "went silent" and stopped sending POSTs to the signs, to signs-ui, and to Splunk. Bouncing the app fixed that issue.

With this PR, the app makes a HEAD request to Google and, after 5 consecutive failures, calls System.restart().
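The "restart after 5 consecutive failures" rule can be sketched as a small pure function. This is only an illustration under assumptions: the module name `HealthCounterSketch`, the `record/2` function, and the `:ok | :error` result shape are hypothetical and not taken from the PR's Engine.Health code.

```elixir
defmodule HealthCounterSketch do
  @moduledoc "Hypothetical sketch; not the PR's Engine.Health implementation."

  # Assumed threshold, taken from the PR description.
  @max_failures 5

  # Takes the current consecutive-failure count and the result of one check.
  # Returns {new_count, action}; action is :restart once the limit is reached.
  @spec record(non_neg_integer(), :ok | :error) ::
          {non_neg_integer(), :continue | :restart}
  def record(_failures, :ok), do: {0, :continue}

  def record(failures, :error) when failures + 1 >= @max_failures,
    do: {0, :restart}

  def record(failures, :error), do: {failures + 1, :continue}
end

# One success followed by five failures in a row.
results = [:ok, :error, :error, :error, :error, :error]

{_count, reversed_actions} =
  Enum.reduce(results, {0, []}, fn result, {count, acc} ->
    {new_count, action} = HealthCounterSketch.record(count, result)
    {new_count, [action | acc]}
  end)

actions = Enum.reverse(reversed_actions)
IO.inspect(actions)
# → [:continue, :continue, :continue, :continue, :continue, :restart]
```

In a real health check the caller would presumably invoke the configured restart function (for example `System.restart/0`) when the action is `:restart`. A single success resets the count, which is what makes the failures "consecutive".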

I tested System.restart() on opstech3 and it works reliably, cleanly shutting down the app and unloading the code, and then starting it back up again. The log trail is maintained. Stopping and starting a service will prompt WinSW to clear the logs, but a restart from within the system does not.

I introduced Mox to aid in testing without having to reach out to the network as much.
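Mox builds test doubles from a behaviour (via `Mox.defmock/2` with the `for:` option), so the code under test has to receive its collaborator as a module. The sketch below shows that injection pattern with hand-written doubles, so it runs without the Mox dependency. All names here (`NetworkCheckSketch`, `check/0`, `HealthSketch`) are hypothetical; the PR's real behaviour and callback may look different.

```elixir
defmodule NetworkCheckSketch do
  # Hypothetical behaviour; the real callback name and return type may differ.
  @callback check() :: :ok | :error
end

defmodule NetworkCheckSketch.AlwaysUp do
  @behaviour NetworkCheckSketch

  @impl true
  def check, do: :ok
end

defmodule NetworkCheckSketch.AlwaysDown do
  @behaviour NetworkCheckSketch

  @impl true
  def check, do: :error
end

defmodule HealthSketch do
  # The implementation module is passed in, so tests never touch the network.
  @spec network_ok?(module()) :: boolean()
  def network_ok?(check_mod), do: check_mod.check() == :ok
end

IO.inspect(HealthSketch.network_ok?(NetworkCheckSketch.AlwaysUp))
IO.inspect(HealthSketch.network_ok?(NetworkCheckSketch.AlwaysDown))
```

With Mox, the two hand-written modules would be replaced by a single generated mock whose return value is set per test.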

I tested this on opstech3 against API dev-blue: I spun down the ECS instances to watch it restart, then spun them back up, and it carried on fine.

2021-03-18T11:41:06.206 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network success
2021-03-18T11:40:06.383 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network success
2021-03-18T11:39:06.200 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network success
2021-03-18T11:38:06.217 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network success
2021-03-18T11:37:06.311 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network success
2021-03-18T11:36:06.208 [warn]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network failure resp={:error, :closed}
2021-03-18T11:35:06.200 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network success
2021-03-18T11:34:06.213 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network success
2021-03-18T11:33:06.307 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network success
2021-03-18T11:32:06.212 [warn]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network failure resp={:error, :closed}
2021-03-18T11:31:06.202 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network success
2021-03-18T11:30:06.344 [warn]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network failure resp={:ok, 503, [{"Date", "Thu, 18 Mar 2021 15:30:06 GMT"}, {"Content-Type", "application/json; charset=utf-8"}, {"Content-Length", "192"}, {"Connection", "keep-alive"}, {"cache-control", "max-age=0, private, must-revalidate"}, {"server", "Cowboy"}, {"x-request-id", "Fm15egtI_qxoW4oAAAxC"}]}
2021-03-18T11:29:06.198 [warn]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network failure resp={:ok, 503, [{"Date", "Thu, 18 Mar 2021 15:29:06 GMT"}, {"Content-Type", "application/json; charset=utf-8"}, {"Content-Length", "192"}, {"Connection", "keep-alive"}, {"cache-control", "max-age=0, private, must-revalidate"}, {"server", "Cowboy"}, {"x-request-id", "Fm15bApOdWl94kQAAABj"}]}
2021-03-18T11:28:06.307 [warn]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network failure resp={:ok, 503, [{"Server", "awselb/2.0"}, {"Date", "Thu, 18 Mar 2021 15:28:06 GMT"}, {"Content-Type", "text/html"}, {"Content-Length", "162"}, {"Connection", "keep-alive"}]}
2021-03-18T11:27:05.683 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Starting realtime_signs version '1.0.0'
2021-03-18T11:26:51.786 [error] node=rtsd_gjd-heartbeat-restart@OPSTECH3 restarting_application
2021-03-18T11:26:51.786 [warn]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network failure resp={:ok, 503, [{"Server", "awselb/2.0"}, {"Date", "Thu, 18 Mar 2021 15:26:51 GMT"}, {"Content-Type", "text/html"}, {"Content-Length", "162"}, {"Connection", "keep-alive"}]}
2021-03-18T11:25:51.631 [warn]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network failure resp={:ok, 503, [{"Server", "awselb/2.0"}, {"Date", "Thu, 18 Mar 2021 15:25:51 GMT"}, {"Content-Type", "text/html"}, {"Content-Length", "162"}, {"Connection", "keep-alive"}]}
2021-03-18T11:24:51.645 [warn]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network failure resp={:ok, 503, [{"Server", "awselb/2.0"}, {"Date", "Thu, 18 Mar 2021 15:24:51 GMT"}, {"Content-Type", "text/html"}, {"Content-Length", "162"}, {"Connection", "keep-alive"}]}
2021-03-18T11:23:51.746 [warn]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network failure resp={:ok, 502, [{"Server", "awselb/2.0"}, {"Date", "Thu, 18 Mar 2021 15:23:51 GMT"}, {"Content-Type", "text/html"}, {"Content-Length", "122"}, {"Connection", "keep-alive"}]}
2021-03-18T11:22:51.644 [warn]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network failure resp={:error, :closed}
2021-03-18T11:21:51.634 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network success
2021-03-18T11:20:51.784 [info]  node=rtsd_gjd-heartbeat-restart@OPSTECH3 Elixir.Engine.NetworkCheck.Hackney check_network success

Reviewer Checklist

  • Meets ticket's acceptance criteria
  • Any new or changed functions have typespecs
  • Tests were added for any new functionality (don't just rely on Codecov)
  • This branch was deployed to the staging environment and is currently running with no unexpected increase in warnings, and no errors or crashes (compare on Splunk: staging vs. prod)

@losvedir losvedir requested review from a team, mkennedm and digitalcora and removed request for a team March 18, 2021 15:50

codecov bot commented Mar 18, 2021

Codecov Report

Merging #431 (f4b3808) into master (46e0e14) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #431   +/-   ##
=======================================
  Coverage   99.28%   99.29%           
=======================================
  Files          53       55    +2     
  Lines        1117     1131   +14     
=======================================
+ Hits         1109     1123   +14     
  Misses          8        8           

@digitalcora digitalcora left a comment

Glad to see my tech talk is already helping us test things 😄

lib/engine/health.ex (outdated, resolved)
lib/engine/health.ex (resolved)
lib/engine/network_check/api.ex (outdated, resolved)
lib/engine/network_check/hackney.ex (outdated, resolved)
lib/engine/network_check/hackney.ex (outdated, resolved)
lib/engine/network_check/hackney.ex (outdated, resolved)
test/engine/health_test.exs (outdated, resolved)
test/engine/health_test.exs (outdated, resolved)
mix.exs (outdated, resolved)
test/engine/network_check/hackney_test.exs (outdated, resolved)

@losvedir (Contributor, Author) commented

Okay, whew. Thanks for the fantastic review. I just pushed up a commit that I believe addresses all these comments. Please "resolve conversation" everywhere you agree, and let me know if there's anything else. Upon approval, I'll squash this commit back into the main one.

@digitalcora digitalcora left a comment

A few more quick comments! My only point of (mild) concern is the inets thing, the rest aren't that important.

network_check_mod = Keyword.get(opts, :network_check_mod, Engine.NetworkCheck.Hackney)

restart_fn =
  Keyword.get(opts, :restart_fn) || Application.get_env(:realtime_signs, :restart_fn)
Contributor

Some reason this one is different from the others?

Keyword.get(opts, :restart_fn, Application.get_env(:realtime_signs, :restart_fn))
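The two forms are close but not identical, which may or may not matter here. A small sketch with made-up values (the `default` function and both option lists are hypothetical) shows where they differ: they agree when the key is absent and disagree when the key is present with a `nil` value.

```elixir
# Hypothetical stand-in for the value read from the application environment.
default = fn -> :from_app_env end

# Key absent: both forms fall back to the default.
missing = []
or_missing = Keyword.get(missing, :restart_fn) || default
arg_missing = Keyword.get(missing, :restart_fn, default)

# Key present but nil: only the `||` form falls back.
explicit_nil = [restart_fn: nil]
or_nil = Keyword.get(explicit_nil, :restart_fn) || default
arg_nil = Keyword.get(explicit_nil, :restart_fn, default)

IO.inspect({or_missing == default, arg_missing == default})
IO.inspect({or_nil == default, arg_nil})
```

Note also that in the three-argument form the default expression is always evaluated, even when the key is present.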

{:reply, [], timer_ref}
@spec log_restart() :: :ok
def log_restart do
  Logger.error("log_restart")
Contributor

Come to think of it, do we need this here, since we already log the restarting_application message right before the only place we call the restart_fn? This could instead be a no-op function (removing the need to even define anything extra on this module; the default config value could be fn -> :ok end).

@losvedir losvedir Mar 19, 2021

Playing around with it, this is I think what I ran into before. You can't have a function literal in a config .exs file. So I have to put a "no-op" function somewhere so I can use the capture syntax. I couldn't think of a good arity-0 function in the standard library to specify instead (ideally returning :ok). Maybe I can issue a PR to Elixir Lang to add System.ok()....

I'll remove the logging since, as you say, it's superfluous, and I guess change its name to be restart_noop to maybe be a bit clearer? And I'll add a comment, too. It doesn't have to be here, but I figured I might as well put it in this module since that's the reason for its existence.
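A minimal sketch of the named no-op and its capture, assuming a hypothetical module `RestartSketch` (the real function lives wherever the PR put it). A remote capture names only a module, function, and arity, which is presumably why it is accepted in config where an anonymous function was not.

```elixir
defmodule RestartSketch do
  # Hypothetical stand-in for the no-op described above: does nothing, returns :ok.
  @spec restart_noop() :: :ok
  def restart_noop, do: :ok
end

# A remote capture is an "external" fun: it refers to module, name, and arity.
captured = &RestartSketch.restart_noop/0

# An anonymous function is a "local" fun, tied to the code that defined it.
literal = fn -> :ok end

IO.inspect(Function.info(captured, :type))
IO.inspect(Function.info(literal, :type))
IO.inspect(captured.())
```

In a config file this would be referenced as something like `restart_fn: &RestartSketch.restart_noop/0`, with the module name adjusted to the real one.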

mix.exs Outdated
@@ -30,6 +30,7 @@ defmodule RealtimeSigns.Mixfile do
  def application do
    [
      extra_applications: [
        :inets,
Contributor

This will start up inets also in non-test environments, I think?

Contributor Author

Yep, good catch. Moved it to test_helper.exs.
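One way a test helper can start inets for tests only is sketched below. The exact lines the PR added to test_helper.exs are not shown in this thread, so treat this as an assumption about the approach rather than the actual change.

```elixir
# Sketch of a test_helper.exs: start inets (and anything it depends on)
# before the test run, instead of listing it in extra_applications.
{:ok, _started} = Application.ensure_all_started(:inets)

ExUnit.start()
```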


defmodule MockGoodNetwork do
  def unquote(:do)(_data), do: {:proceed, [response: {200, 'OK'}]}
end
Contributor

We could define these inside HackneyTest, thus not putting them in the global namespace (and maybe having to deal with a collision later if we call something else MockGoodNetwork).
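As a rough sketch of that suggestion, nesting the helper inside the test module gives it a namespaced name. `HackneyTestSketch` and `good_module/0` are hypothetical names used only for this illustration, and `~c"OK"` is the same charlist as the `'OK'` in the original.

```elixir
defmodule HackneyTestSketch do
  # Nested definitions are namespaced under the enclosing module, so the full
  # name is HackneyTestSketch.MockGoodNetwork rather than a global MockGoodNetwork.
  defmodule MockGoodNetwork do
    # `do` is a reserved word, hence the unquote/1 form from the original.
    def unquote(:do)(_data), do: {:proceed, [response: {200, ~c"OK"}]}
  end

  # Inside the enclosing module the short name resolves to the nested module.
  def good_module, do: MockGoodNetwork
end

IO.inspect(HackneyTestSketch.good_module())
IO.inspect(apply(HackneyTestSketch.good_module(), :do, [nil]))
```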


{:ok, health_pid} =
  Engine.Health.start_link(
    name: :health_test3,
Contributor

We should be able to have this be the same value in every test, since tests within a given module are serialized (even with async).

Alternatively, by default the GenServer could be unnamed, and the app supervisor could explicitly give it one if needed — we've used this approach on other apps (see ApiWeb in the API, and specifically its spec for RequestTrack). Actually, is it needed?

Contributor Author

Nope, you're right. It doesn't need a name.
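The "unnamed by default" approach could look roughly like this. `UnnamedServerSketch` and `:health_sketch` are hypothetical; the point is only that `:name` is forwarded to `GenServer.start_link/3` when the caller supplies it, so tests can start several instances without inventing unique names.

```elixir
defmodule UnnamedServerSketch do
  use GenServer

  # Only forwards :name to GenServer when the caller supplied one.
  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, Keyword.take(opts, [:name]))
  end

  @impl true
  def init(:ok), do: {:ok, %{}}
end

# Two unnamed instances can coexist.
{:ok, first} = UnnamedServerSketch.start_link()
{:ok, second} = UnnamedServerSketch.start_link()
IO.inspect(first != second)

# A supervisor (or any caller) can still register a name when it needs one.
{:ok, named} = UnnamedServerSketch.start_link(name: :health_sketch)
IO.inspect(Process.whereis(:health_sketch) == named)
```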

@digitalcora digitalcora left a comment

🌩️ 🔁 ✅ 🎉

@losvedir losvedir merged commit 4a62885 into master Mar 22, 2021
@losvedir losvedir deleted the gjd-heartbeat-restart branch March 22, 2021 13:51

@mkennedm mkennedm left a comment

Looks. Good.
