
Fixed issue #643 and memory leak in hackney_pool #661

Merged: 2 commits into benoitc:master on Nov 20, 2020

Conversation

@SergeTupchiy (Contributor) commented Oct 13, 2020

Fixes: #643

This is similar to PR #656, but using hackney_manager:cancel_request/1 ensures that the request state is also erased from the process dictionary and deleted from the hackney_manager_refs ETS table.

Additionally, this PR includes a memory leak fix (a pending client was not removed in hackney_pool:dequeue) and some code cleanup.
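
As a rough sketch of the approach (the function and variable names below are hypothetical, not the actual hackney_connect code, and it is assumed that cancel_request/1 takes the request reference): on a connect or checkout error the caller cancels the whole request via hackney_manager:cancel_request/1 instead of casting checkout_cancel to the pool.

    %% Hypothetical sketch, not the actual hackney_connect code: on a connect
    %% or checkout error, cancel the whole request through hackney_manager so
    %% the pool checkout is released and the request state is removed from the
    %% process dictionary and the hackney_manager_refs ETS table.
    handle_connect_error(Reason, RequestRef) ->
        _ = hackney_manager:cancel_request(RequestRef),
        {error, Reason}.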

@@ -127,8 +127,7 @@ do_checkout(Requester, Host, _Port, Transport, #client{options=Opts,
     {error, Reason} ->
       {error, Reason};
     {'EXIT', {timeout, _}} ->
-      % socket will still checkout so to avoid deadlock we send in a cancellation
-      gen_server:cast(Pool, {checkout_cancel, Connection, RequestRef}),
+      %% checkout should be canceled by the caller via hackney_manager
@SergeTupchiy (Contributor, Author) commented on this change:

No need to cancel it here, as we can now rely on hackney_manager eventually canceling it.

%------------------------------------------------------------------------------
%% @private
%%------------------------------------------------------------------------------
del_from_queue(Connection, Ref, Queues) ->
@SergeTupchiy (Contributor, Author) commented on this change:

This is not needed anymore. Moreover, there is remove_pending, which can be used instead and is implemented more efficiently: it first checks the pending dict and, only if the ref is found, filters the queue.
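
Roughly, that remove_pending pattern looks like the sketch below (the record fields, the queue-entry shape, and the dict/queue data structures are assumptions for illustration, not the actual hackney_pool source):

    %% Illustrative sketch only; state fields and queue-entry shape are assumed.
    remove_pending(Ref, #state{pending = Pending, queues = Queues} = State) ->
        case dict:find(Ref, Pending) of
            {ok, Connection} ->
                %% only when the ref is actually pending do we pay the cost of
                %% filtering the per-connection queue
                Queues1 =
                    case dict:find(Connection, Queues) of
                        {ok, Queue} ->
                            Queue1 = queue:filter(
                                       fun({_From, R}) -> R =/= Ref end, Queue),
                            dict:store(Connection, Queue1, Queues);
                        error ->
                            Queues
                    end,
                State#state{pending = dict:erase(Ref, Pending), queues = Queues1};
            error ->
                %% not pending: nothing to remove and no queue traversal needed
                State
        end.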

@isaacsanders commented:

This is awesome, you have all but one test passing.

@SergeTupchiy (Contributor, Author) commented Oct 15, 2020

> This is awesome, you have all but one test passing.

Looks like the failed test (hackney_integration_tests:'-test_frees_manager_ets_when_body_is_in_client/0-fun-2-'/2) has non-deterministic behavior. At least two reasons for it are as follows:

    {ok, _unusedBody} = hackney:body(Client),
    AfterCount = ets:info(hackney_manager_refs, size),
    ...

I think it's worth fixing this test, but it's not directly related to the issue.
Anyway, I re-triggered CI and all tests passed. :)

@isaacsanders commented:

@benoitc @IceDragon200 I think we might have a solution here.

@IceDragon200 commented Oct 15, 2020

@isaacsanders @SergeTupchiy I guess the last test is to run a request loop and see if it still experiences checkout_timeout. I'll do that shortly and let you know how that goes.

EDIT:

iex(25)> results = for _ <- 1..200, do: :hackney.request(:get, "http://localhost:6573"); Enum.uniq(results)
[error: :econnrefused]

Looks like it works as intended, so 👍

I'll close my PR and reference this one

@ijunaid8989 commented:

Has anyone else faced this issue on this PR?

[%HTTPoison.Error{id: nil, reason: :checkout_failure}]

@SergeTupchiy (Contributor, Author) commented:

> Has anyone else faced this issue on this PR?
>
> [%HTTPoison.Error{id: nil, reason: :checkout_failure}]

@ijunaid8989
Could you please share some details to reproduce it?

@ijunaid8989 commented:

> Has anyone else faced this issue on this PR?
>
> [%HTTPoison.Error{id: nil, reason: :checkout_failure}]
>
> @ijunaid8989
> Could you please share some details to reproduce it?

I have been using this branch with the latest HTTPoison, and I am sending 500 POST requests per second to a third-party API. My code looks like this:

  def post(url, file_path, image) do
    HTTPoison.post(
      url,
      {:multipart, [{file_path, image, []}]},
      [],
      hackney: [pool: :seaweedfs_upload_pool, recv_timeout: 15_000]
    )
  end

and I have these pool settings

      :hackney_pool.child_spec(:snapshot_pool, [timeout: 50_000, max_connections: 10_000]),
      :hackney_pool.child_spec(:seaweedfs_upload_pool, [timeout: 50_000, max_connections: 10_000]),
      :hackney_pool.child_spec(:seaweedfs_download_pool, [timeout: 50_000, max_connections: 10_000]),

in my application tree.

@SergeTupchiy (Contributor, Author) commented:

@ijunaid8989
Are you using HTTPS?
It's hard to guess what exactly went wrong under your conditions, but I tried to make many concurrent HTTPS multipart requests using your pool settings and got checkout_failure, caused by:

exit, {timeout,
       {gen_server,call,
        [application_controller,
         {set_env,crypto,'$curves$',
          ...

My code snippet:

1> application:ensure_all_started(hackney).
{ok,[crypto,asn1,public_key,ssl,unicode_util_compat,idna,
     mimerl,certifi,syntax_tools,parse_trans,ssl_verify_fun,
     metrics,hackney]}
2> hackney_pool:start_pool(multipart, [{max_connections, 10_000}, {timeout, 50_000}]).
ok
3> [spawn(fun() -> case hackney:post("https://example.com", [], {multipart, [{file, <<"/tmp/img.jpg">>, <<"image">>, []}]}, [with_body, {pool, multipart}, {recv_timeout, 5000}]) of {error, Reason} -> io:format("error: ~p ~n", [Reason]); _ -> ok end end) || _ <- lists:seq(1,300)].

This is not introduced by this PR, as it's also reproducible on master.
If your reason is the same, I would recommend calling crypto:supports() before making any HTTPS requests. This will 'warm up' crypto by pre-populating its app config.
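
A minimal warm-up sketch, assuming the bottleneck really is the concurrent set_env calls hitting application_controller on the first TLS requests (warm_up_crypto/0 is just an illustrative name, not a hackney API):

    %% Call once at application start, before any HTTPS traffic: crypto
    %% populates its application environment on first use, so doing it here
    %% avoids many concurrent callers hammering application_controller.
    warm_up_crypto() ->
        {ok, _} = application:ensure_all_started(crypto),
        _ = crypto:supports(),
        ok.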

@ijunaid8989 commented:

@SergeTupchiy Actually, we are not making any HTTPS requests.

The origin of the request is HTTPS, but the remote URL the request is going to is not HTTPS. Do you still suggest the same?

@SergeTupchiy (Contributor, Author) commented:

@ijunaid8989,
This is the place where it occurs:
https://github.com/SergeTupchiy/hackney/blob/master/src/hackney_pool.erl#L74

You can add some logging there and share the output. It would help to locate the issue.
For example:

try
  do_checkout(Requester, Host, Port, Transport, Client, ConnectTimeout, CheckoutTimeout) 
catch Err:Res:St ->
  io:format("checkout failure: ~p, ~p, ~p~n", [Err, Res, St]),
  {error, checkout_failure}
end

@ijunaid8989 commented:

@SergeTupchiy We are actually hitting many errors: checkout_timeout, checkout_failure, closed.

edgurgel/httpoison#326

@SergeTupchiy (Contributor, Author) commented:

@ijunaid8989,
It is no surprise that you (occasionally?) get all the errors you mentioned under conditions that are still not completely clear.
This PR is intended to fix a rather clear and well-defined problem, see #643.
I'd be happy to at least try to fix the issues you face, but it would be too time-consuming (if feasible at all) for me, considering the details you shared.
Saying that one experiences checkout_timeout or closed doesn't really help, since those errors are natural in some cases.
I have no idea how to reproduce the issue you linked. It has been open for two years, and I guess I know why it's still not resolved: someone says that one HTTP client works fine with service B but fails with service A, while another works with both. That doesn't look like an exhaustive description.

@benoitc (Owner) commented Nov 1, 2020

Sorry for the late reply. I have been quite busy these days.

Thanks for the patch! Looking at it.

@isaacsanders commented:

@benoitc Running into this issue currently. Any status update on your end?

@lasernite commented Nov 11, 2020

@benoitc Also getting bit hard by this; merging would be much appreciated so we don't have to further litter our codebase with catches and hard resets of the pool every time we make a request with HTTPoison that may time out. Thank you!

@ijunaid8989 commented:

@lasernite how do you hard reset your pool?

benoitc added this to the 1.17.0 milestone on Nov 12, 2020
@lasernite commented:

@ijunaid8989 we're doing stuff like:

{:error, %HTTPoison.Error{reason: :checkout_timeout}} ->
    :hackney_pool.stop_pool(:default)

@ijunaid8989 commented:

> @ijunaid8989 we're doing stuff like:
>
>     {:error, %HTTPoison.Error{reason: :checkout_timeout}} ->
>         :hackney_pool.stop_pool(:default)

Is it helpful? We are also getting timeout/checkout_timeout/checkout_failure/closed.

Would that be helpful in such a condition?

@lasernite commented:

@ijunaid8989 Yeah, once we catch the timeouts and reset the pool the error goes away. I assume this has some negative performance implications, since it unnecessarily clears the pool frequently, but nothing noticeable so far.

@ijunaid8989 commented:

> @ijunaid8989 Yeah, once we catch the timeouts and reset the pool the error goes away. I assume this has some negative performance implications, since it unnecessarily clears the pool frequently, but nothing noticeable so far.

Okay, thanks.

FYI: we have started the hackney pools like this:

      :hackney_pool.child_spec(:snapshot_pool, [timeout: 50_000, max_connections: 10_000]),
      :hackney_pool.child_spec(:seaweedfs_upload_pool, [timeout: 50_000, max_connections: 10_000]),
      :hackney_pool.child_spec(:seaweedfs_download_pool, [timeout: 50_000, max_connections: 10_000]),

in the application.ex file.

@lasernite commented:

@ijunaid8989 Assuming those child_specs are getting started in a supervision tree in your application, I'm guessing you have to stop the pools by the atoms you customized above (:snapshot_pool, etc.) instead of :default when calling stop_pool. But don't take my word for it; I'm just using a library around hackney (HTTPoison) instead of calling these functions directly, so I'll leave it up to you to see what works!

@ijunaid8989 commented:

> @ijunaid8989 Assuming those child_specs are getting started in a supervision tree in your application, I'm guessing you have to stop the pools by the atoms you customized above (:snapshot_pool, etc.) instead of :default when calling stop_pool. But don't take my word for it; I'm just using a library around hackney (HTTPoison) instead of calling these functions directly, so I'll leave it up to you to see what works!

Thanks, I am also using HTTPoison, but this is how we are starting the pools.

Can you tell me how you have started the pools? I mean, in the config file, or with each HTTP request?

Maybe your way of using pools with HTTPoison is different (correct)?

@lasernite commented:

@ijunaid8989 We're not setting up any custom pools. As HTTPoison is a dependency in mix.exs, when we start the server the normal application lifecycle callbacks run, automatically starting HTTPoison and adding it to the supervision tree. Pooling reduces latency on subsequent requests to the same host, but I haven't investigated whether our patch stops pooling altogether or whether, when the pools get stopped, the supervision tree automatically restarts them. It doesn't matter much except for saving a few hundred milliseconds on subsequent requests in the best case; if we are losing that, it will be fixed when this library is updated and we remove the temporary patch.

SergeTupchiy added 2 commits on Oct 13:

- c9979cf: Fix a memory leak in hackney_pool
  (remove pending by the same ref that was taken from the front of the queue)
- 2bf38f9: Fix checkout_timeout error in hackney_pool caused by connection errors
  (Replace casting the checkout_cancel message in hackney_pool with a hackney_manager:cancel_request/1 call from hackney_connect; this ensures the pool checkout is canceled and the client state is erased.)

@SergeTupchiy (Contributor, Author) commented:

This is just to reformat my commit messages 😄

@ijunaid8989 commented:

@SergeTupchiy You got me, I was thinking it was a fix for something!

@benoitc (Owner) commented Nov 19, 2020

Late update, but I'm testing the patch right now. Thanks for the refactoring.

@ijunaid8989 commented:

> Late update, but I'm testing the patch right now.

It would be great to see some progress on this. I had to clone HTTPoison as well as Hackney to use my own sources.

@benoitc (Owner) commented Nov 20, 2020

> Late update, but I'm testing the patch right now.
>
> It would be great to see some progress on this. I had to clone HTTPoison as well as Hackney to use my own sources.

This release is being tested. It should normally ship later today.

benoitc merged commit 91763df into benoitc:master on Nov 20, 2020
@benoitc (Owner) commented Nov 20, 2020

@SergeTupchiy Thanks for the patch. Just merged your changes. I will finish testing the release later today. Normally a release should land today as well.

@RaphSfeir commented:

Is this now part of the 1.16.0 release? I'm also using my own sources and would like to switch back to the official release, but I keep getting the same issue. Thanks!

@benoitc (Owner) commented Nov 27, 2020 via email

digitalcora added a commit to mbta/alerts_concierge that referenced this pull request on Mar 5, 2021:

    This version fixes a memory/connection pool leak that has resulted in
    app crashes on some of our other projects (and maybe this one?).

    Since Hackney is not explicitly depended-on anywhere in `concierge_site`
    it can be removed from that app's dependencies entirely.

    ref: benoitc/hackney#661

Successfully merging this pull request may close these issues.

connections not returned to the pool on connection errors
7 participants