-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EXPERIMENT: proof of concept: refactor for an http-rb solr json writer #290
base: main
Are you sure you want to change the base?
Conversation
Re httpx: And actually, it occurs to me… if we have our own thread pool, maybe letting httpx use thread local storate for connections is actually fine…. it’ll just make as many persistent connections as we have threads sharing the same session object. Which is fine maybe? May actually be what HTTPClient was doing anyway? So @billdueber may have been right after all that it’ll Just Work. But making my own pool of http-rb using mperham/connection-pool worked out just fine, is simple code as you can see, and is a pattern that can be used with any http client that supports persistence. This PR does not include an httpx writer, but should be do-able along the same lines. With or without an external mperham/connection-pool, or just using httpx's internal. |
My first thought is that I'd much rather extract the base class into a traject-independent implementation (which takes things like url, number of threads, queue size, etc.), because I'd really like to be able to use the batch writer -- it's incredibly useful code! -- without having to build up a traject settings object and without having to wrap my not-built-by-traject solr document hashes in an otherwise empty context object just to send them to the writer, which immediately unwraps it to get the output_hash and send it to solr. I'll probably end up doing that at some point as something like If there's already a thread pool, it doesn't seem like a terrible idea to have thread-local http objects. I doubt there are any http client implementations that don't carry any state, so none of them (the non-trivial ones, anyway) will be sharable across threads anyway. [Just looked at httpx, and sure enough there's a Man, I really wish Faraday was up to the challenge for this. It'd be nice to not have to hard-code for each http client. |
OK, theres nothing like writing code to get the old thinker thinking. More thoughts... If we realize we can use We actually could easily wrap faraday too. but how do you get a faraday client that will keep persistent keep-alive HTTP connections? I guess the net-http-persistent one probably would. But what about things like It could happen. In general, the pain that the faraday 1=>2 transition caused many projects I worked on -- kind of soured me on faraday. Unless you have a really really good use case need for it. We might. If you want to experiment -- I think the work I've done here to totally strip out the specific-http-client-relevant stuff into identified methods, should be helpful for any other experiments you might want to do @billdueber. Even if we ultimately decide to roll it back into one class, just having stripped out the methods that rely on specific http client api, I hope will make it easier for you to experiment with faraday or httpx or whatever you like.
|
Totally. I don't have the bandwidth to experiment right now, unfortunately
-- maybe after all my end-of-year deadlines :-)
…On Wed, Dec 14, 2022 at 10:13 AM Jonathan Rochkind ***@***.***> wrote:
OK, theres nothing like writing code to get the old thinker thinking. More
thoughts...
If we realize we can use connection-pool to wrap *any* persistent http
client, and make it available for multi-threaded use....
We actually could easily wrap faraday too. but how do you get a faraday
client that will keep persistent keep-alive HTTP connections? I guess the
net-http-persistent one probably would. But what about things like httpx
or http-rb, that can do persistent keep-alive *if so configured*? faraday
doens't have it's own api for asking "persistent", so you'd have to drop
down to the faraday api for "let me configure the underlying adapter in an
adapter-specific way" -- not totally sure how that would work in a traject
context, I guess we'd have to let people have an API where they provide a
block to configure faraday I guess?
It could happen.
In general, the pain that the faraday 1=>2 transition caused many projects
I worked on -- kind of soured me on faraday. Unless you have a really
really good use case need for it. We might.
If you want to experiment -- I think the work I've done here to totally
strip out the specific-http-client-relevant stuff into identified methods,
should be helpful for any other experiments you might want to do
@billdueber <https://github.com/billdueber>. Even if we ultimately decide
to roll it back into one class, just having stripped out the methods that
rely on specific http client api, I hope will make it easier for you to
experiment with faraday or httpx or whatever you like.
—
Reply to this email directly, view it on GitHub
<#290 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAA32VWUHSZHBSPN45TLQQ3WNHP2HANCNFSM6AAAAAAS5VE7UU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
Bill Dueber
Library Systems Programmer
University of Michigan Library
|
An experimental proof of concept that includes an http-rb adapter.
It turns out there's a lot of logic in the SolrJsonWriter that's actually about managing the "batching" in a thread-safe way (where multiple threads can be
put
ing individual output reccords), and handling error messages in that multi-threaded environment, etc.So I first turn the SolrJsonWriter into an abstract base-class, that contains all that logic, but where anything relevant to a specific http adapter has been factored out into some "abstract" methods to be implemented by sub-class. Re-factor the
HTTPClient
-basedSolrJsonWriter
itself to be such a sub-class, with the same name for legacy backwards compat.Also add a new sub-class instead based on http-rb,
Traject::SolrJsonWriterHttpRb
.Persistent HTTP Connections
One assumption I had is that it is important for performance to re-use persistent (keep-alive) HTTP connections. The idea is that opening a new HTTP connection (especially with SSL negotation) on every http request is going to be a performance problem. This is what motivated the original design. But this does complicate things somewhat -- I think it would make using
faraday
infeasible, as faraday is not (when I've looked before), not very careful with it's API/semantics around persistent http connections, hard to ensure it'll work right.Even working directly with http clients, I find that for instance with http-rb, you need to have the same HTTP request headers, and same timeout, for all requests in a given persistent connection! This is annoying , but was able to make it mostly work.
If it turns out persistent HTTP keep-alive connections are actually irrelevant, more options might open up. But for now, I'm sticking to it.
A proof of concept
Things that in my opinion make this unfinished or require more thought or I don't love.
I think this superclass/sub-class architecture does make the code more abstract and hard to read/follow/maintain, and it already wasn't great. If things outside of traject implement sub-classes, then it's more public API to worry about maintaining.
The classes involved definitely need more docs that aren't yet in this PR
Kind of punting on tests at the moment. The new sub-class isn't tested. How do we test superclass? How do we test each sub-class? (We're not using rspec with ti's "shared examples" feature, so don't have access to that).
Sub-classes are accessing iVars directly, we should make attr_reader accessors for any of them instead, sub-classes accessing iVars set in superclass directly is not a good inheritance API.
If we really include multiple adapters, we include dependencies in gemspec for all of them? Pro's/con's either way, traject was originally meant for people who wouldn't be able to handle installing extra optional dependencies, in bundler early days.
Also I realized -- the existing
SolrJsonWriter
's feature to have multiple separate writing threads? I'm not sure it's actually useful/makes sense. In my own use, i have never found any performance advantage to using anything but0
or1
setting here. HOWEVER I don't think eliminating that feature really has any complexity benefit, since the writer is going to be called by multiple threads anyway, it needs to be thread-safe in much the same way anyway, it doens't really save us much.Next steps/to test this
If you want to test the http-rb writer, somehow point your installation at this branch, and then in your traject config:
OR, on the command line,
traject -w Traject::SolrJsonWriterHttpRb ...
I would be very interested to have people test this and say:
However, I'm not sure when the right time is to ask a lot of people to test. Probably not yet, becuase you can't ask a lot of people to do that all the time, you only get one shot.
But @billdueber , maybe you want to? Maybe there are a couple other volunteers? (but again, I wouldn't try to recruit too hard yet, but would wait until it's more mature).
I honestly probably won't do too much with this branch/PR for a while. Spent today on this "spike" for a proof of concept, and producing something that actually can be tested in real world apps. I'm leaving it here for you or anyone else to experiment with or take further; I could come back to it in the future, but for now am hoping to turn back to other projects.
httpx?
I know @billdueber is interested in httpx rather than http-rb.
Theoretically the pattern used here should be a good guide for you to write an httpx subclass instead.
What gives me pause is the same old issue of trying to re-use persistent http connections in possibly multi-threaded use. It is not clear to me what httpx's actual semantics are here. I found this sentence, which gives me pause:
To me this suggests that there is no way to share a persistent HTTP connection between threads? Thread-safe, sure, because it's not capable of re-using the connection across more than one thread.
That could be fine -- we might just need to have a different
httpx
objects (sessions?) themselves in a pool, checked out to each thread -- which is what I'm doing with http-rb anyway, using the mperhan/connection-pool gem that the http-rb docs recommend, but can be used with anything.I just don't really understand what the "pool" is that's within a "session" then... maybe I don't need to?