
EXPERIMENT: proof of concept: refactor for an http-rb solr json writer #290

Draft · wants to merge 6 commits into main
Conversation

@jrochkind (Contributor) commented Dec 13, 2022

An experimental proof of concept that includes an http-rb adapter.

It turns out there's a lot of logic in the SolrJsonWriter that's actually about managing the "batching" in a thread-safe way (where multiple threads can be putting individual output records), handling error messages in that multi-threaded environment, etc.

So I first turned the SolrJsonWriter into an abstract base class that contains all that logic, with anything relevant to a specific http adapter factored out into "abstract" methods to be implemented by sub-classes. The HTTPClient-based SolrJsonWriter itself is re-factored into such a sub-class, keeping the same name for legacy backwards compatibility.

Also added a new sub-class based instead on http-rb: Traject::SolrJsonWriterHttpRb.
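The shape of that re-factor is the classic template-method pattern: shared batching logic in the base class, http-specific work in "abstract" methods. A minimal sketch, where the class and method names are hypothetical (not necessarily the ones used in this PR), and FakeHttpWriter stands in for a real adapter:

```ruby
class AbstractSolrJsonWriter
  # Shared batching/serialization logic lives here; the actual HTTP
  # work is delegated to a sub-class via #post_json.
  def send_batch(records)
    post_json(serialize_batch(records))
  end

  # Records are assumed to already be JSON strings in this sketch.
  def serialize_batch(records)
    "[" + records.join(",") + "]"
  end

  # "Abstract" method: each http-adapter sub-class must implement it.
  def post_json(json)
    raise NotImplementedError, "#{self.class} must implement #post_json"
  end
end

class FakeHttpWriter < AbstractSolrJsonWriter
  def posted
    @posted ||= []
  end

  # A real sub-class would POST to solr here (with HTTPClient, http-rb, etc.)
  def post_json(json)
    posted << json
  end
end

writer = FakeHttpWriter.new
writer.send_batch(['{"id":1}', '{"id":2}'])
writer.posted # => ['[{"id":1},{"id":2}]']
```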

Persistent HTTP Connections

One assumption I had is that it is important for performance to re-use persistent (keep-alive) HTTP connections. The idea is that opening a new HTTP connection (especially with SSL negotiation) on every http request is going to be a performance problem. This is what motivated the original design. But it does complicate things somewhat -- I think it would make using faraday infeasible, as faraday is not (when I've looked before) very careful with its API/semantics around persistent http connections, making it hard to ensure it'll work right.

Even working directly with http clients, I find that, for instance with http-rb, you need to have the same HTTP request headers and the same timeout for all requests in a given persistent connection! This is annoying, but I was able to make it mostly work.

If it turns out persistent HTTP keep-alive connections are actually irrelevant, more options might open up. But for now, I'm sticking to it.

A proof of concept

Things that, in my opinion, make this unfinished, require more thought, or that I don't love:

  • I think this superclass/sub-class architecture does make the code more abstract and hard to read/follow/maintain, and it already wasn't great. If things outside of traject implement sub-classes, then it's more public API to worry about maintaining.

  • The classes involved definitely need more docs, which aren't yet in this PR.

  • Kind of punting on tests at the moment. The new sub-class isn't tested. How do we test the superclass? How do we test each sub-class? (We're not using rspec with its "shared examples" feature, so we don't have access to that.)

  • Sub-classes are accessing iVars directly; we should make attr_reader accessors for them instead. Sub-classes directly accessing iVars set in the superclass is not a good inheritance API.

  • If we really include multiple adapters, do we include gemspec dependencies for all of them? There are pros/cons either way; traject was originally meant for people who, in the early days of bundler, wouldn't be able to handle installing extra optional dependencies.

    • And in general, release plan/backwards compat: instead of trying to provide/maintain multiple solr json batch adapters, should we just switch over to one new preferred one and make the old one legacy, maybe only add-able through an additional plugin gem, in the next major version?

Also, I realized: the existing SolrJsonWriter's feature of having multiple separate writing threads? I'm not sure it's actually useful/makes sense. In my own use, I have never found any performance advantage to anything but a 0 or 1 setting here. HOWEVER, I don't think eliminating that feature really has any complexity benefit: since the writer is going to be called by multiple threads anyway, it needs to be thread-safe in much the same way regardless, so it doesn't really save us much.

Next steps/to test this

If you want to test the http-rb writer, somehow point your installation at this branch, and then in your traject config:

```ruby
settings do
  provide "writer_class_name", "Traject::SolrJsonWriterHttpRb"
end
```

OR, on the command line: `traject -w Traject::SolrJsonWriterHttpRb ...`

I would be very interested to have people test this and say:

  • Did it work flawlessly as a replacement, no problems?
  • Any performance differences with previous traditional writer?

However, I'm not sure when the right time is to ask a lot of people to test. Probably not yet, because you can't ask a lot of people to do that all the time; you only get one shot.

But @billdueber, maybe you want to? Maybe there are a couple other volunteers? (Again, I wouldn't try to recruit too hard yet; I'd wait until it's more mature.)

I honestly probably won't do too much with this branch/PR for a while. I spent today on this "spike" for a proof of concept, producing something that can actually be tested in real-world apps. I'm leaving it here for you or anyone else to experiment with or take further; I could come back to it in the future, but for now I'm hoping to turn back to other projects.

httpx?

I know @billdueber is interested in httpx rather than http-rb.

Theoretically the pattern used here should be a good guide for you to write an httpx subclass instead.

What gives me pause is the same old issue of trying to re-use persistent http connections in possibly multi-threaded use. It is not clear to me what httpx's actual semantics are here. I found this sentence in its docs:

> Because pools are thread-local, any call from a session will be thread-safe.

To me this suggests that there is no way to share a persistent HTTP connection between threads? Thread-safe, sure, because it's not capable of re-using the connection across more than one thread.

That could be fine; we might just need to have a pool of the httpx objects (sessions?) themselves, checked out to each thread. That's what I'm doing with http-rb anyway, using the mperham/connection_pool gem that the http-rb docs recommend, which can be used with anything.
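A minimal sketch of that checkout pattern, using a stdlib thread-safe Queue instead of the connection_pool gem so it runs standalone. FakeClient is a stand-in; a real pool would hold persistent http-rb clients (or httpx sessions):

```ruby
# Stand-in for a persistent http client.
FakeClient = Struct.new(:id)

# A tiny checkout pool along the lines of mperham/connection_pool:
# a fixed set of connections lives in a thread-safe queue; each caller
# checks one out, uses it, and returns it.
class TinyPool
  def initialize(size)
    @queue = Queue.new
    size.times { |i| @queue << FakeClient.new(i) }
  end

  def with
    client = @queue.pop   # blocks if all connections are checked out
    yield client
  ensure
    @queue << client if client
  end
end

pool = TinyPool.new(2)
results = 8.times.map do
  Thread.new { pool.with { |client| client.id } }
end.map(&:value)
results.uniq.sort # => [0, 1] -- only 2 connections ever exist
```

The real connection_pool gem adds checkout timeouts and lazy connection creation, but the shape is the same: `pool.with { |client| ... }` around each batch POST.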

I just don't really understand what the "pool" is that's within a "session" then... maybe I don't need to?

@jrochkind (Contributor, Author)

Re httpx:

And actually, it occurs to me… if we have our own thread pool, maybe letting httpx use thread-local storage for connections is actually fine…

It’ll just make as many persistent connections as we have threads sharing the same session object. Which is maybe fine? It may actually be what HTTPClient was doing anyway?

So @billdueber may have been right after all that it’ll Just Work.
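The thread-local idea can be sketched like this. FakeConnection is purely illustrative (not httpx API); it just shows that each thread lazily creates, then keeps re-using, its own connection:

```ruby
# Stand-in for a per-thread persistent http connection/session.
class FakeConnection
  attr_reader :owner

  def initialize
    @owner = Thread.current.object_id # remember which thread built it
  end
end

def connection
  # Thread-local memoization: first call on a thread creates the
  # connection, later calls on that thread re-use it.
  Thread.current[:solr_writer_connection] ||= FakeConnection.new
end

conns = 4.times.map { Thread.new { connection } }.map(&:value)
conns.map(&:owner).uniq.size    # => 4 (one connection per thread)
connection.equal?(connection)   # => true (re-used within a thread)
```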

But making my own pool of http-rb clients using mperham/connection_pool worked out just fine; it's simple code, as you can see, and a pattern that can be used with any http client that supports persistence.

This PR does not include an httpx writer, but one should be do-able along the same lines, with or without an external mperham/connection_pool, or just using httpx's internal one.

@billdueber (Member)

My first thought is that I'd much rather extract the base class into a traject-independent implementation (which takes things like url, number of threads, queue size, etc.). I'd really like to be able to use the batch writer -- it's incredibly useful code! -- without having to build up a traject settings object, and without having to wrap my not-built-by-traject solr document hashes in an otherwise-empty context object just to send them to the writer, which immediately unwraps it to get the output_hash and send it to solr.

I'll probably end up doing that at some point, as something like Solr::BatchJSONWriter, if you don't want to mess with it here.

If there's already a thread pool, it doesn't seem like a terrible idea to have thread-local http objects. I doubt there are any http client implementations that don't carry any state, so none of them (the non-trivial ones, anyway) will be sharable across threads anyway.

[Just looked at httpx, and sure enough there's a pool = Thread.current[:httpx_connection_pool] ||= Pool.new, so obviously you're not the only one to think it might be a good way to go]

Man, I really wish Faraday was up to the challenge for this. It'd be nice to not have to hard-code for each http client.

@jrochkind (Contributor, Author) commented Dec 14, 2022

OK, there's nothing like writing code to get the old thinker thinking. More thoughts...

If we realize we can use connection-pool to wrap any persistent http client, and make it available for multi-threaded use....

We actually could easily wrap faraday too. But how do you get a faraday client that will keep persistent keep-alive HTTP connections? I guess the net-http-persistent adapter probably would. But what about things like httpx or http-rb, which can do persistent keep-alive if so configured? Faraday doesn't have its own API for asking for "persistent", so you'd have to drop down to faraday's API for configuring the underlying adapter in an adapter-specific way. I'm not totally sure how that would work in a traject context; I guess we'd have to give people an API where they provide a block to configure faraday?

It could happen.

In general, the pain that the faraday 1=>2 transition caused many projects I worked on kind of soured me on faraday, unless you have a really, really good use-case need for it. We might.

If you want to experiment -- I think the work I've done here to totally strip out the specific-http-client-relevant stuff into identified methods, should be helpful for any other experiments you might want to do @billdueber. Even if we ultimately decide to roll it back into one class, just having stripped out the methods that rely on specific http client api, I hope will make it easier for you to experiment with faraday or httpx or whatever you like.

  • It's also still possible I'm totally wrong that keeping persistent keep-alive connections even matters, once we are batching updates to solr instead of doing one-http-request-per-update. It could be simpler if we don't care about persistent keep-alive http connections. It seems a pain to try to verify or dispel this assumption; I haven't done it yet.

@billdueber (Member) commented Dec 14, 2022 via email
