
NioLoop to share Selector (typical use case is HTTP selector) #18

sagenschneider opened this issue Aug 1, 2018 · 8 comments

@sagenschneider (Contributor)
Rather than hard coding the threading model and Selectors for a DataSource and its Connections, allow them to be configured externally.

A primary use case for this is sharing the HTTP Selector. By allowing the NIO loop to be provided to the DataSource for its connections, the HTTP Selector and threads can also be used to service the communication with the database. This allows a single thread to service HTTP requests end to end, which is my understanding of the aim of NIO.

This is further work on #12 .

Example code (not yet working) available to demonstrate the concept:

https://github.com/sagenschneider/pgsql2/blob/281b7fcecd8c54a8041680d654ea9de9404427ff/src/main/java/org/postgresql/sql2/PGDataSource.java#L45 (shows how the NioLoop can be provided to the DataSource with a reasonable default)

https://github.com/sagenschneider/pgsql2/blob/281b7fcecd8c54a8041680d654ea9de9404427ff/src/main/java/org/postgresql/sql2/execution/NioLoop.java#L1 (NioLoop that may be implemented by the web server to share its HTTP Selector with the DataSource)

https://github.com/sagenschneider/pgsql2/blob/281b7fcecd8c54a8041680d654ea9de9404427ff/src/main/java/org/postgresql/sql2/communication/ProtocolV3.java#L65 (using NioLoop for communication to database)
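For illustration only, a minimal sketch of the concept behind those links - a NioLoop abstraction over a shared java.nio Selector. All names here are hypothetical, not the actual pgsql2 API:

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical NioLoop: a web server implementing this can hand its existing
// HTTP Selector to the DataSource, so one event loop services both HTTP and
// database I/O.
interface NioLoop {
  // Register a channel with the shared Selector for the given ops; the
  // attached handler is invoked when the key becomes ready.
  SelectionKey register(SelectableChannel channel, int ops, Object handler) throws IOException;
}

class SharedSelectorLoop implements NioLoop {
  private final Selector selector;

  SharedSelectorLoop(Selector selector) {
    this.selector = selector;
  }

  @Override
  public SelectionKey register(SelectableChannel channel, int ops, Object handler) throws IOException {
    channel.configureBlocking(false);
    return channel.register(selector, ops, handler);
  }

  // One iteration of the event loop: returns how many keys became ready.
  int poll() throws IOException {
    int ready = selector.selectNow();
    // a real loop would dispatch each selected key to its attached handler here
    selector.selectedKeys().clear();
    return ready;
  }
}
```

The point is that the DataSource only needs the `register` contract; whose Selector sits behind it (the HTTP server's or a default internal one) becomes a configuration decision.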

@sagenschneider (Contributor Author) commented Aug 2, 2018

As per PR #19, providing the following write-up on my (admittedly shallow) understanding of ADBA, plus thoughts on driver architecture to achieve high throughput.

I see the architecture of ADBA at two levels:

  • top level is the callback/promise/completion-stage object/function graph of how the interaction with the database is to be managed. This includes sequencing operations so they occur one after another and within things such as transactions. I would expect this to work similarly to Promises in JavaScript, which provide an improved coding style over the callback(error, result) pattern that is typical of asynchronous operations

  • lower level that is handed the "submissions" from the top level "promise" layer, actually sends them over the wire to PostgreSQL, and then returns the data back to the appropriate "operation/submission" callback

Keeping these separate actually allows better focus on the problems, as network performance is all about keeping the socket buffers full for outgoing data and keeping them clear for incoming data. I can talk you through the lanes-of-traffic analogy I use regarding network performance (from OfficeFloor you can probably tell I'm big on analogies) - and why opening more lanes/connections of traffic does not improve throughput when trying to get to the singleton city/database.

Anyway, the top level is all about handling callbacks and actioning the next "submission" within callbacks. To me this is a different coding paradigm from filling ByteBuffers to get them over the network and pulling data off socket buffers to keep responses coming back.

From a visual point of view, I see it as the following:

ADBA Promise Layer -> Submissions -> written to ByteBuffers -> moved over network to PostgreSQL

Then (order representing data coming back):

ADBA Promise Layer runs "callbacks" into application code <- appropriate Operation/Submission callbacks are triggered with the data <- data interpreted into Operation data <- data received in ByteBuffers <- PostgreSQL responds

I see it as the responsibility of the Connection to order the submissions. However, like JDBC, the Connection is thread safe if threads re-use it to run a Statement... hence ADBA submissions could be sent one right after another as the application progresses through the ADBA promise layer (potentially with many submissions underway in parallel, though sequential at the network layer per the flow above). Dependencies between the submissions should be managed at the Connection level, not the protocol level, which is about getting SQL to the database and results back.

To achieve this, I would like to consider some refactoring of the ProtocolV3 layer, along the following lines (at a very high, broad-stroke level):

  1. Network layer takes in Submissions (as ProtocolV3 currently does)
  2. Network layer puts these into a Queue (as ProtocolV3 currently does)
  3. Network layer writes the submission to the network as soon as possible (wasted time not filling the socket buffer is lost throughput). To achieve this in a flexible way, the network layer makes a call into the Submission with a context object. The context object provides the means to write directly to pooled DirectByteBuffers. This reduces both copy costs and GC overheads.
  4. Once the submission is written to the network, the submission is moved to the awaitingResponse queue (similar to this PR)
  5. As data becomes available from the network (PostgreSQL), the Submission object in the awaitingResponse queue is given access to pull the available data from the received pooled DirectByteBuffers (again reducing copy costs and GC overheads)
  6. The submissions then can trigger necessary callbacks at the ADBA top "promise" layer
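A rough, in-memory sketch of steps 1-5 above (hypothetical names, not the pgsql2 code; a real driver would use a pool of DirectByteBuffers and the actual socket):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Submissions enter a pending queue, are written into a pooled outgoing
// buffer as soon as possible, then wait on an awaitingResponse queue for the
// data coming back from the database.
class NetworkLayer {
  interface NetworkSubmission {
    void write(ByteBuffer out);      // fill the outgoing socket buffer
    void onResponse(ByteBuffer in);  // pull data off the incoming buffer
  }

  private final ArrayDeque<NetworkSubmission> pending = new ArrayDeque<>();
  private final ArrayDeque<NetworkSubmission> awaitingResponse = new ArrayDeque<>();
  private final ByteBuffer pooledOut = ByteBuffer.allocateDirect(1024); // pooled in a real driver

  // steps 1-2: take in a submission and queue it
  void submit(NetworkSubmission s) {
    pending.add(s);
  }

  // steps 3-4: drain pending submissions into the outgoing buffer, moving
  // each to the awaitingResponse queue once written
  ByteBuffer flush() {
    pooledOut.clear();
    NetworkSubmission s;
    while ((s = pending.poll()) != null) {
      s.write(pooledOut);
      awaitingResponse.add(s);
    }
    pooledOut.flip();
    return pooledOut;
  }

  // step 5: hand received bytes to the submission at the head of the queue,
  // which then triggers its ADBA-level callbacks (step 6)
  void onData(ByteBuffer in) {
    NetworkSubmission s = awaitingResponse.poll();
    if (s != null) s.onResponse(in);
  }
}
```

This keeps the network layer a dumb pump: it knows nothing about SQL or transactions, only about moving each submission's bytes in and out.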

This will make the architecture a lot more modular, with submissions being self contained (including the data required to be sent and received for that submission).

The only issue is if submissions require chatty network communication (e.g. the authentication handshake). I don't know the PostgreSQL wire format too well, but I'm thinking this may be the only case of a submission requiring multiple messages back and forth. Note: I would expect a request for further data in a "ResultSet" to be another submission at the network layer. Possibly re-using the term "submission" from ADBA may be misleading; I don't know a better name at this point, so will go with Network Submission.

If, however, there are further cases of "submission chattiness", we can flag these at the network layer with a "halt further submissions" flag until that submission is complete. I'm thinking this would likely be adequate for the more complicated scenarios (given the ADBA top level "promise" layer is responsible for the more complicated callback triggers and further submissions). For these I would be interested in working through them case by case (I'm hoping they are a small number of edge cases, but happy to be corrected on this).
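That halting flag could be as simple as the following sketch (hypothetical names; a chatty submission halts the queue when its exchange starts and resumes it when done):

```java
import java.util.ArrayDeque;

// Pending work drains normally until a chatty submission (e.g. the auth
// handshake) halts the queue; later submissions are held back until resume().
class HaltingQueue {
  private final ArrayDeque<Runnable> pending = new ArrayDeque<>();
  private boolean halted;

  void submit(Runnable r) {
    pending.add(r);
  }

  // drain submissions to the network until halted or empty; returns the count sent
  int drain() {
    int sent = 0;
    while (!halted && !pending.isEmpty()) {
      pending.poll().run();
      sent++;
    }
    return sent;
  }

  void halt() { halted = true; }    // called when a chatty exchange begins
  void resume() { halted = false; } // called when that exchange completes
}
```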

Anyway, these are my thoughts (and experience of writing a high performance HTTP server) applied to my, I admit, very shallow understanding of the ADBA specification. Happy to discuss more and bounce ideas around. Given the changes, I did not want to go too much further without at least getting some discussion going on the ideas :)

@alexanderkjall @davecramer

@davecramer (Member)

@sagenschneider couple of things off the top of my head. (I'm going to have to look at your PR to understand it more fully)

  • The connection in JDBC is not threadsafe.
  • I'm not convinced that the PostgreSQL connection will send responses back in the order the requests were sent (but I'd certainly like to find out). Think one submission requires a long running transaction and the other is a simple "select 1".
  • How would you deal with session state in the connection? For instance, there is no way to have one submission be in a transaction and another not. Likewise for any number of other states: search_path and encoding, to name a few.

The only way I can think of to deal with this currently is to have a number of connections and multiplex the submissions across the open connections.
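A minimal sketch of that multiplexing idea (hypothetical and round-robin only; a real multiplexer would also pin a submission's whole transaction, and any other session state, to the connection it started on):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hand each incoming submission to one of several open connections so that
// session state (transactions, search_path, encoding, ...) never mixes
// between concurrent submissions on the same connection.
class ConnectionMultiplexer<C> {
  private final List<C> connections;
  private final AtomicInteger next = new AtomicInteger();

  ConnectionMultiplexer(List<C> connections) {
    this.connections = connections;
  }

  // round-robin selection over the open connections
  C pick() {
    return connections.get(Math.floorMod(next.getAndIncrement(), connections.size()));
  }
}
```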

@sagenschneider (Contributor Author)

@davecramer Connection is thread safe if you create a statement per thread.

Basically, the following will run fine:

Connection conn = ...;
for (int i = 0; i < 100; i++) {
  new Thread(() -> {
    try { conn.createStatement().execute("SELECT * FROM TEST"); }
    catch (SQLException e) { e.printStackTrace(); }
  }).start();
}

Unfortunately, what I find is there is a synchronised method in drivers that basically blocks to only do one request at a time (so the above performs poorly compared to a thread pool of connections).

My understanding of PostgreSQL (again shallow, without inspecting the code) is that it creates a process per connection. This is evident from reading the tuning guides (about the memory size to specify against the number of connections). Therefore, a connection is serviced by a single process that I'm guessing is not multi-threaded to process requests in parallel. My guess is that it sequentially processes the requests as they come in over the wire and sends responses back as it completes them (basically maintaining order - allowing pipelining).

Again, this is unconfirmed; however, https://github.com/reactiverse/reactive-pg-client claims "command pipelining" half way down its feature list (and that is likely the reason it scores so far ahead of the next driver at https://www.techempower.com/benchmarks/#section=data-r16&hw=ph&test=db ). Plus there have been discussions in the TechEmpower forums about allowing batching given Vert.x (reactive-pg-client) pipelines (sadly you can't beat the throughput of a flow of traffic when "your bridge only allows you to drive one car over it at a time").

In regards to session state in the connection, it is the responsibility of the ADBA "promise" layer to organise the order. Basically, for me asynchronous code is functional coding (not OO). Unless a dependency is created between the functions (callbacks), the runtime is free to execute them as it sees fit. Hence the following code is indeterminate about whether the update occurs within the transaction or not (excuse me if the syntax is slightly off from ADBA):

connection.beginTransaction().submit().outOperation("SELECT * FROM TEST").commit().submit();
connection.countOperation("UPDATE TEST SET NAME = 'SAME'").submit();

Ideally, there might be an implicit "then" relationship created. However, if not... establishing the transaction takes time, and the update may or may not land inside the transaction depending on delays in the callbacks actioning the next item in each chain.
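The hazard can be shown with plain CompletableFuture (not the ADBA API): two chains submitted back to back have no ordering guarantee between them unless an explicit dependency is created.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Two independent async chains: a BEGIN/SELECT/COMMIT chain and an UPDATE
// submitted immediately after with no "then" dependency on the first chain.
class ChainRace {
  static List<String> run() {
    List<String> order = new CopyOnWriteArrayList<>();
    CompletableFuture<Void> txn = CompletableFuture
        .runAsync(() -> order.add("BEGIN"))
        .thenRun(() -> order.add("SELECT"))
        .thenRun(() -> order.add("COMMIT"));
    CompletableFuture<Void> update = CompletableFuture
        .runAsync(() -> order.add("UPDATE")); // no dependency on the txn chain
    CompletableFuture.allOf(txn, update).join();
    // "UPDATE" may land before, inside, or after the BEGIN..COMMIT sequence
    return order;
  }
}
```

Within each chain the order is fixed by the then-relationships, but across chains it is a race - exactly the indeterminacy described above.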

This is an inherent problem with asynchronous I/O and why it can get very hairy to debug (and possibly where ADBA will have a lot of trouble - in managing data changes). For me, when it comes to updates and transactions, JPA wins hands down for ease of development. JPA, unfortunately, expects blocking I/O. It is for this reason I also smile when I hear about JavaScript being great for REST applications. I always ask to see the relational database examples - they always seem to be missing... but it works great with NoSQL solutions that have no transaction support ;)

For me a typical application in the future might be:

  • JPA on a connection pool for updating and managing data changes
  • ADBA over limited connections for heavy read-only data lifting such as reporting

So, I totally agree with you about state of the connection. But I don't believe that is a network problem. It is more a problem at the connection level to ensure chains of callbacks keep the connection in the appropriate state for operations.

@davecramer (Member)

Well, just because Oracle says it's threadsafe doesn't make it so... As you found out, drivers will synchronize requests.

As for PostgreSQL, yes it does create a process per connection, and many things are local to that session. Temporary tables, schemas, transaction, many user configurations...

select * from pg_settings where context = 'user'

shows all of the settings that can be changed in a session.

connection.beginTransaction().submit().outOperation("SELECT * FROM TEST").commit().submit();
connection.countOperation("UPDATE TEST SET NAME = 'SAME'").submit();

If the above allows the update to occur in the transaction indeterminately, then I fail to see how this is useful for anything other than a read only database?

I don't want to hinder this development, but I would like to see people think about this rationally, hence my bringing up some issues.

Cheers

@sagenschneider (Contributor Author)

@davecramer yes, good question and one I have myself regarding the usefulness of ADBA for updating (especially as per my explanation of the JPA advantages in #19 ).

The above example, however, is basically poor code by the application developer.

One key advantage of ADBA over JDBC is possibly faster data loading. Take for example the following:

    var transaction = connection.beginTransaction().submit();
    List<Submission> inserts = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
        inserts.add(transaction.insertOperation("INSERT INTO TEST ( ID ) VALUES (" + i + ")").submit());
    }
    waitAll(inserts).commit().submit();

Note: the inserts here target just one table. However, you could effectively chain together inserts across many tables and have them all start firing off in parallel (as there is no blocking "then" between the insert chains). Then, once all chains of inserts are complete, you complete the transaction.
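The pattern can be sketched with plain CompletableFuture (a hypothetical analogue, not the ADBA API): fire every insert without blocking between them, then commit only once all have completed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntConsumer;

// Fan out 100 inserts asynchronously, then run the commit after all of them
// complete - the CompletableFuture equivalent of waitAll(inserts).commit().
class BatchInsert {
  static int load(IntConsumer insert, Runnable commit) {
    List<CompletableFuture<Void>> inserts = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      final int id = i;
      inserts.add(CompletableFuture.runAsync(() -> insert.accept(id)));
    }
    CompletableFuture.allOf(inserts.toArray(new CompletableFuture[0]))
        .thenRun(commit) // commit fires only after every insert is done
        .join();
    return inserts.size();
  }
}
```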

What I'm trying to say is keep the network layer and "network submissions" separate from the chaining of submissions within the Connection. The protocol/network layer is focused on efficient throughput of the submissions from the Connection layer to the database. The Connection layer is responsible for appropriate state transitions in its chained handling.

@alexanderkjall (Collaborator) commented Aug 2, 2018 via email

@davecramer (Member)

@sagenschneider

What the pgjdbc driver does with your example above is transform the inserts to

insert into foo (id) values (1),(2),(3), (4), ... (n)

See pgjdbc/pgjdbc#491 for some more information.

I'd like to see if your method is actually faster though...

@sagenschneider (Contributor Author)

@alexanderkjall yes, async and threading are not the same thing. OfficeFloor actually provides a multi-threaded event loop (and it was not easy to write). Most asynchronous event loops are single threaded - your GUI being a perfect example of an asynchronous event loop :) NodeJS being another. Please excuse me if bringing up threading is confusing the matter of chaining asynchronous operations together to improve getting them over the wire to the database server. Totally agreeing with what you are saying 👍

@davecramer yes, for this example it might do that; however, a more realistic example might be:

    var transaction = connection.beginTransaction().submit();
    List<Submission> inserts = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
        var user = transaction.insertOperation("INSERT INTO USER ( ID ) VALUES (" + i + ")");
        inserts.add(user.insertOperation("INSERT INTO ADDRESS ( USER_ID, ADDRESS ) VALUES ( ?, '1 Acme Way' )").param(1, user.subscribe("ID")));
        inserts.add(user.insertOperation("INSERT INTO PAY ( USER_ID, AMOUNT ) VALUES ( ?, 'peanuts' )").param(1, user.subscribe("ID")));
    }
    waitAll(inserts).commit().submit();

Basically, the inserts for each user are chained so order is maintained to adhere to foreign keys. However, throughput is increased as the per-user chains can effectively be fired asynchronously in parallel.
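That shape can be sketched with plain CompletableFuture (hypothetical, not the ADBA API): within each user's chain the dependent insert waits on the user id, while the chains themselves run in parallel with each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// For each user, the dependent ADDRESS insert is chained after the USER
// insert (preserving foreign-key order per user), while the per-user chains
// all run concurrently.
class FkChains {
  static List<String> run(int users) {
    List<String> log = new CopyOnWriteArrayList<>();
    List<CompletableFuture<Void>> chains = new ArrayList<>();
    for (int i = 0; i < users; i++) {
      final int id = i;
      chains.add(CompletableFuture
          .supplyAsync(() -> { log.add("USER-" + id); return id; })     // INSERT INTO USER ... returning the id
          .thenAccept(userId -> log.add("ADDRESS-" + userId)));         // INSERT INTO ADDRESS ( USER_ID, ... )
    }
    CompletableFuture.allOf(chains.toArray(new CompletableFuture[0])).join();
    return log;
  }
}
```

Each user's inserts are ordered, but nothing orders user 3's chain against user 7's - which is exactly where the extra throughput comes from.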
