Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are reads guaranteed to see previous writes in same transaction? #126

Open
3noch opened this issue Jun 30, 2019 · 18 comments
Open

Are reads guaranteed to see previous writes in same transaction? #126

3noch opened this issue Jun 30, 2019 · 18 comments

Comments

@3noch
Copy link

3noch commented Jun 30, 2019

I listened to an interview about Noria in which it was explained that a writer may actually not immediately witness it's own writes when reading. Surely this is not true within the same transaction?

@jonhoo
Copy link
Contributor

jonhoo commented Jul 1, 2019

Hi! Noria does not have transactions in the traditional sense. Specifically, as the client you cannot say "read this, write this, read this, write this" as an atomic unit. If you are using transactions to maintain derived values, Noria can do that correctly for you, but there are no general-purpose transactions. So you are indeed not guaranteed to see your own writes.

That said, we are looking at ways in which we might provide a feature like this, but it is currently an open research question how to do this efficiently.

@3noch
Copy link
Author

3noch commented Jul 1, 2019

Thanks very much for the clarification. I noticed the absence of the term "transaction" in the documentation and wondered what that could mean. It would probably be worth clarifying this early on in the README etc. Lack of transactions makes Noria nonviable for my use case.

About the problem itself: Transactions are so important I'd even think about how to do them inefficiently. For a traditional web app the vast majority of reads are not in transactions. But when you need a transaction, you really need it.

@3noch
Copy link
Author

3noch commented Jul 1, 2019

The alternative, of course, is to redesign your schema to handle your atomic needs manually, but this then means that Noria is no longer a feasible tool in existing applications.

@jonhoo
Copy link
Contributor

jonhoo commented Jul 1, 2019

Noria is also a research prototype, so we're not expecting anyone to pick it up and use it in production any time soon :)

As for inefficiently supporting transactions, the model is such that there isn't a straightforward way to do that. It requires some deeper changes, as Noria is specifically engineered to not require coordination. Or, phrased somewhat differently, we have found that the web applications we've looked at haven't generally needed transactions, so we decided not to prioritize them. In fact, most of the applications we've looked at work just fine even with eventual consistency (e.g., they frequently employ a cache), so that's the model Noria also targets. As mentioned though, we are looking into how that might be done without adding undue cost to the common non-transactional case.

I'd be curious to hear examples of where you need a transactions in a web application context. We've found that a decent number of transactions can instead be encoded in the query graph, which provides many of the same guarantees as a transaction would.

@3noch
Copy link
Author

3noch commented Jul 1, 2019

If there's an alternative encoding that will work then I'd be happy to know about it. Here are a few examples of where transactions are very useful:

  • An email queue: You want a worker thread to send an email from a queue, but only if it's not been sent before. You can read from the queue, fork a sending thread, and then write to the queue that you've started processing that email. This prevents other threads/servers from trying to send the same email.
  • Doing a single "logical" read but over several queries: This is really a symptom of SQL's lack of expressiveness but selecting "sum" types out of a table can be quite miserable. What you often end up doing is just splitting your query into multiple selects and parsing the results separately. Doing this in a single select would require that you unify your projection across all cases, which can be quite tricky at the SQL level.

@3noch
Copy link
Author

3noch commented Jul 1, 2019

I can imagine that certain "categories" of transaction would be much easier to implement. For example, all permutations of the "read/write" ordering:

  • read then write
  • write then read
  • read then read
  • write then write

If you can support even some of these, it would be a huge win.

@jonhoo
Copy link
Contributor

jonhoo commented Jul 1, 2019

For the e-mail queue, I think the equivalent encoding is to define a view that is "job" left joined with a "SELECT job, worker FROM claims ORDER BY id ASC LIMIT 1". Read the first non-claimed job out of that view, and then do an insert into claims with that job's id and your worker id. Then read your chosen job until you see that job as claimed. If it is recorded as claimed by you, you now own it, otherwise it was claimed by someone else first, and you retry.

For the single logical read, I'm not entirely sure what you mean? If it's just "selecting from multiple tables", then I don't think you've told me why you need the reads from the different tables to be consistent with one another? Why can't you just do the reads separately?

I think read-your-own-writes is the first thing we're going to tackle, because it seems like an obviously useful use-case. I personally prefer to think about explicit use-cases rather than "we should have these primitives", since that motivates the work better, and avoids us implementing something that isn't needed in practice :) This is especially true for Noria where we are not aiming for it to be a general-purpose database. It is specifically only intended to replace the backends for ready-heavy, interactive web applications. Arguably your e-mail queue example doesn't fit that for example — you'd probably be better off using a specialized job queue management system like Faktory for that.

@3noch
Copy link
Author

3noch commented Jul 1, 2019

The multiple-reads scenario is a bit nuanced: Presuming you're familiar with a tiny bit of Haskell, let's imagine you want to read out of a table into this type: [Either ARecord BRecord]. This data could be in the same table with some sort of tagged union (a record_type column, perhaps) or, more likely, it's in two tables with some common third table. Regardless, what you want to do is effectively

lefts <- query("SELECT id, aField1, aField2 WHERE record_type = a ORDER BY id")
rights <- query("SELECT id, bField1, bField2 WHERE record_type = b ORDER BY id")
return (sortOn idField (map Left lefts ++ map Right rights))

Surely contrived in this case, but I've seen more complex use cases arise like this. However, try doing this in a single query. It's a real pain! However, since, in the end, these two queries are 1) querying the same data and 2) ending up in the same structure in the application, the possibility of them representing different points in time is quite troubling in some uses cases. Transactions make this a non-issue.

@3noch
Copy link
Author

3noch commented Jul 1, 2019

Yes the polling option would work, but of course that's not very pleasant. The reads would be super fast, but you'd be delayed by whatever polling frequency you chose. If Noria could tell you that your query had changed (even without telling you what had changed), then we could avoid that. I'm pretty sure Noria has the infra to do that, no?

@3noch
Copy link
Author

3noch commented Jul 1, 2019

Re: Faktory. I don't like splitting out work queues from the normal DB for the exact reason that I can't do anything transactional across the "platform" boundary. I.e. I can't write an email queue that also joins on the user's "enabled" field in a transactional way.

@jonhoo
Copy link
Contributor

jonhoo commented Jul 1, 2019

Yup, Noria (at least in theory) could totally tell you when the results change. It's not something that's currently supported, but shouldn't be too hard to add. It certainly meshes well with the model.

As for your multi-read query, in Noria you would just do those two queries separately. There's no need for a single query. Noria does assume however that it's okay for them to not happen at the exactly same logical time. Can you describe why having them happen at slightly different times would be an issue for your application?

I guess we disagree somewhat on that point, but it's probably not relevant to the discussion. Job queue management is hard enough to get right that it's usually a good idea to use a separate system, just like search is often better handled by an external system like Lucene. It is true that that means you can't do transactions across it, but I think that at large scale you probably won't be able to do these kinds of transactions anyway (see Google's Spanner for example, which only has row-level transactions). Getting some concrete examples of why you need these transactions would be helpful though!

@3noch
Copy link
Author

3noch commented Jul 1, 2019

Well...with Noria's speed the threshold for needing to move to a different system could get a lot higher, meaning a lot fewer apps need to actually use something other than a single DB!

If Noria gave me a ping when I needed to reread my query, then you could easily get back "writers see their own writes." The client side just becomes a traditional async library with futures or something. The write would have to be given some sort of unique ID sent back so that the listener could wait for the right write.

@jonhoo
Copy link
Contributor

jonhoo commented Jul 1, 2019

The issue with giving read-your-own-writes currently is actually that views don't have a well-defined time for you to query. Even if we gave an ID back with every write, that doesn't help unless you can also query a view for what IDs it includes. And it turns out that maintaining that information is actually pretty tricky! Hence, an open research question :p

@3noch
Copy link
Author

3noch commented Jul 1, 2019

IIRC from the interview, you do know when a particular write gets applied to a view though? Probably too complex to discuss here. LMK if I should convert this discussion into anything: issues, docs, etc.

@3noch
Copy link
Author

3noch commented Jul 1, 2019

Another idea is just for Noria to support being a MySQL replication server. Then you can put all your transactions on the master and use Noria for basic reads. Not sure if this undermines anything...

@jonhoo
Copy link
Contributor

jonhoo commented Jul 1, 2019

We don't actually! Or rather, Noria propagates all effects of writes throughout all views, but it doesn't know which write any given update corresponds to. A view learns, for example, that the vote count for some article has changed from 7 to 8, but it doesn't know that this was caused by a write with id AA73C82, for example.

I don't think there's much to do in terms of issues/docs here. This is all in the realm of ongoing/potential research, so those likely won't help much. It's useful to learn about particular common use-cases we should keep in mind going forward though :)

There is, at least in theory, nothing stopping you from using MySQL as the backing store for Noria as long as you can produce a stream of updates to base tables and are willing to have Noria re-materialize the inputs. If you want to exclusively use MySQL for storage, that raises some biggest questions about how Noria maintains invariants when the backing data can change between when you receive an update and when you query the backing data (e.g., for a join), which we haven't looked into very closely.

@ms705
Copy link
Member

ms705 commented Jul 1, 2019

It's worth pointing out that there exist other dataflow systems that do have the information to track particular writes and when they're applied to views. For example, differential dataflow has a centralized progress tracking protocol based on partially-ordered logical times, although doing this comes at some scalability cost (at fine granularity, every update must coordinate with the centralized progress tracker at every vertex). With that information, you can build transactions on top of a dataflow system.

In Noria, we've deliberately omitted such logical timestamps (or more coarse-grained watermarks), as we wanted to see what the maximum performance and scalability attainable are, and then add stronger notions of ordering while measuring the overhead they impose. In principle, concepts from multi-version concurrency control (MVCC) apply here, and should make it feasible to implement at least restriction notions of transactions at very modest overhead cost. (Indeed, an earlier Noria prototype had support for optimistically-concurrent transactions, but we removed that code to make optimizing the first version easier.)

@3noch
Copy link
Author

3noch commented Jul 1, 2019

Very cool. Thanks for explaining the trajectory! Noria is a really exciting project. I see it as a potential low-cost solution to the CQRS problem too, which would be all but life changing to most web apps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants