Client side in-memory joins #320

efirs · 2023-05-22T01:13:58Z

API:

First, a join object needs to be created, which specifies the collections and
the join condition:

join := tigris.GetJoin[LeftCollModel, RightCollModel](db, "{left field}", "{right field}")

This creates a join between LeftCollModel and RightCollModel on
equality of {left field} to {right field}.

The created object can then be used to issue one or more read requests:

it, err := join.Read(ctx, filter.Eq("Field1", 1))

The filter condition of the Read API is applied to the left table.
The iterator then returns the rows matching the condition along with the
corresponding rows from the right table that satisfy the join
condition.

var l LeftCollModel
var r []*RightCollModel

for it.Next(&l, &r) {
  fmt.Printf("l=%v r=%v\n", l, r)
}

By default, documents that have no matching documents in the right
table are still returned in the results. These results can be skipped by providing
the &JoinOptions{Type: tigris.InnerJoin} option to the GetJoin API.
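
To illustrate the difference between the default behaviour and the inner join, here is a minimal, self-contained sketch. It does not use the Tigris client: `JoinType`, `LeftJoin`, `User`, `Order`, `Pair` and `joinUsersOrders` are hypothetical names made up for this example, and only the semantics described above are modelled.

```go
package main

import "fmt"

// JoinType is a hypothetical stand-in for the join type option in this PR.
type JoinType int

const (
	LeftJoin  JoinType = iota // keep left rows that have no match (default)
	InnerJoin                 // drop left rows that have no match
)

// User and Order are hypothetical models used only for this illustration.
type User struct {
	ID   int
	Name string
}

type Order struct {
	UserID int
	Item   string
}

// Pair holds one left row with all right rows that matched it.
type Pair struct {
	Left  User
	Right []Order
}

// joinUsersOrders matches users to orders on User.ID == Order.UserID.
func joinUsersOrders(users []User, orders []Order, typ JoinType) []Pair {
	byUser := make(map[int][]Order)
	for _, o := range orders {
		byUser[o.UserID] = append(byUser[o.UserID], o)
	}

	var out []Pair
	for _, u := range users {
		matches := byUser[u.ID]
		if typ == InnerJoin && len(matches) == 0 {
			continue // inner join skips unmatched left rows
		}
		out = append(out, Pair{Left: u, Right: matches})
	}
	return out
}

func main() {
	users := []User{{1, "alice"}, {2, "bob"}}
	orders := []Order{{1, "book"}, {1, "pen"}}

	// bob has no orders: kept by the default join, dropped by the inner join.
	fmt.Println(len(joinUsersOrders(users, orders, LeftJoin)))
	fmt.Println(len(joinUsersOrders(users, orders, InnerJoin)))
}
```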

It is not required for the left field values or right field values to be unique.

Values of array fields are matched as is by default. With the
&JoinOptions{ArrayUnroll: true} option, individual array items can be
matched in the right table.
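
A rough sketch of what the two array modes mean, under the assumption that "as is" compares the whole array to the right field value while unrolling compares each item separately. `Right` and `matchArray` are hypothetical names for this illustration only and are not part of the client.

```go
package main

import (
	"fmt"
	"reflect"
)

// Right is a hypothetical right-table row; Key holds the join field value.
type Right struct {
	Key  any
	Name string
}

// matchArray returns the right rows that match a left array field value.
func matchArray(left []string, rights []Right, unroll bool) []Right {
	var out []Right
	for _, r := range rights {
		if !unroll {
			// Default: the array is compared as one value.
			if reflect.DeepEqual(left, r.Key) {
				out = append(out, r)
			}
			continue
		}
		// Unrolled: any single array item may match the right field.
		for _, item := range left {
			if reflect.DeepEqual(item, r.Key) {
				out = append(out, r)
				break
			}
		}
	}
	return out
}

func main() {
	rights := []Right{
		{Key: "go", Name: "scalar go"},
		{Key: "db", Name: "scalar db"},
		{Key: []string{"go", "db"}, Name: "whole array"},
	}
	left := []string{"go", "db"}

	for _, r := range matchArray(left, rights, false) {
		fmt.Println("as is:", r.Name)
	}
	for _, r := range matchArray(left, rights, true) {
		fmt.Println("unrolled:", r.Name)
	}
}
```

In this model the left value `["go", "db"]` matches only the row whose key is the same array by default, and matches the two scalar rows when unrolled.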

Implementation details:

First, a request is issued to the left table with the filter provided to the Read API.
The result is read into memory and a request is prepared for the right table,
with the following filter: filter.Or(filter.Eq("{right field}", {left field value fetched by left query}), ...).

The result of the first query is put in a map with the {left field} value as the key. While reading the result
of the second query, we append each row to the corresponding map bucket.

Because the merge is done in memory, joins should be used for relatively small
result sets only.
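
The two-phase flow described above can be sketched roughly as follows. This is a simplified model, not the code in tigris/join.go: `bucket`, `hashJoin`, `userRow` and `orderRow` are hypothetical names, and `fetchRight` stands in for the second query, receiving the distinct left keys that the real request would turn into an OR of equality filters.

```go
package main

import "fmt"

// bucket holds one left row and the right rows appended to it.
type bucket[L, R any] struct {
	Left  L
	Right []R
}

// hashJoin reads the left rows into a map keyed by the join value, fetches
// the right rows for those keys, and appends each right row to its buckets.
func hashJoin[L, R any, K comparable](
	left []L,
	leftKey func(L) K,
	fetchRight func(keys []K) []R,
	rightKey func(R) K,
) []*bucket[L, R] {
	index := make(map[K][]*bucket[L, R]) // left keys do not have to be unique
	var rows []*bucket[L, R]             // preserves the left result order
	var keys []K                         // distinct keys for the second query

	for _, l := range left {
		k := leftKey(l)
		if _, seen := index[k]; !seen {
			keys = append(keys, k)
		}
		b := &bucket[L, R]{Left: l}
		index[k] = append(index[k], b)
		rows = append(rows, b)
	}

	for _, r := range fetchRight(keys) {
		for _, b := range index[rightKey(r)] {
			b.Right = append(b.Right, r)
		}
	}
	return rows
}

// userRow and orderRow are hypothetical models for this illustration.
type userRow struct {
	ID   int
	Name string
}

type orderRow struct {
	UserID int
	Item   string
}

func main() {
	users := []userRow{{1, "alice"}, {2, "bob"}}
	orders := []orderRow{{1, "book"}, {1, "pen"}, {3, "lamp"}}

	// Emulates the OR-of-equalities filter against the right table.
	fetch := func(keys []int) []orderRow {
		var out []orderRow
		for _, o := range orders {
			for _, k := range keys {
				if o.UserID == k {
					out = append(out, o)
					break
				}
			}
		}
		return out
	}

	res := hashJoin(users,
		func(u userRow) int { return u.ID },
		fetch,
		func(o orderRow) int { return o.UserID })
	for _, b := range res {
		fmt.Printf("l=%v r=%v\n", b.Left, b.Right)
	}
}
```

Both result sets live in memory at the same time in this sketch, which is why the size of the result matters.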

efirs · 2023-05-22T01:14:35Z

Fixes #280

codecov · 2023-05-22T01:28:47Z

Codecov Report

Patch coverage: 75.10% and project coverage change: -0.53 ⚠️

Comparison is base (d1a4583) 84.93% compared to head (ff7f73f) 84.41%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
- Coverage   84.93%   84.41%   -0.53%     
==========================================
  Files          37       38       +1     
  Lines        3406     3631     +225     
==========================================
+ Hits         2893     3065     +172     
- Misses        365      401      +36     
- Partials      148      165      +17

| Impacted Files | Coverage Δ | |
|---|---|---|
| schema/schema.go | `84.90% <ø> (ø)` | |
| tigris/join.go | `72.18% <72.18%> (ø)` | |
| tigris/collection.go | `71.67% <72.41%> (+1.27%)` | ⬆️ |
| tigris/iterator.go | `80.00% <86.66%> (+3.75%)` | ⬆️ |
| driver/iterator.go | `84.21% <100.00%> (+0.28%)` | ⬆️ |
| filter/filter.go | `84.21% <100.00%> (ø)` | |



garrensmith · 2023-06-06T08:49:04Z

The API and code look good. However, I'm not confident this should be in the client library. If we merge this in, we have to do the same for the TS and Python libraries, and for any other client library we support. The other concern, as you mentioned in the description, is that if the results exceed the user's application memory, it will OOM. This puts a lot of responsibility on users to understand their data. For example, they could have a small data set that works great for months; then the data set suddenly grows significantly and their whole application crashes unexpectedly in production. A smaller nit is that, because we are reading a lot from both collections, it will also increase the user's network costs.

I would prefer that we implement something like this on the server side. We could look at zig-zag joins to reduce the memory footprint and make sure that this scales smoothly. This puts compute and memory risk on our servers, but that is why we are the database, and we should look to handle it as best we can.

efirs · 2023-06-07T02:21:48Z

The idea is to provide a simple API that solves common use cases.

Server-side joins have multiple issues:

Generic SQL joins are very complex to implement and require heuristic algorithms, as you mentioned.
They are complex to operate: memory consumption has to be balanced against overflowing to file sorting.
A single misbehaving user affects all others and gives a severe headache to the on-call.

Client-side in contrast:

It scales nicely as the number of clients grows.
The amount of memory used is exactly as much as needed to keep the result on the client.
A misbehaving client is self-penalizing and does not affect others.
No server tuning is required. On-call sleeps well.

ovaistariq · 2023-06-07T04:37:23Z

I agree with @garrensmith that this is better suited on the server side. However, either way, we don't need this feature now.

efirs force-pushed the joins branch 2 times, most recently from 2c30323 to ea15152 Compare May 22, 2023 20:07

efirs force-pushed the joins branch 5 times, most recently from c02118c to 94d8076 Compare June 4, 2023 00:46

efirs changed the title ~~Client side in memory joins~~ Client side in-memory joins Jun 6, 2023

efirs force-pushed the joins branch from 94d8076 to ff7f73f Compare June 6, 2023 00:13
