Client side in-memory joins #320

Open
wants to merge 1 commit into
base: main
Collaborator

@efirs efirs commented May 22, 2023

API:

First, create a join object, which specifies the collections and the
join condition:

join := tigris.GetJoin[LeftCollModel, RightCollModel](db, "{left field}", "{right field}")

This creates a join between LeftCollModel and RightCollModel on
equality of {left field} to {right field}.

The created object can then be used to issue one or more read requests:

it, err := join.Read(ctx, filter.Eq("Field1", 1))

The filter condition of the Read API is applied to the left table.
The iterator then returns the rows matching the condition, along with the
corresponding rows from the right table that satisfy the join
condition.

var l LeftCollModel
var r []*RightCollModel

for it.Next(&l, &r) {
  fmt.Printf("l=%v r=%v\n", l, r)
}

By default, documents from the left table that have no matching documents
in the right table are still returned in the results. These results can be
skipped by providing the &JoinOptions{Type: tigris.InnerJoin} option.

It is not required for the left field values or right field values to be unique.

By default, the values of array fields are matched as is. With the
&JoinOptions{ArrayUnroll: true} option, individual array items are
matched against the right table.
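
To make the difference concrete, here is a minimal sketch of how lookup keys could be derived from a left-field value with and without unrolling. The joinKeys helper is hypothetical and is not taken from this PR; rendering keys with %v is a simplification chosen only so that arrays can be compared as a whole.

```go
package main

import "fmt"

// joinKeys is a hypothetical helper: it returns the keys under which a
// left-field value would be looked up in the right table. Keys are rendered
// as strings because Go slices are not comparable and cannot be map keys.
func joinKeys(value any, unroll bool) []string {
	items, isArray := value.([]any)
	if !isArray || !unroll {
		// Scalars, and arrays without unrolling, are matched as a whole.
		return []string{fmt.Sprintf("%v", value)}
	}

	// With unrolling, every array item becomes its own lookup key.
	keys := make([]string, 0, len(items))
	for _, item := range items {
		keys = append(keys, fmt.Sprintf("%v", item))
	}
	return keys
}

func main() {
	tags := []any{"a", "b"}
	fmt.Println(joinKeys(tags, false)) // one key for the whole array
	fmt.Println(joinKeys(tags, true))  // one key per array item
}
```

With unrolling, a left document whose field holds ["a", "b"] can match right documents whose field is "a" or "b"; without it, only a right document holding the same whole array would match.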

Implementation details:

First, a request is issued to the left table with the filter provided to the Read API.
The result is read into memory and a request is prepared for the right table,
with the following filter: filter.Or(filter.Eq("user_id", {value of id field}), ....).
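
A rough sketch of how such a filter could be assembled from the left result follows. Expr, Eq and Or are hypothetical string-based stand-ins for the client's filter package, used here only so the example runs on its own, and the deduplication of values is an assumption rather than something this PR is known to do.

```go
package main

import (
	"fmt"
	"strings"
)

// Expr is a hypothetical stand-in for a filter expression. The real client
// builds these with filter.Eq and filter.Or.
type Expr string

// Eq renders an equality term for one field and value.
func Eq(field string, value any) Expr {
	return Expr(fmt.Sprintf("%s = %v", field, value))
}

// Or joins the given terms into a single disjunction.
func Or(ops ...Expr) Expr {
	parts := make([]string, len(ops))
	for i, op := range ops {
		parts[i] = string(op)
	}
	return Expr("(" + strings.Join(parts, " OR ") + ")")
}

// rightFilter builds one equality term per distinct left-field value,
// keeping the order in which values were first seen.
func rightFilter(rightField string, leftValues []int) Expr {
	seen := make(map[int]bool, len(leftValues))
	terms := make([]Expr, 0, len(leftValues))
	for _, v := range leftValues {
		if seen[v] {
			continue // assumption: repeated values need only one term
		}
		seen[v] = true
		terms = append(terms, Eq(rightField, v))
	}
	return Or(terms...)
}

func main() {
	fmt.Println(rightFilter("user_id", []int{1, 2, 1}))
}
```

The filter grows with the number of distinct left values, and an empty left result would produce an empty Or, so an implementation would presumably skip the second request in that case.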

The result of the first query is put in a map with the ID as the key. While reading the result
of the second query, each row is appended to the corresponding map bucket.
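
The merge step can be sketched roughly as below. User, Order, Joined and hashJoin are hypothetical names chosen for illustration, and the code operates on slices that are already in memory rather than on the actual query results.

```go
package main

import "fmt"

// Hypothetical models, used only for illustration.
type User struct {
	ID   int
	Name string
}

type Order struct {
	UserID int
	Item   string
}

// Joined pairs one left row with every right row that matched it.
type Joined struct {
	Left  User
	Right []Order
}

// hashJoin merges two already-fetched result sets in memory.
// Left rows that share a key share a bucket, so keys need not be unique.
func hashJoin(left []User, right []Order, inner bool) []Joined {
	out := make([]Joined, 0, len(left))
	buckets := make(map[int][]int) // join key -> indexes into out

	// First result: index every left row by its join key.
	for _, l := range left {
		buckets[l.ID] = append(buckets[l.ID], len(out))
		out = append(out, Joined{Left: l})
	}

	// Second result: append each right row to every left row in its bucket.
	for _, r := range right {
		for _, i := range buckets[r.UserID] {
			out[i].Right = append(out[i].Right, r)
		}
	}

	if !inner {
		return out // default: unmatched left rows are kept
	}

	// Inner join: drop left rows that collected no right rows.
	matched := out[:0]
	for _, j := range out {
		if len(j.Right) > 0 {
			matched = append(matched, j)
		}
	}
	return matched
}

func main() {
	users := []User{{1, "ann"}, {2, "bob"}}
	orders := []Order{{1, "book"}, {1, "pen"}}
	for _, j := range hashJoin(users, orders, false) {
		fmt.Printf("l=%v r=%v\n", j.Left, j.Right)
	}
}
```

Both result sets are held in memory at the same time in this sketch, which is where the size limitation comes from.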

Because the merge is done in memory, joins should be used for relatively small
result sets only.

@efirs
Collaborator Author

efirs commented May 22, 2023

Fixes #280

@codecov

codecov bot commented May 22, 2023

Codecov Report

Patch coverage: 75.10% and project coverage change: -0.53 ⚠️

Comparison is base (d1a4583) 84.93% compared to head (ff7f73f) 84.41%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
- Coverage   84.93%   84.41%   -0.53%     
==========================================
  Files          37       38       +1     
  Lines        3406     3631     +225     
==========================================
+ Hits         2893     3065     +172     
- Misses        365      401      +36     
- Partials      148      165      +17     
| Impacted Files | Coverage Δ |
|---|---|
| schema/schema.go | 84.90% <ø> (ø) |
| tigris/join.go | 72.18% <72.18%> (ø) |
| tigris/collection.go | 71.67% <72.41%> (+1.27%) ⬆️ |
| tigris/iterator.go | 80.00% <86.66%> (+3.75%) ⬆️ |
| driver/iterator.go | 84.21% <100.00%> (+0.28%) ⬆️ |
| filter/filter.go | 84.21% <100.00%> (ø) |


@efirs efirs force-pushed the joins branch 2 times, most recently from 2c30323 to ea15152 Compare May 22, 2023 20:07
@efirs efirs force-pushed the joins branch 5 times, most recently from c02118c to 94d8076 Compare June 4, 2023 00:46
@efirs efirs changed the title Client side in memory joins Client side in-memory joins Jun 6, 2023
API:

First, create a join object, which specifies the collections and the
join condition:
```
join := tigris.GetJoin[LeftCollModel, RightCollModel](db, "{left field}", "{right field}", [options])
```
This creates a join between LeftCollModel and RightCollModel on
equality of `{left field}` to `{right field}`.

The created object can then be used to issue one or more read requests:

```
it, err := join.Read(ctx, filter.Eq("Field1", 1))
```

The filter condition of the Read API is applied to the left table.
The iterator then returns the rows matching the condition, along with the
corresponding rows from the right table that satisfy the join
condition.

var l LeftCollModel
var r []*RightCollModel

for it.Next(&l, &r) {
  fmt.Printf("l=%v r=%v\n", l, r)
}

By default, documents from the left table that have no matching documents
in the right table are still returned in the results. These results can be
skipped by providing the `&JoinOptions{Type: tigris.InnerJoin}` option to the GetJoin API.

It is not required for the left field values or right field values to be unique.

By default, the values of array fields are matched as is. With the
`&JoinOptions{ArrayUnroll: true}` option, individual array items are
matched against the right table.

Implementation details:

First, a request is issued to the left table with the filter provided to the Read API.
The result is read into memory and a request is prepared for the right table,
with the following filter: `filter.Or(filter.Eq("{right field}", {left field value fetched by left query}), ...)`.

The result of the first query is put in a map with the {left field} value as the key.
While reading the result of the second query, each row is appended to the corresponding map bucket.

Because the merge is done in memory, joins should be used for relatively small
result sets only.
@garrensmith
Contributor

The API and code look good. However, I'm not confident this should be in the client library. If we merge this in, we have to do the same for the TS and Python libraries, and for any other client library we support. The other concern is, as you mentioned in the description, that if the results exceed the memory of the user's application, it will OOM. This puts a lot of responsibility on users to understand their data. For example, they could have a small data set that works great for months, then the data set suddenly grows significantly and their whole application crashes unexpectedly in production. A smaller nit is that because we are reading a lot from both collections, it will also increase the user's network costs.

I would prefer that we implement something like this on the server side. We could look at zig-zag joins to reduce the memory footprint and make sure that this scales smoothly. This puts the compute and memory risk on our servers, but that is why we are the database, and we should look to handle it as best we can.

@efirs
Collaborator Author

efirs commented Jun 7, 2023

The idea is to provide a simple API that solves common use cases.

Server-side joins have multiple issues:

  • Generic SQL joins are very complex to implement and require heuristic algorithms, as you mentioned.
  • They are complex to operate, requiring a balance between memory consumption and spilling to file-based sorting.
  • A single misbehaving user affects all others and gives the on-call a severe headache.

Client-side in contrast:

  • Scales nicely as the number of clients grows.
  • The amount of memory used is exactly as much as is needed to hold the result on the client.
  • A misbehaving client penalizes only itself, without affecting others.
  • No server tuning required. On-call sleeps well.

@ovaistariq
Collaborator

I agree with @garrensmith that this is better suited on the server side. However, either way, we don't need this feature now.


3 participants