
Inconsistent execution results #12

Closed
IcanDivideBy0 opened this issue Feb 27, 2020 · 19 comments

IcanDivideBy0 commented Feb 27, 2020

I've noticed that sometimes batching works and sometimes it doesn't... I'm using a very simple schema at the moment, and sending the same request multiple times can give different results in how the loader is executed:

query foo {
  orders {
    id
    user { id }
  }
}

I'm using a loader on the orders.user field:

#[juniper::graphql_object(Context = Context)]
impl Order {
    fn id(&self) -> juniper::ID {
        self.id.to_string().into()
    }

    async fn user(&self, ctx: &Context) -> types::User {
        ctx.loaders.user.load(self.user.clone()).await.unwrap()
    }
}

and the UserLoader is basically a copy-paste of the example in the juniper docs.
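
For reference, the loader hangs off the per-request context roughly like this (a sketch; the Loaders struct and the UserLoader alias are assumptions inferred from the snippets in this issue):

pub struct Loaders {
    // Built around the BatchFn impl from the juniper docs example.
    pub user: UserLoader,
}

pub struct Context {
    pub loaders: Loaders,
}

impl juniper::Context for Context {}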

The DB contains 2 orders, both having the same user. Here are logs from the same request sent multiple times:

 DEBUG hyper::proto::h1::io                > read 643 bytes
 DEBUG hyper::proto::h1::io                > parsed 14 headers
 DEBUG hyper::proto::h1::conn              > incoming body is content-length (109 bytes)
 DEBUG hyper::proto::h1::conn              > incoming body completed
 DEBUG syos::graphql::utils::loaders::user > load batch [ObjectId(593fdd2ba9c0edf74ff0b38c), ObjectId(593fdd2ba9c0edf74ff0b38c)]
 INFO  GraphQL                             > 127.0.0.1:45080 "POST /graphql HTTP/1.1" 200 "http://127.0.0.1:4444/graphiql" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36" 9.229761ms
 DEBUG hyper::proto::h1::io                > flushed 372 bytes


 DEBUG hyper::proto::h1::io                > read 643 bytes
 DEBUG hyper::proto::h1::io                > parsed 14 headers
 DEBUG hyper::proto::h1::conn              > incoming body is content-length (109 bytes)
 DEBUG hyper::proto::h1::conn              > incoming body completed
 DEBUG syos::graphql::utils::loaders::user > load batch [ObjectId(593fdd2ba9c0edf74ff0b38c)]
 DEBUG syos::graphql::utils::loaders::user > load batch [ObjectId(593fdd2ba9c0edf74ff0b38c)]
 INFO  GraphQL                             > 127.0.0.1:45080 "POST /graphql HTTP/1.1" 200 "http://127.0.0.1:4444/graphiql" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36" 12.157163ms
 DEBUG hyper::proto::h1::io                > flushed 372 bytes


 DEBUG hyper::proto::h1::io                > read 643 bytes
 DEBUG hyper::proto::h1::io                > parsed 14 headers
 DEBUG hyper::proto::h1::conn              > incoming body is content-length (109 bytes)
 DEBUG hyper::proto::h1::conn              > incoming body completed
 DEBUG syos::graphql::utils::loaders::user > load batch [ObjectId(593fdd2ba9c0edf74ff0b38c), ObjectId(593fdd2ba9c0edf74ff0b38c)]
 INFO  GraphQL                             > 127.0.0.1:45080 "POST /graphql HTTP/1.1" 200 "http://127.0.0.1:4444/graphiql" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36" 12.120952ms
 DEBUG hyper::proto::h1::io                > flushed 372 bytes


 DEBUG hyper::proto::h1::io                > read 643 bytes
 DEBUG hyper::proto::h1::io                > parsed 14 headers
 DEBUG hyper::proto::h1::conn              > incoming body is content-length (109 bytes)
 DEBUG hyper::proto::h1::conn              > incoming body completed
 DEBUG syos::graphql::utils::loaders::user > load batch [ObjectId(593fdd2ba9c0edf74ff0b38c), ObjectId(593fdd2ba9c0edf74ff0b38c)]
 INFO  GraphQL                             > 127.0.0.1:45080 "POST /graphql HTTP/1.1" 200 "http://127.0.0.1:4444/graphiql" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36" 10.009887ms
 DEBUG hyper::proto::h1::io                > flushed 372 bytes

We can see that sometimes load batch is called once with both ids, and sometimes it is called twice with the same id.

More info:

  • tokio runtime (tried both threaded_scheduler and basic_scheduler with the same result)
  • warp http server
  • juniper master branch
  • mongodb database (for which the driver is not async yet)
  • context is created per request (and so are loaders)

I can provide more code if necessary, or maybe even a simple repo to reproduce the issue.

IcanDivideBy0 changed the title from "Inconsistent results" to "Inconsistent execution results" on Feb 27, 2020
IcanDivideBy0 (Author)

OK, playing around with this a bit more: it feels like this might be caused by the database answering too fast, before the juniper executor gets to resolve the second order, so this might be a non-issue on a real production server.

IcanDivideBy0 (Author)

After some more thought, this issue can't be related to the database or to anything in my BatchFn impl; the BatchFn::load calls should be batched regardless.

When calling the loader from a non-async resolver, things are batched properly 100% of the time, e.g.:

fn user(&self, ctx: &Context) -> Option<types::User> {
    let fut = ctx.loaders.user.load(self.user.clone());
    // Resolve the load future on a separate task instead of awaiting
    // it in the resolver (just a test, so the result is discarded).
    tokio::task::spawn(async {
        let _user = fut.await.unwrap();
    });

    None
}

So now I'm very suspicious of this line.

Maybe rescheduling at the event loop is not enough; juniper might not have called the resolver of the neighbor field yet?

IcanDivideBy0 (Author)

Adding a delay after the LoadFuture creation gives correct results.

    async fn user(&self, ctx: &Context) -> types::User {
        let fut = ctx.loaders.user.load(self.user.clone());
        // Even a 1 ns delay re-schedules this task at the back of the
        // event loop, giving sibling resolvers time to enqueue their keys.
        tokio::time::delay_for(std::time::Duration::from_nanos(1)).await;
        fut.await.unwrap()
    }

The whole delay logic inside LoadFuture::poll can then even be removed entirely:

pub struct LoadFuture<K, V, E, F> {
    id: usize,
    stage: Stage,
    state: Arc<Mutex<State<K, Result<V, LoadError<E>>, F, BatchFuture<V, E>>>>,
}

impl<K, V, E, F> Future for LoadFuture<K, V, E, F>
where
    E: Clone,
    F: BatchFn<K, V, Error = E>,
{
    type Output = Result<V, LoadError<E>>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
        let state = self.state.clone();
        let mut st = state.lock().unwrap();

        // Value already loaded by a completed batch: hand it out.
        if st.loaded_vals.contains_key(&self.id) {
            self.stage = Stage::Finished;
            return Poll::Ready(st.loaded_vals.remove(&self.id).unwrap());
        }

        // Key already queued in a pending batch: drive that batch.
        if let Some(batch_id) = st.loading_ids.get(&self.id) {
            let batch_id = *batch_id;
            ready!(st.poll_batch(cx, batch_id));
            self.stage = Stage::Finished;
            return Poll::Ready(st.loaded_vals.remove(&self.id).unwrap());
        }

        // Otherwise dispatch a new batch with this key and drive it.
        let batch_id = st.dispatch_new_batch(self.id);
        ready!(st.poll_batch(cx, batch_id));
        self.stage = Stage::Finished;
        Poll::Ready(st.loaded_vals.remove(&self.id).unwrap())
    }
}

IcanDivideBy0 (Author) commented Feb 29, 2020

From what I understand, the best place to put a delay to push back on the execution stack is in juniper itself, here, right before the resolve_field_async call, while keeping the delay here in LoadFuture::poll. But it doesn't make much sense to put that in juniper just for the purposes of this crate.

Does somebody have any thoughts on this?

kolloch commented Mar 17, 2020

Instead of the delay, I would propose yield_now:

https://docs.rs/tokio/0.2.13/tokio/task/fn.yield_now.html
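
Applied to the resolver shown earlier in this thread, that would look something like this (same shape as the delay_for workaround, just yielding instead; a sketch, untested):

async fn user(&self, ctx: &Context) -> types::User {
    let fut = ctx.loaders.user.load(self.user.clone());
    // Yield once so the executor can poll sibling resolvers, letting
    // them enqueue their keys before this batch future runs.
    tokio::task::yield_now().await;
    fut.await.unwrap()
}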

I literally just had a look at this crate 10 minutes ago, so I'm not ready to comment in more detail, but I'd also be interested in consistent behavior with async. I don't see a conceptual need for non-determinism there (but I might be wrong).

cksac (Owner) commented Mar 23, 2020

I released a new version, 0.8, using async/await, with some breaking changes. See if you want to try it out.

IcanDivideBy0 (Author)

OK, I tested with 0.8.
From what I see, the dataloader is now consistently broken :/ I can't manage to get a single batch functioning.

Also, the new API using async_trait makes it really hard to work with generics; the only way to get it to compile without the parameter type `T` may not live long enough error is:

pub struct ModelBatcher;

impl<T> BatchFn<ObjectId, Option<T>> for ModelBatcher
where
    T: Model,
{
    type Error = ();

    fn load<'a, 'b, 'c>(
        &'a self,
        keys: &'b [ObjectId],
    ) -> Pin<Box<dyn Future<Output = HashMap<ObjectId, Result<Option<T>, Self::Error>>> + Send + 'c>>
    where
        'a: 'c,
        'b: 'c,
        Self: 'c,
    {
        log::debug!("load batch {:?}", keys);

        Box::pin(async move { T::load_many(&keys).await })
    }
}

which, you'll agree, feels really messy.

cksac (Owner) commented Mar 24, 2020

Thanks for testing.
It seems there are some limitations with async_trait.
If requiring T: 'static is OK, then the impl becomes much cleaner:

#[async_trait]
impl<T> BatchFn<ObjectId, Option<T>> for ModelBatcher
where
    T: Model + 'static
{
    type Error = ();

    async fn load(&self, keys: &[ObjectId]) -> HashMap<ObjectId, Result<Option<T>, Self::Error>> {
        println!("load batch {:?}", keys);
        T::load_many(&keys).await
    }
}

For the loader part, could you share your test code?

IcanDivideBy0 (Author)

For the async_trait part, following this issue, maybe something can be done in the crate directly by adding where T: 'async_trait to the BatchFn trait.
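
A rough sketch of what that could look like on the trait side (the bound placement here is a guess, not the crate's actual definition):

#[async_trait]
pub trait BatchFn<K, V> {
    type Error;

    // Bounds tied to the 'async_trait lifetime would let impls add
    // matching bounds on their own type parameters.
    async fn load(&self, keys: &[K]) -> HashMap<K, Result<V, Self::Error>>
    where
        K: 'async_trait,
        V: 'async_trait;
}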

Here is a reproduction of the bug: https://github.com/IcanDivideBy0/dataloader-bug
The first loader call batch loads properly, but when getting to the 2nd level (the friend field) nothing is batched anymore.

IcanDivideBy0 (Author)

I really think the issue is more related to how juniper resolves its tree. It looks like, once a future is resolved, juniper doesn't wait for all futures at the same level to complete before descending further into the tree.

cksac (Owner) commented Mar 24, 2020

@IcanDivideBy0 thanks for the demo project to reproduce.

For the lifetime issue: I updated BatchFn on the master branch, so you can now add T: 'async_trait:

#[async_trait]
impl<T> BatchFn<ObjectId, Option<T>> for ModelBatcher
where
    T: Model,
{
    type Error = ();

    async fn load(&self, keys: &[ObjectId]) -> HashMap<ObjectId, Result<Option<T>, Self::Error>>
    where
        T: 'async_trait,
    {
        println!("load batch {:?}", keys);
        T::load_many(&keys).await
    }
}

IcanDivideBy0 (Author)

@cksac thanks! BatchFn is way more pleasant to use with generics now :) 👍

cksac (Owner) commented Mar 24, 2020

I updated the examples to call the loader at the same level.
juniper is able to batch loads at the same level, but async-graphql is not able to batch at all.
I think how the fields get resolved really matters, and I'm not sure what can be done in this crate.

cksac (Owner) commented Mar 24, 2020

I updated the example to use dataloader version 0.6.0 / 0.7.0, and it works as expected.
https://github.com/cksac/dataloader-bug

IcanDivideBy0 (Author)

That's good news! It looks like this is something that can finally be fixed here.

IcanDivideBy0 (Author)

When spamming the request with v0.6.0, the issue can still occur from time to time.

Try adding something like this in BatchFn::load:

if keys.len() < 2 {
    // A single-key batch means batching failed for this request.
    log::error!("not batched");
    std::process::exit(1);
}

then keep Ctrl+Enter pressed in GraphiQL; within about 20 seconds I get a crash with:

 DEBUG dataloader_bug::graphql > load batch ["1", "3", "2"]
 DEBUG dataloader_bug::graphql > load batch ["friend of 2", "friend of 1"]
 DEBUG dataloader_bug::graphql > load batch ["friend of 3"]
 ERROR dataloader_bug::graphql > not batched

IcanDivideBy0 (Author)

Sorry, I can't find a more consistent way to reproduce this...

cksac (Owner) commented Mar 24, 2020

I managed to fix the juniper batch load issue by calling yield_now() more times.
I updated https://github.com/IcanDivideBy0/dataloader-bug to use version 0.8.1
with user_loader: Loader::with_yield_count(UserBatch, 100) (see the sketch below); a larger yield count allows more requests to be appended to the batch, but waits longer before the actual load.
However, async-graphql is still not able to do batch loading.
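
For reference, the relevant construction looks roughly like this (UserBatch comes from the reproduction repo; the surrounding setup is assumed):

// Built once per request, alongside the GraphQL context. Each poll of
// a pending batch yields up to 100 times before the load actually
// runs, so sibling resolvers get a chance to enqueue their keys.
let user_loader = Loader::with_yield_count(UserBatch, 100);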

As for the batch size: I think a load batch size anywhere between 1 and max_batch_size is expected, since the loader performs the load as soon as a caller demands the value.

IcanDivideBy0 (Author)

@cksac great, the multiple-yields trick did the job! Although it feels incredibly hacky, I don't see any other way.
Thanks

cksac closed this as completed on Apr 4, 2020