Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AsyncDataloader freezes #4767

Open
fazo96 opened this issue Jan 9, 2024 · 7 comments
Open

AsyncDataloader freezes #4767

fazo96 opened this issue Jan 9, 2024 · 7 comments

Comments

@fazo96
Copy link

fazo96 commented Jan 9, 2024

Describe the bug

When using AsyncDataloader instead of the normal one, operations that involve a Source will freeze and the HTTP request never completes.

Versions

graphql version: 2.2.4
rails (or other framework): 7.1.2
graphql-pro version: 1.25.2
puma version: 6.4
async version: 2.8

GraphQL schema

This seems to happen when using any kind of dataloader source.

GraphQL query

Any query that involves using a dataloader source has this problem.

Steps to reproduce

Enable the async dataloader on a rails 7.1 project that uses dataloader sources, then try doing a query that involves reading from a source

Expected behavior

The query works correctly, as it does with the regular dataloader.

Actual behavior

The HTTP request and ActiveRecord read freeze. On PostgreSQL, the query appears stuck with wait status ClientRead when checking pg_stat_activity.

On the rails process, the HTTP request is never resolved and the ruby code is frozen in results.find { |r| r.id == id } in the fetch method of the Source.

Additional context

The problem occurs regardless of isolation level being set to fiber or not. It only occurs with AsyncDataloader and never occurs with the regular Dataloader.

I looked at other issues and followed the development of the feature and it seems it works for other people, so I am looking for some help to figure out why it does not work for us

@rmosolgo
Copy link
Owner

rmosolgo commented Jan 9, 2024

Hi! Yes, I don't think it freezes for any operation that uses a Dataloader source, but maybe we can narrow down the problem and find out why it's not working for you. Could you please share:

  • The smallest GraphQL query you can find that replicates this bug
  • The Ruby source code for the fields and dataloader sources involved in it
  • The Ruby stack trace right before the line of code that freezes (eg using puts caller)

That will give a lot of information about what's going on when it freezes and maybe we can work up a replication script to figure out what's going wrong 👍

@fazo96
Copy link
Author

fazo96 commented Jan 9, 2024

Here's the stack trace:

about to do 'puts caller' and then call profiles.find
/workspaces/flexhire/backend/app/graphql/sources/profile_for_user.rb:5:in `map'
/workspaces/flexhire/backend/app/graphql/sources/profile_for_user.rb:5:in `fetch'
/usr/local/rvm/gems/ruby-3.2.2/gems/graphql-2.2.4/lib/graphql/dataloader/source.rb:136:in `run_pending_keys'
/usr/local/rvm/gems/ruby-3.2.2/gems/graphql-2.2.4/lib/graphql/dataloader/async_dataloader.rb:79:in `each'
/usr/local/rvm/gems/ruby-3.2.2/gems/graphql-2.2.4/lib/graphql/dataloader/async_dataloader.rb:79:in `block in spawn_source_task'
/usr/local/rvm/gems/ruby-3.2.2/gems/async-2.8.0/lib/async/task.rb:161:in `block in run'
/usr/local/rvm/gems/ruby-3.2.2/gems/async-2.8.0/lib/async/task.rb:331:in `block in schedule'
  Profile Load (3.5ms)  SELECT "profiles".* FROM "profiles" WHERE "profiles"."user_id" = $1  [["user_id", 4607]]
  ↳ app/graphql/sources/profile_for_user.rb:8:in `block in fetch'

The query:

query {
  currentUser {
    profile { id }
  }
}

The relevant code:

# user_type.rb
class UserType
  field :profile, ProfileType, description: "for members only. Associated profile data"

  # ... omited irrelevant code

  def profile
    context.dataloader.with(Sources::ProfileForUser).load(object.id)
  end
end

# profile_for_user.rb
class Sources::ProfileForUser < GraphQL::Dataloader::Source
  def fetch(users_ids)
    profiles = Profile.where(user_id: users_ids)

    users_ids.map do |user_id|
      puts "about to do 'puts caller' and then call profiles.find"
      puts caller
      profiles.find { |p| p.user_id == user_id }
    end
  end
end

The profiles.find call is the one that triggers the lazy evaluation of the SQL query in active record, so I am thinking the problem might not be in graphql-ruby. However using active record in other ways seems to work

@fazo96
Copy link
Author

fazo96 commented Jan 9, 2024

I actually noticed it didn't freeze. It eventually finished executing

image

It took over 3 minutes. This HTTP request takes around 90ms normally

EDIT: also I notice that when it "freezes", one logical CPU thread of my machine is at 100% and the process causing it is the puma web server. So something weird is happening in ruby.

In the logs, active record reports the queries as taking only a few ms

@rmosolgo
Copy link
Owner

rmosolgo commented Jan 9, 2024

Interesting -- thanks for sharing all those details. My first thought was, maybe it was stuck in the while loop here:

while (task = source_tasks.shift || spawn_source_task(root_task, sources_condition))
if task.alive?
root_task.yield # give the source task a chance to run
next_source_tasks << task
end
end

But, I see that there's a root_task.yield which is meant to address this exact situation: tell Async's event loop to hand off to another task (to see if it's finished).

Unfortunately my suggestion for puts caller was worthless; I forgot that it'd be in a brand new caller and therefore have nothing useful in the backtrace 😖

I'm going to try to replicate this locally. In the meantime, I'm wondering if you could try a few experiments to help us isolate exactly where it's going wrong:

  • Remove the dataloader.with(...).load(...) call, and instead, call Profile.find directly. Does that work as expected? (If so, then we can be sure it has to do with the AsyncDataloader -> Source flow.)
  • When using dataloader.with(...).load(...), what if you change def fetch to call users_ids.map { |id| Profile.find(id) } directly? (Obviously that's bad ActiveRecord, but I'm curious if it behaves differently with a direct query.)
  • Also, could you try a rbspy snapshot while the query is running? https://rbspy.github.io/using-rbspy/snapshot.html That would let us sneak a peek a the top-level backtrace while it's doing ... whatever it's doing!

Let me know what you find -- I'll try to replicate on my end in the meantime!

@rmosolgo
Copy link
Owner

rmosolgo commented Jan 9, 2024

I wrote a replication script, but it worked fine:

Querying for currentUser.profile.id with `AsyncDataloader`

require 'bundler/inline'

gemfile do
  source "https://rubygems.org"
  gem "graphql", "2.2.4"
  gem "rails", "7.1.2"
  gem "postgresql"
  gem "async", "~>2.8.0"
end

require "active_record"

ActiveSupport::IsolatedExecutionState.isolation_level = :fiber

ActiveRecord::Base.establish_connection("postgres://postgres:@localhost/postgres")

ActiveRecord::Schema.define do
  self.verbose = false
  create_table :profiles, force: true do |t|
    t.integer :user_id
  end
end

class Profile < ActiveRecord::Base
end

Profile.create!(user_id: 5)

class MySchema < GraphQL::Schema
  class ProfileSource < GraphQL::Dataloader::Source
    def fetch(users_ids)
      profiles = ::Profile.where(user_id: users_ids)
      users_ids.map do |uid|
        profiles.find { |pr| pr.user_id == uid }
      end
    end
  end

  class Profile < GraphQL::Schema::Object
    field :id, ID
  end
  class User < GraphQL::Schema::Object
    field :profile, Profile

    def profile
      dataloader.with(ProfileSource).load(object.id)
    end
  end
  class Query < GraphQL::Schema::Object
    field :current_user, User
  end
  query(Query)
  use GraphQL::Dataloader::AsyncDataloader
end

query_str = <<-GRAPHQL
{
  currentUser {
    profile {
      id
    }
  }
}
GRAPHQL

data = { current_user: OpenStruct.new(id: 5) }
pp MySchema.execute(query_str, root_value: data).to_h
# {"data"=>{"currentUser"=>{"profile"=>{"id"=>"1"}}}}

So... we have to find some more "moving parts" to identify what's going wrong here!

@fazo96
Copy link
Author

fazo96 commented Jan 10, 2024

@rmosolgo I ran those tests

Remove the dataloader.with(...).load(...) call, and instead, call Profile.find directly - Works

When using dataloader.with(...).load(...), what if you change def fetch to call users_ids.map { |id| Profile.find(id) } directly? FAIL

For this one I used this implementation:

class Sources::ProfileForUser < GraphQL::Dataloader::Source
  def fetch(users_ids)
    users_ids.map do |user_id|
      Profile.find_by(user_id:)
    end
  end
end

And it doesn't work, it gets stuck on line 4: Profile.find_by(user_id:) - seems that active record operations will hang inside a dataloader source

here is a snapshot while the query is running:

<main> - /workspaces/flexhire/backend/bin/rails:4
require - /usr/local/rvm/gems/ruby-3.2.2/gems/bootsnap-1.12.0/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:34
require [c function] - (unknown)
<main> - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/commands.rb:18
invoke - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/command.rb:83
with_argv - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/command.rb:152
block in invoke - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/command.rb:75
perform - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/command/base.rb:74
dispatch - /usr/local/rvm/gems/ruby-3.2.2/gems/thor-1.3.0/lib/thor.rb:528
invoke_command - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/command/base.rb:181
invoke_command - /usr/local/rvm/gems/ruby-3.2.2/gems/thor-1.3.0/lib/thor/invocation.rb:129
run - /usr/local/rvm/gems/ruby-3.2.2/gems/thor-1.3.0/lib/thor/command.rb:38
perform - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/commands/server/server_command.rb:150
tap - unknown:92
block in perform - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/commands/server/server_command.rb:149
start - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/commands/server/server_command.rb:41
start - /usr/local/rvm/gems/ruby-3.2.2/gems/rack-2.2.8/lib/rack/server.rb:328
run - /usr/local/rvm/gems/ruby-3.2.2/gems/puma-6.4.0/lib/rack/handler/puma.rb:82
run - /usr/local/rvm/gems/ruby-3.2.2/gems/puma-6.4.0/lib/puma/launcher.rb:197
run - /usr/local/rvm/gems/ruby-3.2.2/gems/puma-6.4.0/lib/puma/single.rb:67
join [c function] - (unknown)

I am not getting any insight from this snapshot, however to be honest the ruby ecosystem never really clicked with me so I guess you will need to be patient 😄

@rmosolgo
Copy link
Owner

Ugh, bummer -- the snapshot just showed the top-level thread waiting for the Fiber to finish. Not helpful 😖

I'm not exactly sure where to look next. I'd say the Ruby ecosystem did click with me, but using Fibers this way is pretty new!

I'm going to think about more debugging options and follow up here.

In the meantime, I documented some other approaches for adding parallelism to GraphQL::Dataloader here: https://graphql-ruby.org/dataloader/parallelism.html. If try any of those out, please let me know how it goes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants