
Unable to find subscription with identifier: {"channel":"GraphqlChannel","channelId":"18bccb1dbd6"} #4702

Open
Amnesthesia opened this issue Nov 15, 2023 · 6 comments


@Amnesthesia

Describe the bug

This is possibly not a bug in graphql-ruby, and I apologize in advance if that's the case, but I've run out of ideas about what it could be, so I'm opening this report in case it's something anybody else has run into before, or a common problem that I just can't find much information on.

We see a lot of these errors in our AppSignal monitoring, but I can't reproduce them locally. We don't know whether they happen when we deploy new versions while somebody is using the site (over websockets that are now disconnected?), or what else might be going on.

Versions

graphql version: 2.1.1
rails (or other framework): 7.0.8
other applicable versions (graphql-batch, etc.):

graphql-pro (1.24.11)
graphql (2.1.1)
graphql-metrics (5.0.7)
graphql-rails_logger (1.2.4)
rubocop-graphql (1.4.0)

Steps to reproduce

I'm sorry, but honestly, no clue. We can't reproduce this locally; it only happens in our production and staging environments (on Amazon Fargate, using ElastiCache Redis).

Expected behavior

The error should not be raised; the subscription should be found.

Actual behavior

The "Unable to find subscription with identifier" error from the title is raised.

Exception backtrace:

lib/action_cable/connection/subscriptions.rb:75 find
lib/action_cable/connection/subscriptions.rb:47 remove
lib/action_cable/connection/subscriptions.rb:18 execute_command
lib/action_cable/connection/base.rb:89 dispatch_websocket_message
lib/action_cable/server/worker.rb:59 block in invoke
lib/active_support/callbacks.rb:118 block in run_callbacks
lib/semantic_logger/base.rb:190 block in tagged
lib/semantic_logger/semantic_logger.rb:346 tagged
lib/semantic_logger/base.rb:202 tagged
lib/rails_semantic_logger/extensions/action_cable/tagged_logger_proxy.rb:8 tag
lib/action_cable/server/worker/active_record_connection_management.rb:16 with_database_connections
lib/active_support/callbacks.rb:127 block in run_callbacks
lib/action_cable/engine.rb:71 block (4 levels) in <class:Engine>
lib/active_support/execution_wrapper.rb:92 wrap
lib/action_cable/engine.rb:66 block (3 levels) in <class:Engine>
lib/active_support/callbacks.rb:127 instance_exec
lib/active_support/callbacks.rb:127 block in run_callbacks
lib/active_support/callbacks.rb:138 run_callbacks
lib/action_cable/server/worker.rb:42 work
lib/action_cable/server/worker.rb:58 invoke
lib/action_cable/server/worker.rb:53 block in async_invoke
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:352 run_task
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:343 block (3 levels) in create_worker
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:334 loop
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:334 block (2 levels) in create_worker
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:333 catch
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:333 block in create_worker
@Amnesthesia
Author

Amnesthesia commented Nov 15, 2023

Actually, since this is raised in subscriptions.rb when the identifier can't be found in the @subscriptions array, could it be because we run multiple containers, and the user may be subscribed on another container?

@rmosolgo
Owner

Hi! Sorry for the trouble and thanks for the detailed report. What I expect on re-deploy is for clients to reconnect, re-sending GraphQL subscriptions as needed. A couple of quick thoughts:

  • What caused this error to start occurring? Did you just add GraphQL subscriptions to your app, or did something else change that caused these to appear?
  • Reviewing the backtrace above, I don't see anything GraphQL-related. Is there any indication that this is GraphQL-specific? (Are there other uses of ActionCable that work properly in your application?)
  • You mention that you can't replicate the error locally. If you're using the app in production during a deploy, do you see anything happening in your browser?
  • In AppSignal, can you see any trends in the occurrences of this bug? (Maybe by browser? I can imagine browsers handling this event in different ways...)

@Amnesthesia
Author

@rmosolgo Right, so to address your questions (and what I think is going on here as well):

  1. This error has been occurring for a very long time for us, but it doesn't seem to cause any issues. We've had a lot of ideas about what it could be, and none of them have turned out to be it. We haven't really prioritized it because, like I said, it seems pretty harmless.
  2. You're right, and I realized after posting this that it may not be graphql-related but actioncable-related.
  3. & 4. We're not seeing any pattern by browser type for this; it seems to be a mixed bag.

I think it's that we have this in our GraphQLChannel:

def subscribed
  @subscription_ids = []
end

This starts each instance with a blank slate of connections, so if we're scaling out and running 3-4 instances, they're all pushing to the same Redis. One instance may pick up a subscriptions.trigger(...) from Puma on another instance, but not have the gid in its @subscription_ids, because the list isn't shared.

Perhaps @subscription_ids should point to Redis instead, to maintain a shared list of subscribed clients?
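For reference, the rest of the channel follows the pattern from the graphql-ruby ActionCable docs, roughly like this sketch (MySchema and the context contents are placeholders, not our literal code):

class GraphqlChannel < ApplicationCable::Channel
  def subscribed
    # Per-connection (and therefore per-process) list of subscription ids.
    @subscription_ids = []
  end

  def execute(data)
    result = MySchema.execute(
      query: data["query"],
      context: { channel: self },
      variables: data["variables"] || {},
      operation_name: data["operationName"],
    )

    # If this query registered a subscription, remember its id so it can
    # be cleaned up when the client disconnects.
    if result.context[:subscription_id]
      @subscription_ids << result.context[:subscription_id]
    end

    transmit({ result: result.to_h, more: result.subscription? })
  end

  def unsubscribed
    @subscription_ids.each do |sid|
      MySchema.subscriptions.delete_subscription(sid)
    end
  end
end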

@rmosolgo
Owner

My intention is for @subscription_ids to be a per-instance list of subscribers to that instance. So it works like:

  • subscriptions.trigger is called in one Ruby process; it calls .broadcast to notify other processes of the trigger:

    # An event was triggered; push the data over ActionCable.
    # Subscribers will re-evaluate locally.
    def execute_all(event, object)
      stream = stream_event_name(event)
      message = @serializer.dump(object)
      @action_cable.server.broadcast(stream, message)
    end

  • meanwhile, any ActionCable process that received a subscription should have set up a stream_from for that subscriber:

    channel.stream_from(stream_event_name(initial_event), coder: @action_cable_coder) do |message|

  • So, when a process receives a notification from a trigger, it runs that stream_from block and generates a payload. Then it broadcasts again, using the specific subscription ID, to update the client:

    @action_cable.server.broadcast(stream_subscription_name(subscription_id), payload)
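Put together, here's a runnable toy model of that two-hop flow in plain Ruby (no Rails; the in-memory PUBSUB hash stands in for the Redis-backed adapter, and the payload building is faked):

require "json"

# Stand-in for the pub/sub adapter: every registered handler sees every
# broadcast, just like all ActionCable processes do via Redis.
PUBSUB = Hash.new { |h, k| h[k] = [] }

def broadcast(stream, message)
  PUBSUB[stream].each { |handler| handler.call(message) }
end

def stream_from(stream, &handler)
  PUBSUB[stream] << handler
end

# "Process B" holds the subscriber, so it registered a stream for the
# event name when the subscription arrived:
subscription_id = "graphql-subscription:abc123"
stream_from("graphql-event:postUpdated") do |message|
  # Re-evaluate the subscription query locally (faked here)...
  payload = { data: { postUpdated: JSON.parse(message) } }
  # ...then hop 2: broadcast the finished result to the one client,
  # using its specific subscription id:
  broadcast(subscription_id, payload)
end

# The client's own stream:
stream_from(subscription_id) { |payload| puts "client got: #{payload}" }

# "Process A" is where subscriptions.trigger runs; hop 1 broadcasts the
# raw event to all processes:
broadcast("graphql-event:postUpdated", { id: 1, title: "Hello" }.to_json)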

I hope that helps!

@Amnesthesia
Author

@rmosolgo Thank you for taking the time to explain this a bit, appreciate it! :) We still haven't worked this issue out. We realized we weren't running ActionCable standalone in production, and thought this might be solved by moving ActionCable to its own standalone service, but (I presume) because we also scale that out, this "issue" still persists.

We don't really know to what extent it's an issue, or if it is an issue at all. E.g., are we losing a certain percentage of websocket pushes, or are these errors completely harmless?

So, when a process receives a notification from a trigger, it runs that stream_from block and generates a payload. Then it broadcasts again, using the specific subscription ID, to update the client

In this sense, shouldn't an ActionCable process that does not have the subscription_id in its list of subscribers simply not pick this up? Do you see any way, in a multi-ActionCable setup, that processes would routinely try to handle messages meant for subscriber_ids they don't have?

@Amnesthesia
Author

Amnesthesia commented Apr 4, 2024

@rmosolgo We've moved away from scaling ActionCable horizontally and have instead scaled it up, significantly increasing the machine's spec and running only one machine. We are still seeing these errors every day, and we're quite lost as to where they come from.

We've also reduced our identified_by attributes to a single one, identified_by :_strategy, where _strategy is a custom class managing the JWT token and current user (a rough sketch below).
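A minimal sketch of that connection setup; AuthStrategy and its internals here are hypothetical stand-ins for our custom class:

module ApplicationCable
  class Connection < ActionCable::Connection::Base
    # A single identifier: the strategy object wraps JWT verification
    # and the current-user lookup.
    identified_by :_strategy

    def connect
      # AuthStrategy is a hypothetical placeholder for the custom class
      # described above; it reads the token from the request.
      self._strategy = AuthStrategy.new(request)
      reject_unauthorized_connection unless self._strategy.valid?
    end
  end
end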
