Sometimes cannot connect to kernel through websocket in replication availability mode #1373

Open
edwardzjl opened this issue Mar 4, 2024 · 0 comments

Description

I deployed enterprise gateway on k8s in replication availability mode (3 replicas) with file session persistence, and ensured that sessionAffinity is on for enterprise gateway service.

I have another app that communicates with EG: it performs CRUD operations on kernels and establishes connections to them through EG.

The issue is this: when I send a GET kernel request to EG, EG automatically loads the saved sessions, so the state is up to date no matter which pod I am routed to. Here is the relevant part of EG's log:

[D 2024-03-04 08:46:11.428 EnterpriseGatewayApp] Loading saved session(s) from /data/kernel_sessions/573e4281-0442-421b-8a25-760549854902.json
[D 2024-03-04 08:46:11.481 EnterpriseGatewayApp] Connecting to: tcp://$IP:$PORT
[D 2024-03-04 08:46:11.481 EnterpriseGatewayApp] Connecting to: tcp://$IP:$PORT
[I 240304 08:46:11 web:2271] 200 GET /api/kernels/573e4281-0442-421b-8a25-760549854902 ($IP) 88.69ms

However, when I open a websocket connection, EG does not load the saved sessions, which results in intermittent websocket 404s (whenever I am not routed to the correct pod):

[D 2024-03-04 03:14:42.377 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/6bb19c3b-26f6-4b4d-bfc9-be831a9e648f/channels
[W 2024-03-04 03:14:42.382 EnterpriseGatewayApp] No session ID specified
[W 240304 03:14:42 web:1796] 404 GET /api/kernels/6bb19c3b-26f6-4b4d-bfc9-be831a9e648f/channels ($IP): Kernel does not exist: 6bb19c3b-26f6-4b4d-bfc9-be831a9e648f
[W 240304 03:14:42 web:2271] 404 GET /api/kernels/6bb19c3b-26f6-4b4d-bfc9-be831a9e648f/channels ($IP) 6.49ms

The same happens for DELETE kernel:

[D 2024-03-04 09:13:01.411 EnterpriseGatewayApp] activity on d09839bb-10b6-42ca-bff8-6051d046d709: execute_input
[D 2024-03-04 09:13:02.318 EnterpriseGatewayApp] activity on d09839bb-10b6-42ca-bff8-6051d046d709: stream
...
[W 240304 09:13:02 web:1796] 404 DELETE /api/kernels/62eca2ab-6f9e-4471-94bb-7f51e83ecc60 ($IP): Kernel does not exist: 62eca2ab-6f9e-4471-94bb-7f51e83ecc60
[W 240304 09:13:02 web:2271] 404 DELETE /api/kernels/62eca2ab-6f9e-4471-94bb-7f51e83ecc60 ($IP) 907.86ms

So I think there are 2 problems here:

  • Kubernetes sessionAffinity is somehow not working in my cluster (I need to dig deeper into this when I have time)
    • However, even if I can make sessionAffinity work, there may still be edge cases when the sessionAffinity timeout expires.
  • The behaviour of GET differs from other methods: GET loads saved sessions automatically, while websocket connect and DELETE do not

I also noticed that there's a brief note about 'manual reconnect' in the documentation, but I haven't figured out how to do it in my setup.

I'm new to EG, so I'm not sure whether this is a real bug or by design. But if GET loads saved sessions automatically, maybe other endpoints should too.
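
Until that is settled, one possible client-side workaround is to issue a GET for the kernel right before opening the websocket and to retry when the upgrade returns 404. The sketch below only shows the retry logic and relies on the behaviour observed in the logs above (GET loads the saved session on the pod that serves it). The function name `connect_with_reload` and the two callables are hypothetical and not part of EG's API; each callable is assumed to wrap the real HTTP or websocket call and return its status code. It only helps when the GET and the following websocket request land on the same pod, so it is a mitigation rather than a fix.

```python
from typing import Callable


def connect_with_reload(
    kernel_id: str,
    get_kernel: Callable[[str], int],
    open_channels: Callable[[str], int],
    attempts: int = 5,
) -> bool:
    """Return True once the websocket upgrade stops returning 404.

    get_kernel and open_channels are hypothetical wrappers supplied by the
    caller; each takes a kernel ID and returns the HTTP status code.
    """
    for _ in range(attempts):
        # GET /api/kernels/<id>: observed to load the saved session
        # on whichever pod serves the request.
        if get_kernel(kernel_id) != 200:
            continue
        # GET /api/kernels/<id>/channels (websocket upgrade request).
        if open_channels(kernel_id) != 404:
            return True
    return False
```

The caller would pass in functions built on whatever HTTP and websocket client the app already uses, and treat a `False` result as "kernel not reachable on any pod tried".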

Context

  • Jupyter Enterprise Gateway version: 3.2.2

Troubleshoot Output

I added 3 environment variables and 1 volume to the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: enterprise-gateway
  namespace: jupyter
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: enterprise-gateway
          env:
          ...
          - name: EG_AVAILABILITY_MODE
            value: replication
          # 2 envs related to session persistence
          - name: EG_KERNEL_SESSION_PERSISTENCE
            value: "True"
          - name: EG_PERSISTENCE_ROOT
            value: /data
          volumeMounts:
            - name: persistence-root
              mountPath: /data
              readOnly: false
      volumes:
        - name: persistence-root
          persistentVolumeClaim:
            claimName: persistence-root

I created a PVC to store session data:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persistence-root
  namespace: jupyter
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
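
To check that every replica really sees the same persisted sessions on this shared volume, a small helper like the one below could be run inside each pod and the outputs compared. It is a sketch that assumes the layout seen in the GET log above (`/data/kernel_sessions/<kernel-id>.json`); the function name `persisted_kernel_ids` is hypothetical and not part of EG.

```python
from pathlib import Path


def persisted_kernel_ids(persistence_root: str = "/data") -> list[str]:
    """List kernel IDs that have a saved session file under the root.

    Assumes one JSON file per kernel in <root>/kernel_sessions, as the
    "Loading saved session(s) from ..." log line suggests.
    """
    sessions_dir = Path(persistence_root) / "kernel_sessions"
    if not sessions_dir.is_dir():
        return []
    # The file stem is the kernel ID, e.g. 573e4281-....json -> 573e4281-...
    return sorted(p.stem for p in sessions_dir.glob("*.json"))


if __name__ == "__main__":
    for kernel_id in persisted_kernel_ids():
        print(kernel_id)
```

If a kernel ID that returns 404 on websocket connect or DELETE shows up in this list on every pod, the session file is available and the pod simply has not loaded it.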

And I checked that sessionAffinity is on in the enterprise-gateway service:

apiVersion: v1
kind: Service
metadata:
  name: enterprise-gateway
  namespace: jupyter
spec:
  ...
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800