[Feature] Support for configure magic on Spark Python Kubernetes Kernels (WIP) #1105

rahul26goyal · 2022-06-07T17:22:02Z

Problem Statement

With JEG running on a remote machine and handling the kernel life cycle, notebook users can no longer change the kernel specs / properties locally to update the configuration with which the Spark kernel comes up. There are various use cases where users want to experiment with different Spark configurations to arrive at the final configs which best suit their workload. These configs might also vary from one notebook to another based on the workload the notebook is running. JEG is also used as a multi-tenant service where each user might want to tweak the kernel based on their scenario.
Thus, there is a need for users to be able to update the kernel / spark properties at runtime from the notebook.

Feature Description

The changes proposed in this PR are to add support for a well known magic %%configure -f {} which allows Notebook users to change the spark properties at runtime without having to create / update any kernel spec file. This would allow users to change spark driver, executor resources (like cores, memory), enable / disable spark configuration etc.

Example: The snippet below can be copied into a notebook cell to update the various Spark properties associated with the current kernel.

%%configure -f 
{
  "driverMemory": "3G",
  "driverCores" : "2",
  "executorMemory" : "3G",
  "executorCores" : "2",
  "numExecutors" : 5,
  "conf" : {
      "spark.kubernetes.driver.label.test": "test-label"
  }
}
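
As a rough illustration of how such a payload could be turned into Spark options, the sketch below maps the top-level keys to their standard Spark property names and renders everything as `--conf` flags. The helper name `to_spark_opts` and the mapping table are assumptions made for this example; they are not taken from the code in this PR.

```python
import json

# Assumed mapping from the magic's top-level keys to standard Spark properties.
KEY_TO_SPARK_PROPERTY = {
    "driverMemory": "spark.driver.memory",
    "driverCores": "spark.driver.cores",
    "executorMemory": "spark.executor.memory",
    "executorCores": "spark.executor.cores",
    "numExecutors": "spark.executor.instances",
}


def to_spark_opts(cell_body: str) -> str:
    """Render a %%configure JSON body as a string of --conf flags (hypothetical helper)."""
    payload = json.loads(cell_body)
    properties = {}
    for key, value in payload.items():
        if key == "conf":
            continue  # handled separately below
        if key not in KEY_TO_SPARK_PROPERTY:
            raise ValueError(f"Unsupported configure key: {key}")
        properties[KEY_TO_SPARK_PROPERTY[key]] = str(value)
    # Entries under "conf" are passed through as-is and override the mapped keys.
    properties.update({k: str(v) for k, v in (payload.get("conf") or {}).items()})
    return " ".join(f"--conf {name}={value}" for name, value in properties.items())
```

A string of this shape is the kind of value that could be carried in the `KERNEL_EXTRA_SPARK_OPTS` env variable mentioned under Implementation Details.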

Implementation Details

The following changes are made at a high level:

I have introduced a new API on JEG, POST api/configure/<kernel_id>, which accepts a payload similar to the create kernel API. This API currently supports updating the ["KERNEL_EXTRA_SPARK_OPTS", "KERNEL_LAUNCH_TIMEOUT"] env variables.
The above API tries to restart the same kernel with the updated configuration. This is done because we want to keep the kernel_id the same and give a smooth end-user experience.
Once the old kernel goes away and a replacement comes up, we also need to refresh the ZMQ sockets to establish the connection with the new kernel so that existing active websocket connections from notebook / JupyterLab UI clients can continue to work. Hooks have been introduced to handle this.
Further, in order to complete the usual Jupyter kernel messaging handshake, we fire the missing ZMQ messages from JEG to the websocket clients. Example: to mark the completion of the current cell, we need to send the execute_reply message, and to mark the kernel idle, we need to send the kernel status=idle message, etc. These messages are pre-generated on the kernel and sent to JEG while making the API call to refresh the kernel.
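
To make the last point more concrete, here is a minimal sketch of what pre-generating those two messages could look like, following the general shape of Jupyter messaging protocol messages (header, parent_header, metadata, content). The function names, the `username` value, and the protocol version string are assumptions for illustration; the PR's actual code may build these messages differently.

```python
import uuid
from datetime import datetime, timezone


def build_message(msg_type, content, parent_header, session_id):
    """Build a Jupyter-protocol style message dict (hypothetical helper)."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session_id,
            "username": "kernel",  # assumed value
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": msg_type,
            "version": "5.3",  # assumed protocol version
        },
        # Tie the message to the cell execution that triggered the magic.
        "parent_header": parent_header,
        "metadata": {},
        "content": content,
    }


def build_completion_messages(parent_header, session_id, execution_count):
    """Return the messages a client expects before it treats the cell as finished."""
    reply = build_message(
        "execute_reply",
        {"status": "ok", "execution_count": execution_count, "user_expressions": {}},
        parent_header,
        session_id,
    )
    idle = build_message("status", {"execution_state": "idle"}, parent_header, session_id)
    return [reply, idle]
```

Because both messages reuse the parent header of the original execute request, a websocket client can match them to the cell that ran the magic even though the kernel that produced them has since been replaced.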

I will update more details about the changes and add some diagrams.

Testing

Basic sanity testing done.

Note

Opening this PR for some early feedback and discussion on the changes.

for more information, see https://pre-commit.ci

kevin-bates · 2022-06-07T22:40:33Z

etc/kernel-launchers/python/scripts/configure_magic.py

+            response = requests.post(
+                self.update_kernel_url,
+                data=json.dumps(payload_dict, default=str),
+                headers=self.headers,
+                verify=False,
+            )


We should address this alert somehow.

sure...we can discuss the approach to resolve this.
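
If the alert refers to `verify=False` (which disables TLS certificate verification), one possible direction is to keep verification on by default and let deployments point to a CA bundle. The sketch below assumes a hypothetical `EG_CA_BUNDLE` environment variable; neither that name nor the `resolve_verify` helper exists in this PR.

```python
import os


def resolve_verify(env=None):
    """Return a value suitable for the `verify` argument of `requests.post`.

    A CA bundle path is returned when one is configured; otherwise True,
    which keeps the default certificate verification enabled.
    """
    env = os.environ if env is None else env
    ca_bundle = env.get("EG_CA_BUNDLE", "").strip()  # hypothetical variable name
    return ca_bundle or True
```

The call site would then pass `verify=resolve_verify()` instead of `verify=False`, so clusters with a private CA can still connect without turning verification off.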

etc/kernel-launchers/python/scripts/configure_magic.py

+            logger.debug(f"Payload to refresh: {magic_payload}")
+            result = self.update_kernel(magic_payload)
+            return result
+        return "Done"


etc/kernel-launchers/python/scripts/configure_magic.py

+        else:
+            logger.error(f"Either key or value is not defined. {env_key}, {env_value}")
+
+    def update_kernel(self, payload_dict):


etc/kernel-launchers/python/scripts/configure_magic.py

@@ -0,0 +1,296 @@
+import base64
+import json


etc/kernel-launchers/python/scripts/configure_magic.py

@@ -0,0 +1,296 @@
+import base64


kevin-bates

This is really interesting @rahul26goyal - thank you. I'm not sure we should hold the 3.0 release for this as this seems additive. Most of the comments are little things.

One thing I'm not sure about is that this only applies to Python kernels, and only those configured for Spark (at the moment, although it could be more general).

I'd also like to look into incorporating the RemoteZMQChannelsHandler in a general way. I haven't given it a real close look, but do you think that could be useful even outside this particular "configure/restart" feature?

kevin-bates · 2022-06-07T21:58:46Z

enterprise_gateway/services/kernels/handlers.py

+        payload = self.get_json_body()
+        self.log.debug(f"Request payload: {payload}")
+        if payload is None:
+            self.log.info("Empty payload in the request body.")


These info messages aren't necessary since the message returned to the client will indicate where it came from.

kevin-bates · 2022-06-07T21:59:49Z

enterprise_gateway/services/kernels/handlers.py

+            self.log.info("Empty payload in the request body.")
+            self.finish(
+                json.dumps(
+                    {"message": "Empty payload received. No operation performed on the Kernel."},


Suggested change

{"message": "Empty payload received. No operation performed on the Kernel."},

{"message": f"Empty payload received. No operation performed on kernel: {kernel_id}"},

kevin-bates · 2022-06-07T22:03:20Z

enterprise_gateway/services/kernels/handlers.py

+                "An existing restart request is still in progress. Skipping this request."
+            )
+            raise web.HTTPError(
+                400, "Duplicate Kernel update request received for Id: %s." % kernel_id


Suggested change

400, "Duplicate Kernel update request received for Id: %s." % kernel_id

400, "Duplicate configure kernel request received for kernel: %s." % kernel_id

enterprise_gateway/services/kernels/remotemanager.py

rahul26goyal · 2022-06-08T19:20:17Z

@kevin-bates :
I need help deciding the right terminology for the operation we are performing on the kernel using this new configure API:

do we call it "refreshing kernel" or "re-configuring kernel"?
do we change the API to api/refresh/<kernel_id> and call this the "refreshing kernel" operation?

we need to use this term in both logs and response messages.
please give this some thought

etc/kernel-launchers/python/scripts/configure_magic.py

+        except ValueError as ve:
+            logger.exception(f"Could not parse JSON object from input {cell}: error: {ve}.")
+            return ConfigureMagic.INVALID_JSON_PAYLOAD
+        except JSONDecodeError as jde:
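
One detail worth noting about the handler order in this hunk: `json.JSONDecodeError` is a subclass of `ValueError`, so an `except ValueError` clause placed first also catches JSON decoding errors, and a later `except JSONDecodeError` clause never runs. The sketch below shows the ordering that lets both clauses be reached; `parse_cell` and its return strings are hypothetical placeholders, not the PR's code.

```python
import json
from json import JSONDecodeError


def parse_cell(cell):
    """Parse a magic cell body, handling the more specific exception first."""
    try:
        return json.loads(cell)
    except JSONDecodeError:
        # Malformed JSON text lands here.
        return "invalid-json"
    except ValueError:
        # Any other ValueError (for example, undecodable bytes input) lands here.
        return "invalid-value"
```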


kevin-bates · 2022-06-13T17:26:47Z

@kevin-bates : I need help deciding the right terminology for the operation we are performing on the kernel using this new configure API:

do we call it "refreshing kernel" or "re-configuring kernel"?

do we change the API to api/refresh/<kernel_id> and call this the "refreshing kernel" operation?

we need to use this term in both logs and response messages. please give this some thought

I guess refresh seems a little easier to understand than reconfigure. Does this imply the magic name would change to %%refresh and does that conflict with existing magics? I think having the terminology match the magic name would be helpful.

I would also like to see the endpoint be under api/kernels rather than a sibling to api/kernels. Do you agree? If not, could you please help me understand why not? Is adding an endpoint under api/kernels violating some kind of convention?
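
For illustration only, a route nested under api/kernels could be expressed with a pattern like the one below. The `configure` path suffix and the kernel-id pattern are assumptions for this sketch; the final endpoint name was still under discussion at this point in the thread.

```python
import re

# Assumed UUID-like kernel id pattern (five hyphen-separated groups).
KERNEL_ID_REGEX = r"(?P<kernel_id>\w+-\w+-\w+-\w+-\w+)"

# Hypothetical route: the operation is a sub-resource of a specific kernel,
# sitting next to that kernel's other per-kernel actions.
CONFIGURE_ROUTE = re.compile(rf"^/api/kernels/{KERNEL_ID_REGEX}/configure$")


def extract_kernel_id(path):
    """Return the kernel id if the path matches the configure route, else None."""
    match = CONFIGURE_ROUTE.match(path)
    return match.group("kernel_id") if match else None
```

Nesting the route this way keeps every kernel-scoped operation under a single prefix, so clients and proxies that already handle paths below api/kernels would not need a separate rule for a sibling endpoint.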

kevin-bates · 2022-07-18T17:03:40Z

Hi @rahul26goyal - what is the status of this PR since it's been about 6 weeks since its last update and it seems there are a few things still to work out?

Supoprt for configure magic on Spark Python Kubernetes Native Kernels

2c1cfe2

rahul26goyal requested a review from kevin-bates June 7, 2022 17:22

[pre-commit.ci] auto fixes from pre-commit.com hooks

6780c5f

for more information, see https://pre-commit.ci

rahul26goyal requested review from lresende and davidbrochart June 7, 2022 17:22

github-advanced-security bot found potential problems Jun 7, 2022

kevin-bates reviewed Jun 7, 2022

rahul26goyal added enhancement spark labels Jun 8, 2022

rahul26goyal and others added 2 commits June 9, 2022 01:07

Addressing review comments

513ccc4

[pre-commit.ci] auto fixes from pre-commit.com hooks

fadb724

for more information, see https://pre-commit.ci

github-advanced-security bot found potential problems Jun 8, 2022

rahul26goyal and others added 4 commits June 9, 2022 01:23

Addressing review comments

994fc08

[pre-commit.ci] auto fixes from pre-commit.com hooks

a49308d

for more information, see https://pre-commit.ci

Addressing review comments

34672a4

[pre-commit.ci] auto fixes from pre-commit.com hooks

46eef00

for more information, see https://pre-commit.ci
