[SDK] Use Katib SDK for E2E Tests #2075

andreyvelich · 2023-01-04T19:02:41Z

Related: #2024, #2044

I used the Katib SDK to run the E2E tests.
I also made the following changes in the SDK:

I used a unified style for our Python APIs and Exceptions so that users can easily understand them. We need to create a script that automatically generates docs from our Katib Client. Currently, this doc is outdated: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/docs/KatibClient.md.
Are there any existing tools that we can use?
I introduced the following changes in the APIs:
- get_experiment_status -> get_experiment_conditions: returns the Experiment conditions.
- is_experiment_created: new API to check the Experiment status.
- is_experiment_running: new API to check the Experiment status.
- is_experiment_restarting: new API to check the Experiment status.
- is_experiment_succeeded: new API to check the Experiment status.
- is_experiment_failed: new API to check the Experiment status.
- wait_for_experiment_condition: new API to wait until the Experiment reaches a condition (similar to the Training Operator).
- edit_experiment_budget: new API to modify the Experiment budget in place.
- get_suggestion: split into get_suggestion and list_suggestions, similar to list_experiments.
- get_trial: gets the Trial CR object.
- get_optimal_hyperparameters: returns a V1beta1OptimalTrial object.
Please let me know what you think about the API style. Can we improve it further?
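
To illustrate the idea behind wait_for_experiment_condition, here is a minimal polling sketch. It is not the SDK implementation: the function name `wait_for_condition`, the `get_conditions` callable, and the dict-shaped conditions (with `type` and `status` keys) are hypothetical stand-ins for the client call and the Experiment condition objects.

```python
import time
from typing import Callable, Dict, List


def wait_for_condition(
    get_conditions: Callable[[], List[Dict[str, str]]],
    expected_condition: str,
    timeout: float = 600,
    polling_interval: float = 15,
) -> List[Dict[str, str]]:
    """Poll until a condition of the expected type has status "True"."""
    deadline = time.monotonic() + timeout
    while True:
        # get_conditions is a hypothetical stand-in for fetching the
        # Experiment conditions from the API server.
        conditions = get_conditions()
        for condition in conditions:
            if (
                condition["type"] == expected_condition
                and condition["status"] == "True"
            ):
                return conditions
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Timed out waiting for condition {expected_condition}"
            )
        time.sleep(polling_interval)
```

Returning the full condition list (rather than a boolean) lets the caller log how the Experiment got to the expected state.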

It would be great if you could start reviewing this.
Also, if you think this change is too big for the upcoming release, we can postpone it.

TODO: I need to update the examples that use the Katib SDK. I will do it during the week.

/hold

cc @gaocegege @johnugeorge @anencore94 @tenzen-y @kubeflow/wg-training-leads

sdk/python/v1beta1/kubeflow/katib/constants/constants.py

sdk/python/v1beta1/kubeflow/katib/utils/utils.py

review-notebook-app · 2023-01-06T15:05:17Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

andreyvelich · 2023-01-06T15:11:31Z

I've updated the "tune and train" and CMA-ES SDK examples.
We might also update the NAS example in the future.

tenzen-y · 2023-01-06T15:31:20Z

@andreyvelich Thanks for improving our E2E.
I'm going to review this PR now.

Also, it would be good to modify the E2E test to verify Python SDK operations in various Python versions, as discussed in #2057 (comment).

Although, we can follow up on that in another PR.

johnugeorge · 2023-01-09T04:44:06Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

+        max_trial_count = experiment.spec.max_trial_count + 1
+        parallel_trial_count = experiment.spec.parallel_trial_count + 1
+        print(
+            f"Restarting Experiment {exp_namespace}/{exp_name} "


Why does this restart the Experiment for random search?

@johnugeorge We also use the random search Experiment to test the LongRunning Experiment.

johnugeorge · 2023-01-09T04:46:28Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

+    verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)
+
+    # Describe the Experiment and Suggestion.
+    print(os.popen(f"kubectl describe experiment {exp_name} -n {exp_namespace}").read())


Can we add verbose as an optional parameter that can be enabled or disabled to control verbose logs (enabled by default)?

This will help in tests such as conformance tests, where we are only interested in the final Experiment success or failure.

Sure, let me enable it.

@johnugeorge I enabled logging for the Katib E2Es. What do you think?

tenzen-y · 2023-01-09T13:37:26Z

sdk/python/v1beta1/kubeflow/katib/api/katib_client.py

-                    )
-                )
+        try:
+            response = utils.FakeResponse(thread.get(constants.APISERVER_TIMEOUT))


I think it might help to make the timeout configurable. Can we add the duration until timeout to the function args?
WDYT?

def get_experiment(
    self,
    name: str,
    namespace: str = utils.get_default_target_namespace(),
    duration: int = constants.APISERVER_TIMEOUT,
):

Sure, let me update it.
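
As a rough sketch of what a configurable timeout could look like: the diff above calls `thread.get(...)` on an asynchronous result, which behaves like `multiprocessing.pool.AsyncResult.get(timeout)`. The helper below reproduces that pattern with the standard library only. The names `call_with_timeout` and `DEFAULT_TIMEOUT` are hypothetical and are not part of the Katib SDK.

```python
import multiprocessing
from multiprocessing.pool import ThreadPool

DEFAULT_TIMEOUT = 120  # Hypothetical default, in seconds.


def call_with_timeout(func, *args, timeout=DEFAULT_TIMEOUT):
    """Run func in a worker thread and wait at most `timeout` seconds."""
    pool = ThreadPool(processes=1)
    try:
        async_result = pool.apply_async(func, args)
        # AsyncResult.get raises multiprocessing.TimeoutError when the
        # result does not arrive within `timeout` seconds.
        return async_result.get(timeout)
    except multiprocessing.TimeoutError:
        raise TimeoutError(
            f"Timed out after {timeout}s waiting for {func.__name__}"
        ) from None
    finally:
        pool.terminate()
```

Exposing the timeout as a keyword argument with a default keeps existing callers unchanged while letting tests on slow clusters raise the limit.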

tenzen-y · 2023-01-09T13:57:02Z

sdk/python/v1beta1/kubeflow/katib/api/katib_client.py

        """

        try:
-            self.api_instance.delete_namespaced_custom_object(
+            self.custom_api.delete_namespaced_custom_object(
                constants.KUBEFLOW_GROUP,
                constants.KATIB_VERSION,
                namespace,
                constants.EXPERIMENT_PLURAL,
                name,
                body=client.V1DeleteOptions(),


Can we make client.V1DeleteOptions() configurable? That would let users set grace_period_seconds and more.
WDYT?

Yes, I think we can allow it.
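
One possible shape for this, as a sketch only: accept an optional `delete_options` argument and forward it as the request body. Here `custom_api` is any object exposing `delete_namespaced_custom_object` (as `CustomObjectsApi` does), the constant values are assumed, and the function is a standalone illustration rather than the actual KatibClient method.

```python
from typing import Any, Optional

# Assumed values for the constants referenced in the diff above.
KUBEFLOW_GROUP = "kubeflow.org"
KATIB_VERSION = "v1beta1"
EXPERIMENT_PLURAL = "experiments"


def delete_experiment(
    custom_api: Any,
    name: str,
    namespace: str,
    delete_options: Optional[Any] = None,
) -> None:
    """Delete an Experiment, forwarding caller-supplied delete options."""
    # Fall back to an empty body when the caller passes nothing; a real
    # client would likely default to client.V1DeleteOptions() instead.
    body = delete_options if delete_options is not None else {}
    custom_api.delete_namespaced_custom_object(
        KUBEFLOW_GROUP,
        KATIB_VERSION,
        namespace,
        EXPERIMENT_PLURAL,
        name,
        body=body,
    )
```

With the Kubernetes Python client installed, a caller could pass `delete_options=client.V1DeleteOptions(grace_period_seconds=30)`.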

johnugeorge · 2023-01-09T17:35:47Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

+    args = parser.parse_args()
+
+    if args.verbose == "False":
+        logging.getLogger().setLevel(logging.WARNING)


Instead, could we set the default level to DEBUG and change it to INFO when verbose is False?

Also, set the start/end logs to INFO and all other logs to DEBUG. That way, if verbose is not set, the user will only see the start log and the final Experiment status.
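
A minimal sketch of that level scheme, assuming a hypothetical `configure_logging` helper (the name and the message format are illustrative, not taken from the test script):

```python
import logging


def configure_logging(verbose: bool) -> int:
    """Set the root logger to DEBUG when verbose, otherwise INFO."""
    level = logging.DEBUG if verbose else logging.INFO
    logging.basicConfig(format="%(message)s")
    logging.getLogger().setLevel(level)
    return level


def log_example_run() -> None:
    # Start/end messages use INFO, so they are always visible.
    logging.info("E2E started")
    # Intermediate details use DEBUG, so they appear only when verbose.
    logging.debug("Intermediate Experiment state")
    logging.info("E2E finished")
```

With `configure_logging(False)`, only the two INFO lines are emitted; with `configure_logging(True)`, the DEBUG line is shown as well.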

johnugeorge · 2023-01-09T17:37:07Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

+    try:
+        run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
+        logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")


For more clarity, use "Succeeded" instead of "completed".

tenzen-y · 2023-01-09T19:17:55Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

+    # Describe the Experiment and Suggestion.
+    logging.info(
+        os.popen(f"kubectl describe experiment {exp_name} -n {exp_namespace}").read()
+    )
+    logging.info("---------------------------------------------------------------")
+    logging.info("---------------------------------------------------------------")
+    logging.info(
+        os.popen(f"kubectl describe suggestion {exp_name} -n {exp_namespace}").read()
+    )


Is there a reason to use the kubectl command instead of the Python SDK?
It would be good to use the Python SDK.

@tenzen-y I guess we used kubectl describe for better visibility. Let me just print the Experiment and Suggestion:

logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))

anencore94 · 2023-01-10T15:48:38Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

+    try:
+        run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
+        logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
+        logging.info("---------------------------------------------------------------")
+        logging.info("---------------------------------------------------------------")
+        # Delete the Experiment.
+        katib_client.delete_experiment(exp_name, exp_namespace)
+    except Exception as e:
+        logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is failed for Experiment: {exp_namespace}/{exp_name}")
+        logging.info("---------------------------------------------------------------")
+        logging.info("---------------------------------------------------------------")
+        # Delete the Experiment and raise an Exception.
+        katib_client.delete_experiment(exp_name, exp_namespace)
+        raise e


Since we need to delete the Experiment regardless of pass or fail, I think using finally would be nice.

Suggested change

-    try:
-        run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
-        logging.info("---------------------------------------------------------------")
-        logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
-        logging.info("---------------------------------------------------------------")
-        logging.info("---------------------------------------------------------------")
-        # Delete the Experiment.
-        katib_client.delete_experiment(exp_name, exp_namespace)
-    except Exception as e:
-        logging.info("---------------------------------------------------------------")
-        logging.info(f"E2E is failed for Experiment: {exp_namespace}/{exp_name}")
-        logging.info("---------------------------------------------------------------")
-        logging.info("---------------------------------------------------------------")
-        # Delete the Experiment and raise an Exception.
-        katib_client.delete_experiment(exp_name, exp_namespace)
-        raise e
+    logging.info("---------------------------------------------------------------")
+    try:
+        run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
+        logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
+    except Exception as e:
+        logging.info(f"E2E is failed for Experiment: {exp_namespace}/{exp_name}")
+        raise e
+    finally:
+        logging.info("---------------------------------------------------------------")
+        logging.info("---------------------------------------------------------------")
+        # Delete the Experiment and raise an Exception.
+        katib_client.delete_experiment(exp_name, exp_namespace)

Sure! Good suggestion.
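
The try/except/finally pattern discussed above can be shown in isolation. This is a sketch with hypothetical names (`run_with_cleanup`, `run`, `cleanup`) standing in for the E2E run and the Experiment deletion:

```python
import logging
from typing import Callable


def run_with_cleanup(
    run: Callable[[], None],
    cleanup: Callable[[], None],
    label: str,
) -> None:
    """Run an E2E step and always clean up, whether it passes or fails."""
    try:
        run()
        logging.info(f"E2E succeeded for Experiment: {label}")
    except Exception:
        logging.info(f"E2E failed for Experiment: {label}")
        raise  # Re-raise the original exception with its traceback.
    finally:
        # Executes on both the success and the failure path, so the
        # cleanup call appears once instead of being duplicated.
        cleanup()
```

A bare `raise` is used here instead of `raise e`; both propagate the failure to the caller after the finally block runs.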

anencore94 · 2023-01-10T15:55:04Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

+    parser.add_argument(
+        "--verbose",
+        type=str,
+        default=True,
+        choices=("True", "False"),


How about using action="store_true" rather than type=str and choices=("True", "False") for a boolean flag?
Then we could change line 238 below from if args.verbose == "False": to if not args.verbose:

https://docs.python.org/3/library/argparse.html#action
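
For reference, a small sketch of the store_true approach. The `build_parser` helper and the help text are illustrative, not taken from the test script:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a boolean --verbose flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--verbose",
        action="store_true",  # args.verbose is a real bool, False if omitted.
        help="Show verbose logs.",
    )
    return parser


def is_verbose(argv) -> bool:
    """Parse argv and report whether --verbose was passed."""
    return build_parser().parse_args(argv).verbose
```

Note that store_true makes the flag default to False, which differs from the "enabled by default" behavior requested earlier in this thread. On Python 3.9+, `action=argparse.BooleanOptionalAction` with `default=True` would keep verbose on by default and add a `--no-verbose` switch.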

anencore94 · 2023-01-10T16:02:28Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

+            PVCs = client.CoreV1Api().list_namespaced_persistent_volume_claim(
+                exp_namespace
+            )
+            is_deleted = 1
+            for i in PVCs.items:
+                if i.metadata.name == resource_name:
+                    is_deleted = 0
+            if is_deleted == 1:
+                raise Exception(
+                    "PVC is deleted for FromVolume resume policy. "
+                    f"Alive PVCs: {[i.metadata.name for i in PVCs.items]}."
+                )


How about changing these lines to use read_namespaced_persistent_volume_claim(resource_name, exp_namespace) and checking whether this method raises the 404 Not Found exception, and then raising our Exception?

As written, I have to read this code more carefully to understand what it is meant to test.

Sure, I thought we only had the list API 😄
I am not sure why the Kubernetes Python client uses get for the Custom Resources API and read for the core Kubernetes resources.
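
A sketch of the read-based check, written so it runs without a cluster: `pvc_exists` is a hypothetical helper, and `read_pvc` is any callable with the `(name, namespace)` signature of `CoreV1Api.read_namespaced_persistent_volume_claim`. It relies on the raised exception carrying the HTTP code in a `status` attribute, as the Kubernetes client's `ApiException` does.

```python
from typing import Any, Callable


def pvc_exists(
    read_pvc: Callable[[str, str], Any],
    name: str,
    namespace: str,
) -> bool:
    """Return False only when the read call fails with HTTP 404."""
    try:
        read_pvc(name, namespace)
        return True
    except Exception as e:
        # Treat 404 as "not found"; anything else is a real error.
        if getattr(e, "status", None) == 404:
            return False
        raise
```

With the Kubernetes client installed, a test could call `pvc_exists(client.CoreV1Api().read_namespaced_persistent_volume_claim, resource_name, exp_namespace)` and raise its own Exception when the result is False.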

andreyvelich · 2023-01-11T13:03:35Z

Thanks for your review @tenzen-y @anencore94 @johnugeorge @terrytangyuan.
I've addressed your suggestions; please take a look.

google-oss-prow · 2023-01-11T13:44:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tenzen-y

@andreyvelich Looks great! Thanks for your tremendous effort!
/lgtm

johnugeorge · 2023-01-12T17:13:11Z

LGTM

Thanks @andreyvelich

andreyvelich · 2023-01-16T16:40:01Z

Thanks everyone for the review!
/hold cancel

andreyvelich added 3 commits January 4, 2023 17:58

[SDK] Use Katib SDK for E2E tests

709c8ad

Fix pvc deletion

1abffb9

Add list_suggestions API

be11377

google-oss-prow bot added do-not-merge/work-in-progress do-not-merge/hold approved labels Jan 4, 2023

google-oss-prow bot requested review from anencore94, johnugeorge and tenzen-y January 4, 2023 19:02

google-oss-prow bot added the size/XXL label Jan 4, 2023

terrytangyuan reviewed Jan 4, 2023

sdk/python/v1beta1/kubeflow/katib/constants/constants.py

sdk/python/v1beta1/kubeflow/katib/utils/utils.py

andreyvelich added 5 commits January 5, 2023 14:23

Remove wait from edit Experiment function

5c0f944

Add shell to GitHub action

bfd0ceb

Add protobuf package to Katib SDK

80e4efa

Add Experiment Timeout to 40 min

f4187e3

Modify SDK Examples

edbd3a0

Fix example text

3f57005

andreyvelich changed the title ~~[WIP] [SDK] Use Katib SDK for E2E Tests~~ [SDK] Use Katib SDK for E2E Tests Jan 6, 2023

google-oss-prow bot removed the do-not-merge/work-in-progress label Jan 6, 2023

Change to custom_api

296c781

johnugeorge mentioned this pull request Jan 8, 2023

Conformance tests for Katib #2044

Closed

johnugeorge reviewed Jan 9, 2023

andreyvelich added 2 commits January 9, 2023 12:08

Enable verbose logging for Katib E2E

8f283ed

Use expected condition arg

98a12ca

tenzen-y reviewed Jan 9, 2023

johnugeorge reviewed Jan 9, 2023

tenzen-y reviewed Jan 9, 2023

anencore94 reviewed Jan 10, 2023

andreyvelich mentioned this pull request Jan 10, 2023

[SDK] Generate Docs for Katib Client #2081

Open

andreyvelich added 3 commits January 10, 2023 21:21

Add timeout and delete options

2a79a6f

Modify logging to debug

99821e0

Use read API to check resource status

a84c920

andreyvelich force-pushed the e2e-katib-sdk branch from 6d9afd3 to a84c920 Compare January 11, 2023 12:41

terrytangyuan approved these changes Jan 11, 2023

tenzen-y reviewed Jan 12, 2023

google-oss-prow bot assigned tenzen-y Jan 12, 2023

google-oss-prow bot added the lgtm label Jan 12, 2023

google-oss-prow bot removed the do-not-merge/hold label Jan 16, 2023

google-oss-prow bot merged commit 6bcbd25 into kubeflow:master Jan 16, 2023

andreyvelich deleted the e2e-katib-sdk branch January 16, 2023 16:41

andreyvelich mentioned this pull request Jan 16, 2023

Add support for entire kubeflow pipelines as trial target (in addition to containers) #1914

Open
