Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDK] Use Katib SDK for E2E Tests #2075

Merged
merged 15 commits into from Jan 16, 2023

Conversation

andreyvelich
Copy link
Member

Related: #2024, #2044

I used Katib SDK to run the E2E test.
Also I made the following changes in the SDK:

  1. I used the unify style for our Python APIs and Exceptions, so users can easily understand it. We need to create script to automatically generate docs from our Katib Client. Currently, this doc is outdated: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/docs/KatibClient.md.
    Any existing tools that we can use ?

  2. I introduced the following changes in APIs:

    • get_experiment_status -> get_experiment_conditions to return Experiment conditions.
    • is_experiment_created the new API to check Experiment status.
    • is_experiment_running the new API to check Experiment status.
    • is_experiment_restarting the new API to check Experiment status.
    • is_experiment_succeeded the new API to check Experiment status.
    • is_experiment_failed the new API to check Experiment status.
    • wait_for_experiment_condition the new API to wait until Experiment reaches condition. (Similar to Training Operator.)
    • edit_experiment_budget the new API to modify Experiment budget in-place.
    • get_suggestion split between get_suggestion and list_suggestions similar to list_experiments.
    • get_trial to get Trial CR object.
    • get_optimal_hyperparameters returns V1beta1OptimalTrial object.
  3. Please let me know what do you think about API style, can we improve it better ?

It would be great if you could start reviewing this.
Also, if you think that change is too big for upcoming release, we can postpone it.

TODO: I need to update examples that use Katib SDK. Will do it during the week.

/hold

cc @gaocegege @johnugeorge @anencore94 @tenzen-y @kubeflow/wg-training-leads

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@andreyvelich
Copy link
Member Author

I've changed tune and train and CMA-ES SDK examples.
We might also update the NAS example in the future.

@andreyvelich andreyvelich changed the title [WIP] [SDK] Use Katib SDK for E2E Tests [SDK] Use Katib SDK for E2E Tests Jan 6, 2023
@tenzen-y
Copy link
Member

tenzen-y commented Jan 6, 2023

@andreyvelich Thanks for improving our E2E.
I'm going to review this PR now.

Also, it would be good to modify the E2E test to verify Python SDK operations in various Python versions, as discussed in #2057 (comment).

Although, we can follow up on that in another PR.

max_trial_count = experiment.spec.max_trial_count + 1
parallel_trial_count = experiment.spec.parallel_trial_count + 1
print(
f"Restarting Experiment {exp_namespace}/{exp_name} "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this Restarting experiment for random?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)

# Describe the Experiment and Suggestion.
print(os.popen(f"kubectl describe experiment {exp_name} -n {exp_namespace}").read())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add verbose as an optional parameter that can be enabled/disabled to show verbose logs? (Enabled by default)

This will help in tests like conformance tests where we are only interested in final experiment success/failure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me enable it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnugeorge I enabled logging for Katib E2Es, what do you think about it ?

)
)
try:
response = utils.FakeResponse(thread.get(constants.APISERVER_TIMEOUT))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think It might help to make configurable TIMEOUT. So, can we add duration until timeout to function args?
WDYT?

    def get_experiment(
        self, name: str, namespace: str = utils.get_default_target_namespace(), duration,
    ):

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me update it.

"""

try:
self.api_instance.delete_namespaced_custom_object(
self.custom_api.delete_namespaced_custom_object(
constants.KUBEFLOW_GROUP,
constants.KATIB_VERSION,
namespace,
constants.EXPERIMENT_PLURAL,
name,
body=client.V1DeleteOptions(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make configurable client.V1DeleteOptions()? It helps users to set grace_period_seconds and more.
WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we can allow it.

args = parser.parse_args()

if args.verbose == "False":
logging.getLogger().setLevel(logging.WARNING)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, set default to Debug and change to INFO when verbose is False?

Also, set start/end logs to INFO and rest all logs to DEBUG. So, if verbose is not set, user will just see logs start and the final experiment status.

try:
run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For more clarity, keepSucceeded instead of completed

Comment on lines 209 to 217
# Describe the Experiment and Suggestion.
logging.info(
os.popen(f"kubectl describe experiment {exp_name} -n {exp_namespace}").read()
)
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
logging.info(
os.popen(f"kubectl describe suggestion {exp_name} -n {exp_namespace}").read()
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there intentions to use the kubectl command, not Python SDK?
It would be good to use Python SDK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y I guess, we used kubectl describe for better visibility. Let me just print the Experiment and Suggestion:

logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))

Comment on lines 278 to 293
try:
run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
# Delete the Experiment.
katib_client.delete_experiment(exp_name, exp_namespace)
except Exception as e:
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is failed for Experiment: {exp_namespace}/{exp_name}")
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
# Delete the Experiment and raise an Exception.
katib_client.delete_experiment(exp_name, exp_namespace)
raise e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we need to delete experiment regardless of pass/fail, I think using finally would be nice

Suggested change
try:
run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
# Delete the Experiment.
katib_client.delete_experiment(exp_name, exp_namespace)
except Exception as e:
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is failed for Experiment: {exp_namespace}/{exp_name}")
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
# Delete the Experiment and raise an Exception.
katib_client.delete_experiment(exp_name, exp_namespace)
raise e
logging.info("---------------------------------------------------------------")
try:
run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
except Exception as e:
logging.info(f"E2E is failed for Experiment: {exp_namespace}/{exp_name}")
raise e
finally:
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
# Delete the Experiment and raise an Exception.
katib_client.delete_experiment(exp_name, exp_namespace)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Good suggestion.

Comment on lines 228 to 232
parser.add_argument(
"--verbose",
type=str,
default=True,
choices=("True", "False"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using action=store_true rather than using type=str and choices=("True", "False") for booelan type?
Then we could change the following 238th line from if args.verbose == "False": to if args.verbose:

Comment on lines 141 to 152
PVCs = client.CoreV1Api().list_namespaced_persistent_volume_claim(
exp_namespace
)
is_deleted = 1
for i in PVCs.items:
if i.metadata.name == resource_name:
is_deleted = 0
if is_deleted == 1:
raise Exception(
"PVC is deleted for FromVolume resume policy. "
f"Alive PVCs: {[i.metadata.name for i in PVCs.items]}."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about change this lines to use read_namespaced_persistent_volume_claim(resource_name, exp_namespace) and check whether this method raise the 404 not found exception or not. And then raise our Exception ?

Since I think i have to read this source code more carefully to understand what this source code wants to test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I thought we have only list API 😄
I am not sure why Kubernetes API Python client named get API for Customer Resources and read API for Core Kubernetes Resources.

@andreyvelich
Copy link
Member Author

Thanks for your review @tenzen-y @anencore94 @johnugeorge @terrytangyuan.
I addressed your suggestions, please check it.

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Looks great! Thanks for your tremendous effort!
/lgtm

@johnugeorge
Copy link
Member

LGTM

Thanks @andreyvelich

@andreyvelich
Copy link
Member Author

Thanks everyone for the review!
/hold cancel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants