
WIP for launching batch runs on AWS directly from COVIDScenarioPipeline #223

Closed · wants to merge 12 commits into dev from dataseed_batch

Conversation

@jwills (Collaborator) commented Apr 20, 2020

Going to be iterating on this for a bit, but posting it for feedback from folks now.

import yaml

@click.command()
@click.option("-p", "--job-prefix", type=str, required=True,
Collaborator

nit: intelligent defaults to reduce the number of required params would be nice. e.g., job-prefix could be the name in the config file plus a uniquifier, num-jobs could be calculated from the # of scenarios / # of sims with some cap, etc.
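A minimal sketch of the defaults suggested above; the function names, the cap value, and the sims-per-job default are all illustrative, not from the PR:

```python
import math
import uuid

MAX_JOBS = 100  # illustrative cap on the number of batch jobs

def default_num_jobs(num_scenarios: int, num_sims: int, sims_per_job: int = 10) -> int:
    """Derive num-jobs from the number of scenarios and simulations, capped."""
    jobs = num_scenarios * math.ceil(num_sims / sims_per_job)
    return min(jobs, MAX_JOBS)

def default_job_prefix(config_name: str) -> str:
    """Derive job-prefix from the config's name plus a short uniquifier."""
    return f"{config_name}-{uuid.uuid4().hex[:8]}"
```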

# Update and save the config file with the number of sims to run
print("Updating config file %s to run %d simulations..." % (config_file, sims_per_job))
config = open(config_file).read()
config = re.sub(r"nsimulations: \d+", "nsimulations: %d" % sims_per_job, config)
Collaborator

I would use pyyaml or the utils.config module for the read-modify-write instead of a regex; this seems error-prone.

Collaborator Author

Yeah, I did at first, but the output was a total re-ordering of the entries in the config file, which seemed not great. I wonder if there's a way to keep the ordering of the entries consistent across the read-edit-write cycle?

Collaborator

According to this thread, you need to set sort_keys=False on the dump. Trying it locally, this preserves key order in the file, but strips comments and whitespace.

Collaborator Author

I feel like we would be okay with that, right?

Collaborator

yeah, i think it's fine
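For reference, the read-edit-write cycle discussed above can be sketched as follows; the config contents here are made up for illustration. Parsing drops comments and blank lines (as noted), and `sort_keys=False` on the dump keeps the keys in file order:

```python
import yaml

# Illustrative config; the real file's contents come from the pipeline.
original = """\
name: covid-pipeline
nsimulations: 50
scenarios: [low, high]
"""

config = yaml.safe_load(original)  # parsing discards comments/whitespace
config["nsimulations"] = 25
# sort_keys=False preserves dict insertion order, so keys come back in
# the same order they appeared in the file
updated = yaml.dump(config, sort_keys=False)
```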

{"name": "DVC_OUTPUTS", "value": " ".join(dvc_outputs)},
{"name": "S3_RESULTS_PATH", "value": results_path}
]
s3_cp_run_script = "aws s3 cp s3://%s/%s $PWD/run-covid-pipeline" % (s3_input_bucket, runner_script_name)
Collaborator

(aside: in case you don't know about f-strings in Python 3.6+, they make substituting variables much easier/neater)

Collaborator Author

I do not really; will take a look!
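For reference, the %-style line above next to its f-string equivalent; the bucket value is the default from the diff, the script name is illustrative:

```python
s3_input_bucket = "idd-input-data-sets"
runner_script_name = "run-covid-pipeline.sh"  # illustrative name

# %-style formatting, as in the current diff
old = "aws s3 cp s3://%s/%s $PWD/run-covid-pipeline" % (s3_input_bucket, runner_script_name)

# f-string (Python 3.6+): expressions are substituted inline
new = f"aws s3 cp s3://{s3_input_bucket}/{runner_script_name} $PWD/run-covid-pipeline"

assert old == new
```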

Comment on lines 23 to 26
@click.option("-i", "--s3-input-bucket", "s3_input_bucket", type=str, default="idd-input-data-sets")
@click.option("-o", "--s3-output-bucket", "s3_output_bucket", type=str, default="idd-pipeline-results")
@click.option("-d", "--job-definition", "batch_job_definition", type=str, default="Batch-CovidPipeline-Job")
@click.option("-q", "--job-queue", "batch_job_queue", type=str, default="Batch-CovidPipeline")
Collaborator

nit: for these params, i think you want show_default=True set so that --help shows what the default value is

Collaborator

also missing helpstrings :)
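A sketch of two of the options above with show_default=True and help strings added; the help text is invented for illustration:

```python
import click

@click.command()
@click.option("-i", "--s3-input-bucket", "s3_input_bucket", type=str,
              default="idd-input-data-sets", show_default=True,
              help="S3 bucket holding the input data sets.")
@click.option("-q", "--job-queue", "batch_job_queue", type=str,
              default="Batch-CovidPipeline", show_default=True,
              help="AWS Batch job queue to submit jobs to.")
def launch(s3_input_bucket, batch_job_queue):
    """Launch a batch run."""
    click.echo(f"bucket={s3_input_bucket} queue={batch_job_queue}")
```

With show_default=True, `--help` renders each default next to its option, e.g. `[default: idd-input-data-sets]`.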

@samshah (Collaborator) commented Apr 20, 2020

nice! this looks really good -- thanks for doing it! i only have minor/trivial comments

for output in "${DVC_OUTPUTS_ARRAY[@]}"
do
echo "Saving output $output"
tar cv --use-compress-program=pbzip2 -f $output.tar.bz2 $output
Collaborator

Is compression required? We took that route initially but found that just a sync was sufficient and faster overall.

s3_cp_run_script = f"aws s3 cp s3://{s3_input_bucket}/{runner_script_name} $PWD/run-covid-pipeline"
command = ["sh", "-c", f"{s3_cp_run_script}; /bin/bash $PWD/run-covid-pipeline"]
container_overrides = {
'vcpus': 72,
Collaborator

Should one allow an override on that, in case an advanced user wants a different configuration?
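One way to allow that override, sketched with hypothetical names (the hard-coded 72 stays as the default):

```python
def build_container_overrides(environment: list, vcpus: int = 72) -> dict:
    """Build the containerOverrides payload for a Batch job submission,
    letting advanced users override the vCPU count."""
    return {"vcpus": vcpus, "environment": environment}
```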

@jkamins7 (Contributor)

Can we change this to use dev instead of dataseed? We're trying to replace dataseed with dev as our slightly less stable branch.

@jkamins7 jkamins7 changed the base branch from dataseed to dev April 22, 2020 00:36
@jwills jwills closed this May 26, 2020
@jwills jwills deleted the dataseed_batch branch May 31, 2020 20:45