
Feature Request: SageMaker Training Job Support #1392

Open
mathephysicist opened this issue Jul 29, 2020 · 2 comments
Labels
enhancement New feature or request

Comments

@mathephysicist

In order to drive usability of the platform, I propose that we make all training scripts SageMaker-compatible:

  1. Rename args, or make the corresponding args easy to wrap.
  2. Reduce the need for 'store_true' args (SageMaker forwards hyperparameters as '--key value' pairs, so bare flags don't map cleanly; see the sketch below).
  3. Add arguments that control where artifacts are stored and where datasets are read from.
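
For context, here is a minimal sketch (not from this repo) of how a SageMaker estimator launches an entry-point script; the script name, IAM role, and S3 paths are placeholders, and the parameter names assume the SageMaker Python SDK v2 MXNet estimator:

    from sagemaker.mxnet import MXNet

    estimator = MXNet(
        entry_point='train_script.py',  # placeholder training script
        role='arn:aws:iam::123456789012:role/SageMakerRole',  # placeholder IAM role
        instance_count=1,
        instance_type='ml.p3.2xlarge',
        framework_version='1.6.0',
        py_version='py3',
        # Every hyperparameter is forwarded to the script as "--key value",
        # which is why bare 'store_true' flags are awkward to drive from here.
        hyperparameters={'model': 'resnet50_v1b', 'epochs': 10, 'use-pretrained': True},
    )

    # Each channel shows up inside the container as SM_CHANNEL_TRAIN / SM_CHANNEL_TEST.
    estimator.fit({'train': 's3://my-bucket/train', 'test': 's3://my-bucket/test'})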
@szha added the enhancement (New feature or request) label Jul 29, 2020
@zhreshold
Member

We are refactoring the training scripts with estimator-style fitting functions, which is a better fit for SageMaker.

@mathephysicist
Author

Sorry for the delay; I think that works. Just to make sure we cover our bases, please look at the following to make Amazon SageMaker training easy:
• Create a dataset-root/model-root option so we can choose where datasets are read from (rather than assuming they are in a fixed location).
• Create a flag to specify where model artifacts and checkpoints are saved.
• Ensure training works on SageMaker Training (sometimes there are problems with Horovod and with downloading pre-trained models).
• Implement one of the following two solutions:

  1. Modify the arguments so that SageMaker training can work more easily.
    a. Modify the input arguments so they align with the arguments SageMaker supplies to training jobs, e.g.:

    import argparse
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])

    And add some sort of checkpoint-save flag so we can choose where to save checkpoints.

  2. Probably the better solution (for integration with more platforms):
    a. Create a “SageMaker Train” script which wraps our scripts and provides the functionality needed for 1), primarily by reading a list of environment variables and forwarding them (see the sketch after this list).
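
As a minimal sketch of option 2, assuming a hypothetical sagemaker_train.py wrapper: the wrapped script name (train_imagenet.py) and its --data-dir/--save-dir flags are placeholders, not the actual scripts' argument names.

    import os
    import subprocess
    import sys

    # Hypothetical wrapper: translate SageMaker's SM_* environment variables into
    # arguments that an existing, unmodified training script already understands.
    def main():
        cmd = [
            sys.executable, 'train_imagenet.py',  # placeholder script name
            '--data-dir', os.environ.get('SM_CHANNEL_TRAIN', './data/train'),
            '--save-dir', os.environ.get('SM_MODEL_DIR', './model'),
        ]
        # Forward any extra hyperparameters SageMaker passed to this wrapper.
        cmd.extend(sys.argv[1:])
        subprocess.check_call(cmd)

    if __name__ == '__main__':
        main()

The same wrapper could map a checkpoint location onto whatever flag the wrapped script uses; SageMaker's default local checkpoint path is /opt/ml/checkpoints.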

Let me know if I need to add more clarity. I'd be more than willing to help write the blog once we have worked out those kinks.
