
Feature Request: SageMaker Training Job Support #1392

Open
mathephysicist opened this issue Jul 29, 2020 · 2 comments
Labels
enhancement New feature or request

Comments

@mathephysicist

In order to drive usability of the platform, I propose that we make all training scripts SageMaker-compatible:

  1. Rename args, or make the corresponding args easy to wrap.
  2. Reduce the need for 'store_true' args (SageMaker forwards hyperparameters as '--key value' pairs, so bare flags don't map cleanly; see the sketch below).
  3. Add arguments that control where artifacts are stored and where datasets are read from.
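
For context, here is a minimal sketch (not from this repo) of how a SageMaker estimator launches an entry-point script; the script name, IAM role, and S3 paths are placeholders, and the parameter names assume the SageMaker Python SDK v2 MXNet estimator:

    from sagemaker.mxnet import MXNet

    estimator = MXNet(
        entry_point='train_script.py',  # placeholder training script
        role='arn:aws:iam::123456789012:role/SageMakerRole',  # placeholder IAM role
        instance_count=1,
        instance_type='ml.p3.2xlarge',
        framework_version='1.6.0',
        py_version='py3',
        # Every hyperparameter is forwarded to the script as "--key value",
        # which is why bare 'store_true' flags are awkward to drive from here.
        hyperparameters={'model': 'resnet50_v1b', 'epochs': 10, 'use-pretrained': True},
    )

    # Each channel shows up inside the container as SM_CHANNEL_TRAIN / SM_CHANNEL_TEST.
    estimator.fit({'train': 's3://my-bucket/train', 'test': 's3://my-bucket/test'})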
@szha added the enhancement (New feature or request) label Jul 29, 2020
@zhreshold
Member

We are refactoring the training scripts with estimator-style fitting functions, which is a better fit for SageMaker.

@mathephysicist
Author

Sorry for the delay; I think that works. Just to make sure we cover our bases, please look at the following to make Amazon SageMaker training easy:
• Create a dataset-root/model-root option so we can choose where datasets are read from (rather than assuming they are in a fixed location).
• Create a flag to specify where model artifacts and checkpoints are saved.
• Ensure training works on SageMaker Training (sometimes there are problems with Horovod and with downloading pre-trained models).
• Implement one of the following two solutions:

  1. Modify the arguments so that SageMaker training can work more easily.
    a. Modify the input arguments so they align with the arguments SageMaker supplies to training jobs, e.g.:

    import argparse
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])

    And add some sort of checkpoint-save flag so we can choose where to save checkpoints.

  2. Probably the better solution (for integration with more platforms):
    a. Create a “SageMaker Train” script which wraps our scripts and provides the functionality needed for 1), primarily by reading a list of environment variables and forwarding them (see the sketch after this list).
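
As a minimal sketch of option 2, assuming a hypothetical sagemaker_train.py wrapper: the wrapped script name (train_imagenet.py) and its --data-dir/--save-dir flags are placeholders, not the actual scripts' argument names.

    import os
    import subprocess
    import sys

    # Hypothetical wrapper: translate SageMaker's SM_* environment variables into
    # arguments that an existing, unmodified training script already understands.
    def main():
        cmd = [
            sys.executable, 'train_imagenet.py',  # placeholder script name
            '--data-dir', os.environ.get('SM_CHANNEL_TRAIN', './data/train'),
            '--save-dir', os.environ.get('SM_MODEL_DIR', './model'),
        ]
        # Forward any extra hyperparameters SageMaker passed to this wrapper.
        cmd.extend(sys.argv[1:])
        subprocess.check_call(cmd)

    if __name__ == '__main__':
        main()

The same wrapper could map a checkpoint location onto whatever flag the wrapped script uses; SageMaker's default local checkpoint path is /opt/ml/checkpoints.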

Let me know if I need to add more clarity. I'd be more than willing to help write the blog once we have worked out those kinks.
