
Preprocessing, training improvements. Reproducing previous implementation metrics. Augmentation. #10

Merged: 301 commits into main on Sep 1, 2023

Conversation

@josejuanmartinez commented on Jul 21, 2023

  • Fixes to the preprocessing and training steps.
  • Found hyperparameters that bring us close to the previous implementation's metrics (~0.61).
  • New augmentation module.

On the nick instance:

  • a total time of 1h for preprocessing (compared to 4h before)
  • 130h (5.5 days) for training (the previous reference was 16 days, but I'm not sure the number of epochs is the same; I need to double-check)

On Wellcome's g5.12xlarge instance:

  • a total time of 45min for preprocessing
  • 27h for training

Checks:

  • Test that nothing else is broken
  • Create unit test
  • Fix legacy non-working unit test
  • Update documentation
  • Run black and ruff

Summary of fixes
Preprocessing

  • Refactors all preprocessing of the MeSH dataset so that it runs in parallel at 100% CPU utilization.
  • Transforms the JSON input into JSONL to allow parallelism, and loads it with the native datasets.load_dataset("json") loader, which is built for parallel JSONL reads.
  • Removes I/O operations. Saving the dataset forces all preprocessing to serialize on the write, which adds a lot of latency, and the dataset then has to be loaded again at train time. Instead, train calls preprocessing directly and receives the data without any I/O round-trip.
  • Adds num_proc=os.cpu_count() (see the sketch after this list).
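
A minimal sketch of the parallel load-and-map flow described above; the JSONL path and the placeholder transform are hypothetical, not the PR's actual code:

```python
import os

from datasets import load_dataset

def preprocess(jsonl_path: str):
    # JSON Lines can be read with the native "json" loader, which shards
    # the work across processes; a single monolithic JSON document cannot.
    dset = load_dataset(
        "json",
        data_files=jsonl_path,
        split="train",
        num_proc=os.cpu_count(),  # saturate all CPUs
    )
    # Transform in parallel as well; lowercasing is a placeholder for the
    # real tokenization/label-encoding step.
    dset = dset.map(
        lambda batch: {"text": [t.lower() for t in batch["text"]]},
        batched=True,
        num_proc=os.cpu_count(),
    )
    # Returned in memory: no intermediate save/load round-trip before training.
    return dset
```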

Training

Evaluation metrics with the tuned hyperparameters (micro-average F1 ≈ 0.60, close to the previous implementation's ~0.61):

```python
{'eval_loss': 0.0012127996888011694,
 'eval_micro_avg': {'precision': 0.5457992998833139, 'recall': 0.6772033540447608, 'f1-score': 0.6044420514201103, 'support': 99462},
 'eval_macro_avg': {'precision': 0.1482883180673584, 'recall': 0.16534744144274516, 'f1-score': 0.1469639216235595, 'support': 99462},
 'eval_weighted_avg': {'precision': 0.6706617304030531, 'recall': 0.6772033540447608, 'f1-score': 0.6601941544336467, 'support': 99462},
 'eval_samples_avg': {'precision': 0.5605420571233047, 'recall': 0.679337366894562, 'f1-score': 0.5909342069582941, 'support': 99462},
 'eval_runtime': 1713.0151,
 'eval_samples_per_second': 5.838,
 'eval_steps_per_second': 1.459}
```
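
The shape of this dict matches sklearn's classification_report(output_dict=True) for multi-label predictions, with the "eval_" prefix added by the Hugging Face Trainer. A hedged sketch of a compute_metrics function that would produce it; the sigmoid threshold and the key renaming are assumptions, not the PR's exact code:

```python
import numpy as np
from sklearn.metrics import classification_report

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = 1 / (1 + np.exp(-logits)) > 0.5  # sigmoid + 0.5 threshold (assumed)
    report = classification_report(labels, preds, output_dict=True, zero_division=0)
    # Keep only the aggregate rows and make the keys Trainer-friendly;
    # the Trainer itself prepends "eval_" during evaluation.
    return {
        name.replace(" ", "_"): report[name]
        for name in ("micro avg", "macro avg", "weighted avg", "samples avg")
    }
```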

Augmentation

Uses OpenAI to generate augmented examples for a given series of tags, in either sequential or parallel mode. The augmentations were shown to Wellcome, who approved them.
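
A minimal sketch of the tag-driven augmentation flow; the model name, prompt, and helper names are illustrative assumptions, not the PR's actual module:

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def augment_tag(tag: str) -> str:
    """Ask the model for one synthetic example for a single MeSH tag."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model, not confirmed by the PR
        messages=[{
            "role": "user",
            "content": f"Write a short biomedical abstract about: {tag}",
        }],
    )
    return response.choices[0].message.content

def augment(tags: list[str], parallel: bool = True) -> list[str]:
    if parallel:
        # Fan the API calls out across threads; the work is I/O-bound.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(augment_tag, tags))
    # Sequential mode: one request at a time.
    return [augment_tag(tag) for tag in tags]
```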

@josejuanmartinez changed the base branch from main to 3-improve-trainer on Jul 21, 2023 15:56
@josejuanmartinez marked this pull request as ready for review on Sep 1, 2023 13:45
@josejuanmartinez changed the title from "Preprocessing and training improvements" to "Preprocessing, training improvements. Reproducing previous implementation metrics. Augmentation." on Sep 1, 2023
@josejuanmartinez merged commit 8ae9f7a into main on Sep 1, 2023
3 checks passed
@josejuanmartinez deleted the 3-improve-trainer-juan branch on Sep 1, 2023 13:47