
Preprocessing, training improvements. Reproducing previous implementation metrics. Augmentation. #10

Merged: 301 commits into main on Sep 1, 2023

Conversation

@josejuanmartinez commented on Jul 21, 2023

  • Fixes to the preprocessing and training steps.
  • Found hyperparameters that bring us close to the previous implementation's metrics (~0.61).
  • New augmentation module.

On the nick instance:

  • a total time of 1h for preprocessing (compared to 4h before)
  • 130h (5.5 days) for training (the previous reference was 16 days, but I'm not sure the number of epochs is the same; I need to double-check)

On Wellcome's g5.12xlarge instance:

  • a total time of 45min for preprocessing
  • 27h for training

Checks:

  • Test that nothing else is broken
  • Create unit test
  • Fix legacy non-working unit test
  • Update documentation
  • Run black and ruff

Summary of fixes
Preprocessing

  • Refactors all preprocessing of the MeSH dataset so that it runs in parallel at 100% CPU utilization.
  • Transforms the JSON input into JSONL to allow parallelism, and loads it with the native datasets.load_dataset("json") loader, which is built for parallel JSONL reads.
  • Removes I/O operations. Saving the dataset forces all preprocessing to serialize on the write, which adds a lot of latency, and the dataset then has to be loaded again at train time. Instead, train calls preprocessing directly and receives the data without any I/O round-trip.
  • Adds num_proc=os.cpu_count() (see the sketch after this list).
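
A minimal sketch of the parallel load-and-map flow described above; the JSONL path and the placeholder transform are hypothetical, not the PR's actual code:

```python
import os

from datasets import load_dataset

def preprocess(jsonl_path: str):
    # JSON Lines can be read with the native "json" loader, which shards
    # the work across processes; a single monolithic JSON document cannot.
    dset = load_dataset(
        "json",
        data_files=jsonl_path,
        split="train",
        num_proc=os.cpu_count(),  # saturate all CPUs
    )
    # Transform in parallel as well; lowercasing is a placeholder for the
    # real tokenization/label-encoding step.
    dset = dset.map(
        lambda batch: {"text": [t.lower() for t in batch["text"]]},
        batched=True,
        num_proc=os.cpu_count(),
    )
    # Returned in memory: no intermediate save/load round-trip before training.
    return dset
```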

Training

Evaluation metrics with the tuned hyperparameters (micro-average F1 ≈ 0.60, close to the previous implementation's ~0.61):

```python
{'eval_loss': 0.0012127996888011694,
 'eval_micro_avg': {'precision': 0.5457992998833139, 'recall': 0.6772033540447608, 'f1-score': 0.6044420514201103, 'support': 99462},
 'eval_macro_avg': {'precision': 0.1482883180673584, 'recall': 0.16534744144274516, 'f1-score': 0.1469639216235595, 'support': 99462},
 'eval_weighted_avg': {'precision': 0.6706617304030531, 'recall': 0.6772033540447608, 'f1-score': 0.6601941544336467, 'support': 99462},
 'eval_samples_avg': {'precision': 0.5605420571233047, 'recall': 0.679337366894562, 'f1-score': 0.5909342069582941, 'support': 99462},
 'eval_runtime': 1713.0151,
 'eval_samples_per_second': 5.838,
 'eval_steps_per_second': 1.459}
```
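
The shape of this dict matches sklearn's classification_report(output_dict=True) for multi-label predictions, with the "eval_" prefix added by the Hugging Face Trainer. A hedged sketch of a compute_metrics function that would produce it; the sigmoid threshold and the key renaming are assumptions, not the PR's exact code:

```python
import numpy as np
from sklearn.metrics import classification_report

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = 1 / (1 + np.exp(-logits)) > 0.5  # sigmoid + 0.5 threshold (assumed)
    report = classification_report(labels, preds, output_dict=True, zero_division=0)
    # Keep only the aggregate rows and make the keys Trainer-friendly;
    # the Trainer itself prepends "eval_" during evaluation.
    return {
        name.replace(" ", "_"): report[name]
        for name in ("micro avg", "macro avg", "weighted avg", "samples avg")
    }
```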

Augmentation

Uses OpenAI to generate augmented examples for a given series of tags, in either sequential or parallel mode. The augmentations were shown to Wellcome, who approved them.
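
A minimal sketch of the tag-driven augmentation flow; the model name, prompt, and helper names are illustrative assumptions, not the PR's actual module:

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def augment_tag(tag: str) -> str:
    """Ask the model for one synthetic example for a single MeSH tag."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model, not confirmed by the PR
        messages=[{
            "role": "user",
            "content": f"Write a short biomedical abstract about: {tag}",
        }],
    )
    return response.choices[0].message.content

def augment(tags: list[str], parallel: bool = True) -> list[str]:
    if parallel:
        # Fan the API calls out across threads; the work is I/O-bound.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(augment_tag, tags))
    # Sequential mode: one request at a time.
    return [augment_tag(tag) for tag in tags]
```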

@josejuanmartinez changed the base branch from main to 3-improve-trainer on Jul 21, 2023 15:56
@josejuanmartinez marked this pull request as ready for review on Sep 1, 2023 13:45
@josejuanmartinez changed the title from "Preprocessing and training improvements" to "Preprocessing, training improvements. Reproducing previous implementation metrics. Augmentation." on Sep 1, 2023
@josejuanmartinez merged commit 8ae9f7a into main on Sep 1, 2023
3 checks passed
@josejuanmartinez deleted the 3-improve-trainer-juan branch on Sep 1, 2023 13:47