Deep Learning Best Practices

Following the work made by Nishant Gaurav (fast.ai student) at http://forums.fast.ai/t/30-best-practices/12344 I'm sharing some Deep Learning best practices I learned following as an International Fellow the courses Practical Deep Learning for Coders and Cutting Edge Deep Learning for Coders (2018 version) taken by Jeremy Howard and Rachel Thomas at fast.ai.

Dataset

Do as much of your work as you can on a small sample of the data.
Public leaderboard of Kaggle is not a replacement for Validation Set. Jeremy showed how he was ranked in 5th in the private leaderboard of Rossmann Competition, whereas he was not in top 300 in public leaderboard. In another example, The test set of public leaderboard (in Iceberg Satellite image Competition) contained mostly of augmented images.
Look into data. Remove outliers which make sense and there is no other variable to capture those outliers: like in Rossmann Competition, the date&timings for closed stores were not known. There is extra sale before and after close period. So, if you don’t have any data to model the outliers, you need to remove them during training.
Look into training: After training the cats vs dogs we could see that some incorrect classified images were mostly misclassified due to cropping. The solution was data augmentation.
Data Augmentation: You cannot use all possible types of augmentation. For best results we need to use the right kind of augmentation. There are problems if you work on object detection and data augmentation because you have to change your Y according to your X. Fastai by default doesn’t add black padding at borders but flips and reflects the near pixel → this gets much better results.
Test Time Augmentation (TTA): Increases accuracy.
Many models have problems when missing values are present, so it’s always important to think about how to deal with them. In these cases, we are picking an arbitrary signal value that doesn’t otherwise appear in the data.

Learning Rate

Use learning rate finder and select a learning rate where convergence of loss is steep. Do not select the biggest possible learning rate.
When using a pretrained model on some dataset like imagenet, you need to use different learning rates when you are using that model for any new dataset. The initial layers need a smaller learning rate, and the deeper layers need a comparatively larger learning rate. When the new dataset is similar to original dataset (e.g. cats vs dogs is similar to imagenet but iceberg satellite image is not) the weights have a ratio of 10. But when using the imagenet of satellite model the successive weights should have a ratio of 3.
Cosine annealing: This now supported by default in pytorch 0.3.18.
SGD with restarts: 2 setups work very well.
To tackle Gradient Explosion we use identity matrix for intialization (for RNNs). Also allows higher learning rate.

Training

bn_freeze = True: In case you are using a deeper network anything greater than resnet34 (like resnext50), bn_freeze = True should be used when you unfreeze and your new dataset is very similar to the original data used in pretrained model. Pytorch is probably the only library which offers this much needed switch.
Gradual unfreezing of layers if you're doing transfer learning.
Progressive resizing of images: Train a CNN with 244x224 image size, save weights and then train again for others epochs with images of increased size - avoids overfitting and improves generalization.
Kaiming He Initialization: Pytorch has this implemented by default.
Adam Optimizer has a better version called AdamW.

Activation functions

theoretically softmax and logsoftmax are scaled version of each other, but empirically logsoftmax is better.
sigmoid instead of softmax for multi-label classification.
applying sigmoid in the end when you know the min and max of output relieves the neural network and training is faster. This is similar to applying softmax when you know the output should be probability.
In hidden state to hidden state transition weight matrices tanh is used.
LeakyReLU has no zero-gradients for negative x (difficult to update) and it has been shown that it gives better performances on small datasets.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
img		img
learning-rate		learning-rate
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

img

img

learning-rate

learning-rate

README.md

README.md

Repository files navigation

Deep Learning Best Practices

Dataset

Learning Rate

Training

Activation functions

Architectures

About

Releases

Packages

alessiamarcolini/deep-learning_best-practices

Folders and files

Latest commit

History

Repository files navigation

Deep Learning Best Practices

Dataset

Learning Rate

Training

Activation functions

Architectures

About

Resources

Stars

Watchers

Forks