
Add SSDlite architecture with MobileNetV3 backbones #3757


Merged
merged 27 commits into pytorch:master on May 11, 2021

Conversation

@datumbox (Contributor) commented Apr 30, 2021

Resolves #1422.

This PR implements SSDlite with a MobileNetV3 backbone, as outlined in papers [1] and [2].


Trained using the code committed at 8aa3f58. The current best pre-trained model was trained (at the latest git hash) with:

python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
    --dataset coco --model ssdlite320_mobilenet_v3_large --epochs 660\
    --aspect-ratio-group-factor 3 --lr-scheduler cosineannealinglr --lr 0.15 --batch-size 24\
    --weight-decay 0.00004 --data-augmentation ssdlite

Submitted batch job 40959060, 41037042, 41046786

Accuracy metrics at 4ca472e:

Epoch 648 (reconfigured):
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.213
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.343
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.221
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.011
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.202
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.444
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.208
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.307
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.334
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.043
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.344
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.643

Validated with:

python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py\
   --dataset coco --model ssdlite320_mobilenet_v3_large --pretrained --test-only

Speed benchmark:
0.09 sec per image on CPU
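
For reference, once merged the model is exposed through the standard torchvision builder. A minimal inference sketch (the input resolution below is arbitrary; the transform resizes to 320x320 internally):

import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

# Load the pre-trained SSDlite320 model added by this PR
model = ssdlite320_mobilenet_v3_large(pretrained=True)
model.eval()

# Detection models take a list of 3xHxW tensors in [0, 1]
images = [torch.rand(3, 480, 640)]
with torch.no_grad():
    predictions = model(images)  # list of dicts with 'boxes', 'labels', 'scores'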

@datumbox marked this pull request as draft on April 30, 2021
@datumbox mentioned this pull request on Apr 30, 2021
@datumbox marked this pull request as ready for review on May 10, 2021
@datumbox changed the title from "[WIP] Add SSDlite architecture with MobileNetV3 backbones" to "Add SSDlite architecture with MobileNetV3 backbones" on May 10, 2021
@fmassa (Member) left a comment:

Looks great, thanks a lot Vasilis!

I have a couple of comments, let me know what you think

Comment on lines +195 to +196
# Enable [-1, 1] rescaling and reduced tail if no pretrained backbone is selected
rescaling = reduce_tail = not pretrained_backbone
@fmassa (Member):

This is a bit confusing, but I assume the [-1, 1] rescaling is necessary to get the best results given the current settings?

@datumbox (Contributor, Author):

That is correct. Rescaling was part of the changes needed to boost the accuracy by 1 mAP.
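
For context, mapping a [0, 1] tensor to [-1, 1] is a simple affine transform, equivalent to normalizing with mean 0.5 and std 0.5:

import torch

def rescale(x: torch.Tensor) -> torch.Tensor:
    # Map inputs from [0, 1] to [-1, 1]; equivalent to (x - 0.5) / 0.5
    return 2.0 * x - 1.0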

backbone = _mobilenet_extractor("mobilenet_v3_large", progress, pretrained_backbone, trainable_backbone_layers,
                                norm_layer, rescaling, _reduced_tail=reduce_tail, _width_mult=1.0)

size = (320, 320)
@fmassa (Member):

This means that the size is hard-coded, and even if the user passes a different size via **kwargs to the constructor it won't be used?

What about doing something like

size = kwargs.get("size", (320, 320))

instead, so that the users can potentially customize the input size if they wish?

@datumbox (Contributor, Author):

I chose to hardcode it because this is the ssdlite320 model, which uses a fixed 320x320 size. The input size is much less flexible on SSD models compared to Faster R-CNN, because they make a few strong assumptions about the input.

If someone wants to use a different size, it would be simpler to just create the backbone, configure the DefaultBoxGenerator and then initialize SSD directly with the config of their choice. Overall I felt that this approach would be simpler than trying to offer an API that covers every user need. A rough sketch of that path is shown below.
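
To illustrate the "build it yourself" path, here is a rough sketch for a custom input size. It reuses the private helper and the DefaultBoxGenerator referenced in this PR, so the exact signatures, the 512x512 size and the hyperparameters below are assumptions that may differ across versions:

from functools import partial

import torch.nn as nn
from torchvision.models.detection.ssd import SSD
from torchvision.models.detection.anchor_utils import DefaultBoxGenerator
from torchvision.models.detection.ssdlite import _mobilenet_extractor  # private helper from this PR

size = (512, 512)  # assumed custom resolution
norm_layer = partial(nn.BatchNorm2d, eps=0.001, momentum=0.03)

# Positional arguments mirror the call shown in the diff above:
# backbone_name, progress, pretrained, trainable_layers, norm_layer, rescaling
backbone = _mobilenet_extractor("mobilenet_v3_large", True, False, 6,
                                norm_layer, True,
                                _reduced_tail=True, _width_mult=1.0)

# One set of default boxes per feature map; SSDlite taps 6 maps
anchor_generator = DefaultBoxGenerator([[2, 3] for _ in range(6)],
                                       min_ratio=0.2, max_ratio=0.95)

model = SSD(backbone, anchor_generator, size, num_classes=91,
            image_mean=[0., 0., 0.], image_std=[1., 1., 1.])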

kwargs = {**defaults, **kwargs}
model = SSD(backbone, anchor_generator, size, num_classes,
            head=SSDLiteHead(out_channels, num_anchors, num_classes, norm_layer),
            image_mean=[0., 0., 0.], image_std=[1., 1., 1.], **kwargs)
@fmassa (Member):

Hmm, interesting.

I would have expected that we could remove the rescaling part and instead change the mean/std here to image_mean=0.5, image_std=0.5, but I assume this wasn't done because the padded regions would then have a different value than you would have liked, is that correct?

Also, this probably means that even if you were to use a pretrained backbone, it wouldn't give good results, because you are passing a non-default image mean/std.

In this case, it might be better to disable passing a pretrained backbone altogether?

@datumbox (Contributor, Author):

Correct. Here I'm trying to stay as close to the canonical implementation as possible, and that helped me close the accuracy gap.

You are right that a pretrained backbone would need a different mean/std. Thankfully, because our setup trains end-to-end and uses extensive BatchNorm, the backbone adapts to the different input fairly quickly even when one starts from pre-trained weights. In the end, since I trained for quite a few epochs, it was better to start from random weights, which led to a better result (this is a common finding in similar setups).

Though it might indeed be simpler for the API to disable passing a pretrained backbone, that would make SSDlite's API different from every other model's. It would also create issues for our training scripts, which expect to be able to pass this parameter. To better address this remark, I will make the mean/std configurable.
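
Once mean/std are configurable, they can be overridden through kwargs. A hypothetical usage sketch (the ImageNet statistics here are just an example, not the model's defaults):

from torchvision.models.detection import ssdlite320_mobilenet_v3_large

# Override the normalization statistics, e.g. when experimenting with a
# pre-trained backbone that expects ImageNet-style inputs
model = ssdlite320_mobilenet_v3_large(
    pretrained_backbone=True,
    image_mean=[0.485, 0.456, 0.406],
    image_std=[0.229, 0.224, 0.225],
)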

@fmassa (Member) left a comment:

Thanks for the answers Vasilis!

Chatted offline; let's get this PR merged and then follow up by creating a few issues to investigate some of the points that I brought up.

@datumbox merged commit 43d7720 into pytorch:master on May 11, 2021
@datumbox deleted the models/ssdlite branch on May 11, 2021
facebook-github-bot pushed a commit that referenced this pull request on May 19, 2021
Summary:
* Partial implementation of SSDlite.

* Add normal init and BN hyperparams.

* Refactor to keep JIT happy

* Completed SSDlite.

* Fix lint

* Update todos

* Add expected file in repo.

* Use C4 expansion instead of C4 output.

* Change scales formula for Default Boxes.

* Add cosine annealing on trainer.

* Make T_max count epochs.

* Fix test and handle corner-case.

* Add support for width_mult

* Add ssdlite presets.

* Change ReLU6, [-1,1] rescaling, backbone init & no pretraining.

* Use _reduced_tail=True.

* Add sync BN support.

* Adding the best config along with its weights and documentation.

* Make mean/std configurable.

* Fix not implemented for half exception

Reviewed By: cpuhrsch

Differential Revision: D28538769

fbshipit-source-id: df6c2e79b76e6d6297aa51ca0ff4535dc59eaf9b

get_depth = lambda d: max(min_depth, int(d * width_mult))  # noqa: E731
extra = nn.ModuleList([
@evekeen:

@datumbox could you please help me figure this out: I cannot find the info about these extra layers in the papers. Where did you get them from?
I'm trying to create a modification of this model and am struggling to understand it; any help would be appreciated!
I want to reduce the number of encoder layers to make the feature maps detect small objects.

@datumbox (Contributor, Author):

@evekeen I have written a blogpost about the implementation details of this model; see here. The extra layers are described in Section 6.3 of the paper, though to get their exact values you need to dig into the original TF code. Hope that helps!
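
For reference, each extra layer follows a 1x1 reduce, 3x3 depthwise with stride 2, 1x1 expand pattern, all with BN and ReLU6. A simplified sketch (the helper name and channel split are illustrative, mirroring torchvision's ssdlite.py):

import torch.nn as nn

def extra_block(in_channels: int, out_channels: int,
                norm_layer=nn.BatchNorm2d) -> nn.Sequential:
    # 1x1 reduce -> 3x3 depthwise (stride 2) -> 1x1 expand, with BN + ReLU6
    mid = out_channels // 2
    return nn.Sequential(
        nn.Conv2d(in_channels, mid, 1, bias=False),
        norm_layer(mid), nn.ReLU6(inplace=True),
        nn.Conv2d(mid, mid, 3, stride=2, padding=1, groups=mid, bias=False),
        norm_layer(mid), nn.ReLU6(inplace=True),
        nn.Conv2d(mid, out_channels, 1, bias=False),
        norm_layer(out_channels), nn.ReLU6(inplace=True),
    )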

@evekeen:

@datumbox Thank you for the quick reply! It's very helpful.

@evekeen:

@datumbox In Section 6.3 of the MobileNetV3 paper, I only see the info on connecting the C4 and C5 layers to the SSD head. There is nothing about these extra layers there.

@datumbox (Contributor, Author):

Have you checked the reference code I sent? This comes from their official repo.

@evekeen commented Nov 15, 2022:

Yes, I see that in the TensorFlow implementation.
I'm trying to understand: if I reduce the depth of C4 (and thus the output stride, to target very small objects), how should I change the rest of the layers?

@datumbox (Contributor, Author):

Sorry, it's been quite some time since I wrote the implementation. I think you will need to dig into the original research repo to get the details.
