Add SSDlite architecture with MobileNetV3 backbones #3757
Conversation
Looks great, thanks a lot Vasilis!
I have a couple of comments, let me know what you think
# Enable [-1, 1] rescaling and reduced tail if no pretrained backbone is selected
rescaling = reduce_tail = not pretrained_backbone
This is a bit confusing, but I assume the [-1, 1] rescaling is necessary to get the best results given the current settings?
That is correct. Rescaling was part of the changes needed to boost the accuracy by 1 mAP.
backbone = _mobilenet_extractor("mobilenet_v3_large", progress, pretrained_backbone, trainable_backbone_layers,
                                norm_layer, rescaling, _reduced_tail=reduce_tail, _width_mult=1.0)

size = (320, 320)
This means that the size is hard-coded, so even if the user passes a different size via `**kwargs` in the constructor it won't be used? What about doing something like `size = kwargs.get("size", (320, 320))` instead, so that users can customize the input size if they wish?
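For illustration, a minimal sketch of the suggested pattern; the `size` keyword is hypothetical, not an argument the current constructor accepts. Popping (rather than reading) the key avoids passing `size` twice if `kwargs` is later forwarded to the `SSD` constructor:

```python
# Hypothetical sketch of the suggested pattern: default to 320x320 but
# allow callers to override via kwargs. The "size" key is illustrative.
def _resolve_size(kwargs, default=(320, 320)):
    # pop (not get) so "size" is not forwarded a second time via **kwargs
    return kwargs.pop("size", default)

kwargs = {"size": (512, 512), "score_thresh": 0.01}
size = _resolve_size(kwargs)  # (512, 512); "size" removed from kwargs
```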
I chose to hardcode it because this is the ssdlite320 model, which uses a fixed 320x320 input. The input size is much less flexible in SSD models compared to FasterRCNN, because they make a few strong assumptions about the input.
If someone wants to use a different size, it would be simpler to just create the backbone, configure the `DefaultBoxGenerator` and then initialize the `SSD` directly with the config of their choice. Overall I felt that this approach would be simpler than trying to offer an API that covers every user need.
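For readers who want to follow that advice, here is a rough sketch of the workflow, assembled from the helper signatures visible in this PR's diff. `_mobilenet_extractor`, `SSDLiteHead`, and `retrieve_out_channels` are private internals that may change between versions, so treat this as an outline under those assumptions rather than a stable recipe:

```python
from functools import partial

from torch import nn
from torchvision.models.detection import _utils as det_utils
from torchvision.models.detection.ssd import SSD
from torchvision.models.detection.anchor_utils import DefaultBoxGenerator
from torchvision.models.detection.ssdlite import SSDLiteHead, _mobilenet_extractor

size = (512, 512)  # custom input size instead of the fixed 320x320
norm_layer = partial(nn.BatchNorm2d, eps=0.001, momentum=0.03)

# Build the backbone with the same positional call pattern as in the diff
# above (progress, pretrained_backbone, trainable_backbone_layers, rescaling).
backbone = _mobilenet_extractor("mobilenet_v3_large", True, False, 6,
                                norm_layer, True)

# Configure the default boxes for the new resolution; the aspect ratios and
# scale range here mirror the ones used for the 320x320 model.
anchor_generator = DefaultBoxGenerator([[2, 3] for _ in range(6)],
                                       min_ratio=0.2, max_ratio=0.95)
out_channels = det_utils.retrieve_out_channels(backbone, size)
num_anchors = anchor_generator.num_anchors_per_location()

# Initialize SSD directly with the custom configuration (91 = COCO classes).
model = SSD(backbone, anchor_generator, size, num_classes=91,
            head=SSDLiteHead(out_channels, num_anchors, 91, norm_layer),
            image_mean=[0., 0., 0.], image_std=[1., 1., 1.])
```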
kwargs = {**defaults, **kwargs}
model = SSD(backbone, anchor_generator, size, num_classes,
            head=SSDLiteHead(out_channels, num_anchors, num_classes, norm_layer),
            image_mean=[0., 0., 0.], image_std=[1., 1., 1.], **kwargs)
Hmm, interesting. I would have expected that we could remove the rescaling part and instead change the mean/std here to `image_mean=0.5`, `image_std=0.5`, but I assume this wasn't done because the padded regions would then have a different value than what you would have liked, is that correct?
Also, this probably means that even if you were to use a pretrained backbone, it wouldn't give good results, because you are passing a non-default image mean/std. In this case, might it be better to disable passing a pretrained backbone altogether?
Correct. Here I'm trying to stay as close to the canonical implementation as possible, and that helped me close the gap in accuracy.
You are right to say that a pretrained backbone would need a different mean/std. Thankfully, because our setup trains end-to-end and uses extensive BN, the backbone adapts to the different input fairly quickly even when one uses pretrained weights. In the end, since I trained for quite a few epochs, starting from random weights led to a better result (this is a common finding in similar setups).
Though disabling the pretrained backbone might indeed be simpler for the API, it would make the SSDlite API different from every other model's, and it would break our training scripts, which expect to be able to pass this parameter. I think the better way to address this remark is to make the mean/std configurable.
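For reference, the equivalence behind this exchange: rescaling an input in [0, 1] to [-1, 1] is the same affine transform as normalizing with mean 0.5 and std 0.5, since (x - 0.5) / 0.5 == 2x - 1. A quick check:

```python
import torch

x = torch.rand(3, 320, 320)    # image tensor in [0, 1]
rescaled = 2.0 * x - 1.0       # explicit [-1, 1] rescaling
normalized = (x - 0.5) / 0.5   # mean=0.5, std=0.5 normalization
assert torch.allclose(rescaled, normalized)
```

So once the mean/std become configurable, passing `image_mean=[0.5, 0.5, 0.5]` and `image_std=[0.5, 0.5, 0.5]` should reproduce the rescaling, modulo the padded-region concern raised above.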
Thanks for the answers Vasilis!
Chatted offline; let's get this PR merged and then follow up by creating a few issues to investigate some of the points I brought up.
Summary:
* Partial implementation of SSDlite.
* Add normal init and BN hyperparams.
* Refactor to keep JIT happy.
* Completed SSDlite.
* Fix lint.
* Update todos.
* Add expected file in repo.
* Use C4 expansion instead of C4 output.
* Change scales formula for Default Boxes.
* Add cosine annealing on trainer.
* Make T_max count epochs.
* Fix test and handle corner-case.
* Add support for width_mult.
* Add ssdlite presets.
* Change ReLU6, [-1, 1] rescaling, backbone init & no pretraining.
* Use _reduced_tail=True.
* Add sync BN support.
* Add the best config along with its weights and documentation.
* Make mean/std configurable.
* Fix "not implemented for half" exception.

Reviewed By: cpuhrsch
Differential Revision: D28538769
fbshipit-source-id: df6c2e79b76e6d6297aa51ca0ff4535dc59eaf9b
get_depth = lambda d: max(min_depth, int(d * width_mult))  # noqa: E731
extra = nn.ModuleList([
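As a side note for readers of the fragment above, `get_depth` scales a channel count by `width_mult` and clamps it from below at `min_depth`. A small worked example (the values here are illustrative, not the model's actual configuration):

```python
min_depth, width_mult = 16, 0.5

def get_depth(d):
    # scale the channel count, but never go below min_depth
    return max(min_depth, int(d * width_mult))

print(get_depth(512))  # 256
print(get_depth(24))   # 16 -- clamped up to min_depth
```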
@datumbox could you please help me figure this out: I cannot find any information about these extra layers in the papers. Where did you get them from?
I'm trying to create a modification of this model and I'm struggling to understand it; any help would be appreciated!
I want to reduce the number of encoder layers so that the feature maps can detect small objects.
@datumbox Thank you for the quick reply! It's very helpful
@datumbox in section 6.3 of the MobileNetV3 paper, I only see information on connecting the C4 and C5 layers to the SSD head. There is nothing about these extra layers there.
Have you checked the reference code I sent? This comes from their official repo.
Yes, I see that in the TensorFlow implementation.
What I'm trying to understand is: if I reduce the depth of C4 (and thus the output stride, to target very small objects), how should I change the rest of the layers?
Sorry, it's been quite some time since I wrote the implementation. I think you will need to dig into the original research repo to get the details.
Resolves #1422, fixes #3757
This PR implements SSDlite with MobileNetV3 backbone as outlined in the papers [1] and [2].
Trained using the code committed at 8aa3f58. The current best pre-trained model was trained with (using latest git hash):
Submitted batch job 40959060, 41037042, 41046786
Accuracy metrics at 4ca472e:
Validated with:
Speed benchmark:
0.09 sec per image on CPU
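For completeness, a minimal usage sketch of the merged model for inference; `ssdlite320_mobilenet_v3_large` is the entry point added by this PR:

```python
import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

model = ssdlite320_mobilenet_v3_large(pretrained=True)
model.eval()

images = [torch.rand(3, 320, 320)]  # list of CHW tensors with values in [0, 1]
with torch.no_grad():
    detections = model(images)      # per-image dicts with boxes, labels, scores
print(detections[0]["boxes"].shape)
```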