add DynamicBalanceClassSampler #954
Conversation
@@ -195,6 +195,100 @@ def __iter__(self) -> Iterator[int]:
        return iter(inds)


class DynamicBalanceClassSampler(Sampler):
hi, @Dokholyan
Could you please provide a small example for this DynamicBalanceClassSampler
usage?
for example, like here - https://github.com/catalyst-team/catalyst/blob/master/catalyst/data/sampler.py#L306L325
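For readers outside the thread, the kind of usage example being requested might look like the following. This is a stdlib-only toy re-implementation of the sampler's idea (exponentially decaying class imbalance), not the actual catalyst code; class name and defaults are assumptions:

```python
import random
from collections import Counter, defaultdict


class ToyDynamicBalanceSampler:
    """Toy sketch of the idea behind DynamicBalanceClassSampler:
    majority classes are shrunk a little more on every epoch until
    all classes are sampled (near-)uniformly. Not the library code."""

    def __init__(self, labels, exp_lambda: float = 0.9, start_epoch: int = 0):
        self.exp_lambda = exp_lambda
        self.epoch = start_epoch
        counts = Counter(labels)
        self.min_class_size = min(counts.values())
        # D_i = #C_i / #C_min: relative size of each class
        self.d = {c: n / self.min_class_size for c, n in counts.items()}
        self.label2idxes = defaultdict(list)
        for idx, label in enumerate(labels):
            self.label2idxes[label].append(idx)

    def __iter__(self):
        g = self.exp_lambda ** self.epoch  # exponential schedule g(n_epoch)
        inds = []
        for label, d_i in self.d.items():
            # current D_i = D_i ** g; majority classes shrink as g -> 0
            size = int(self.min_class_size * d_i ** g)
            size = min(size, len(self.label2idxes[label]))
            inds.extend(random.sample(self.label2idxes[label], size))
        self.epoch += 1
        random.shuffle(inds)
        return iter(inds)


labels = [0] * 90 + [1] * 10  # imbalanced toy dataset
sampler = ToyDynamicBalanceSampler(labels)
epoch0 = list(sampler)  # first epoch: distribution still imbalanced
```

In the real PR the sampler would typically be passed to a `torch.utils.data.DataLoader` via its `sampler` argument.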
btw, could you also please add this class to the docs? here - https://github.com/catalyst-team/catalyst/blob/master/docs/api/data.rst#samplers
but please keep the alphabetical order ;)
And is there any way to write tests to ensure sampler correctness?
I added a usage example and a test. I have a loop variable that is not used; I named it _epoch (alternatively _ or just i), but your code style complains about it. What is the correct way to name it?
let's try
@Dokholyan now it's your turn
@Scitator the code style checker complains about _
@Dokholyan nope, there is an error during the test
catalyst/data/tests/test_sampler.py
Outdated
    current_d = new_d


def test_dynamic_balance_class_sampler() -> None:
please check how it should be done ;)
https://github.com/catalyst-team/catalyst/blob/master/catalyst/data/tests/test_sampler.py#L116#L124
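The linked test checks samplers by asserting simple invariants on the produced indices; a sketch of that style (hypothetical helper, not the catalyst test itself):

```python
from collections import Counter


def check_sampler_output(indices, labels, expected_per_class) -> None:
    """Invariants such a sampler test usually asserts (a sketch,
    not the catalyst test itself): every index is valid and the
    per-class counts are as expected."""
    assert all(0 <= i < len(labels) for i in indices)
    assert Counter(labels[i] for i in indices) == Counter(expected_per_class)


labels = [0, 0, 0, 0, 1, 1]
# a perfectly balanced "epoch" over this toy dataset:
check_sampler_output([0, 1, 4, 5], labels, {0: 2, 1: 2})
```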
Ok )
Pull request has been modified.
@Scitator
Great job,
Please, do the following:
Pick a dataset with imbalance, or take MNIST and modify it to be imbalanced
Run N epochs and save the class distribution as images
Show us N histograms: the first with the original imbalance, the middle ones with less imbalance, and the last one uniform
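The requested histograms can be computed from the schedule alone. A stdlib-only sketch of the per-epoch class sizes one would plot (the formula follows the D_i ^ g(n_epoch) scheme discussed in this PR; the plotting call in the comment is an assumption, matplotlib is not part of the PR):

```python
from collections import Counter


def class_distribution(counts, exp_lambda: float, epoch: int):
    """Per-epoch class sizes under the exponential schedule:
    size_i = min_size * (count_i / min_size) ** (exp_lambda ** epoch).
    These are the values one would draw as a histogram per epoch."""
    min_size = min(counts.values())
    g = exp_lambda ** epoch
    return {c: int(min_size * (n / min_size) ** g) for c, n in counts.items()}


counts = Counter({0: 900, 1: 90, 2: 10})  # strongly imbalanced toy labels
for epoch in (0, 5, 50):
    dist = class_distribution(counts, exp_lambda=0.9, epoch=epoch)
    # plotting (assumed, not part of the PR), e.g.:
    # plt.bar(dist.keys(), dist.values()); plt.savefig(f"epoch_{epoch}.png")
```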
Pull request has been modified.
@@ -425,4 +566,5 @@ def __iter__(self):
    "MiniEpochSampler",
    "DistributedSamplerWrapper",
    "DynamicLenBatchSampler",
    "DynamicBalanceClassSampler",
could you please add it to catalyst/data/__init__.py?
catalyst/data/sampler.py
Outdated
>>> import torch
>>> import numpy as np

>>> from catalyst.data.sampler import DynamicBalanceClassSampler
Suggested change:
- >>> from catalyst.data.sampler import DynamicBalanceClassSampler
+ >>> from catalyst.data import DynamicBalanceClassSampler
catalyst/data/sampler.py
Outdated
    epoch: start epoch number can be useful for many stage experiments
    max_d: if not None, limit on the difference between the most
        frequent and the rarest classes, heuristic
    mode: if not None, it means the final class size in training.
does this mean the number of samples per class in the end? Why do we call it mode then? Or could we make it Union[str, int], so it could take the values "upsampling", "downsampling", or some specified number of samples?
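A sketch of how the suggested Union[str, int] mode could be resolved into a per-class size (hypothetical helper, not catalyst code; per the reply below, the PR ultimately added only "downsampling", so raising on "upsampling" is an assumption here):

```python
from typing import Optional, Union


def resolve_class_size(
    mode: Optional[Union[str, int]], min_class_size: int
) -> Optional[int]:
    """Sketch of how a Union[str, int] ``mode`` could be resolved into a
    target per-class size (hypothetical helper, not catalyst code)."""
    if mode is None:
        return None  # keep the scheduled per-class sizes
    if mode == "downsampling":
        return min_class_size  # shrink every class to the rarest one
    if isinstance(mode, int):
        return mode  # explicit number of samples per class
    raise ValueError(f"unsupported mode: {mode!r}")
```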
I have added "downsampling". "Upsampling" doesn't work clearly
catalyst/data/sampler.py
Outdated
    Args:
        labels: list of labels for each elem in the dataset
        exp_lambda: exponent figure for schedule
        epoch: start epoch number can be useful for many stage experiments
Suggested change:
- epoch: start epoch number can be useful for many stage experiments
+ epoch: start epoch number can be useful for multi-stage experiments
could we name it more conveniently? start_epoch? Or something else? Maybe @AlekseySh could also advise
Yes, "start_epoch" is much better
Pull request has been modified.
catalyst/data/sampler.py
Outdated
    self.min_class_size = min(list(samples_per_class.values()))

    if self.min_class_size < 100 and not ignore_warning:
        warnings.warn(
Could you please use logger.warning?
e.g. https://github.com/catalyst-team/catalyst/blob/master/catalyst/data/cv/__init__.py#L15
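The requested pattern, roughly (a sketch of the module-level-logger idiom the linked file uses; helper name and message are assumptions, not the PR's exact code):

```python
import logging

logger = logging.getLogger(__name__)


def warn_if_tiny(min_class_size: int, ignore_warning: bool = False) -> None:
    """Sketch of the requested pattern: a module-level logger instead of
    warnings.warn (hypothetical helper, not the PR's exact code)."""
    if min_class_size < 100 and not ignore_warning:
        logger.warning(
            "the smallest class contains only %d samples; "
            "the sampler statistics may be unreliable",
            min_class_size,
        )
```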
catalyst/data/sampler.py
Outdated
    Let's define D_i = #C_i / #C_min, where #C_i is the size of class i and
    #C_min is the size of the rarest class, so D_i defines the class
    distribution. Also define g(n_epoch) as an exponential scheduler.
    On each epoch, the current D_i is calculated as current D_i = D_i ^ g(n_epoch),
I'd appreciate it if you could use constructions like :math:`D_1` instead of D_1
@bagxi , please give me an example
catalyst/data/sampler.py
Outdated
    Note: In the end of the training, epochs will contain only
    min_size_class * n_classes examples. So, possible it will not necessary to
    do validation on each epoch. For this reason use ControlFlowCallback.
Suggested change:
- Note: In the end of the training, epochs will contain only
- min_size_class * n_classes examples. So, possible it will not necessary to
- do validation on each epoch. For this reason use ControlFlowCallback.
+ Notes:
+     In the end of the training, epochs will contain only
+     min_size_class * n_classes examples. So, possible it will not necessary to
+     do validation on each epoch. For this reason use ControlFlowCallback.
catalyst/data/sampler.py
Outdated
    min_size_class * n_classes examples. So, possible it will not necessary to
    do validation on each epoch. For this reason use ControlFlowCallback.

    Usage example:
Usage example: -> Examples:
catalyst/data/sampler.py
Outdated
    >>> for batch in loader:
    >>>     b_features, b_labels = batch

    Sampler was inspired by https://arxiv.org/pdf/1901.06783.pdf
Could you please check sphinx docs and fix this link?
catalyst/data/sampler.py
Outdated
    def __init__(
        self,
        labels: List[Union[int, str]],
        exp_lambda=0.9,
Please add typing (type annotations)
catalyst/data/sampler.py
Outdated
        labels: List[Union[int, str]],
        exp_lambda=0.9,
        start_epoch: int = 0,
        max_d: int = None,
Please use Optional
catalyst/data/sampler.py
Outdated
        labels: list of labels for each elem in the dataset
        exp_lambda: exponent figure for schedule
        start_epoch: start epoch number, can be useful for multi-stage
            experiments
Could you please fix indentation?
catalyst/data/sampler.py
Outdated
    self.epoch = start_epoch
    labels = np.array(labels)
    samples_per_class = Counter(labels)
    self.min_class_size = min(list(samples_per_class.values()))
Do we need to use list here?
You are right, it is useless
        for key, value in samples_per_class.items()
    }
    self.label2idxes = {
        label: np.arange(len(labels))[labels == label].tolist()
Maybe it will be better to use pure python instead of numpy + conversion to list?
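The pure-python variant the reviewer suggests could look like this (a sketch; label2idxes matches the attribute name used in the PR, and the toy labels are assumptions):

```python
from collections import defaultdict

labels = ["cat", "dog", "cat", "bird", "dog", "cat"]

# numpy version from the PR, roughly:
#   {label: np.arange(len(labels))[labels == label].tolist() for label in ...}
# pure-python equivalent, as suggested:
label2idxes = defaultdict(list)
for idx, label in enumerate(labels):
    label2idxes[label].append(idx)
```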
@bagxi This code is simply copied from BalanceClassSampler:
catalyst/catalyst/data/sampler.py, line 36 in dfd21c5:
self.lbl2idx = {
I am not sure about ":math:" in docstrings :(
Looks good
    In the end of the training, epochs will contain only
    min_size_class * n_classes examples. So, possible it will not
    necessary to do validation on each epoch. For this reason use
    ControlFlowCallback.
can we also add import path for this callback?
Before submitting
- Did you run catalyst-make-codestyle && catalyst-check-codestyle (pip install -U catalyst-codestyle)?
- Did you run make check-docs?

Description
Related Issue
Type of Change
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.