add DynamicBalanceClassSampler #954

Merged
merged 15 commits into from Nov 9, 2020

Conversation

Dokholyan
Contributor

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contribution guide?
  • Did you check the code style? catalyst-make-codestyle && catalyst-check-codestyle (pip install -U catalyst-codestyle).
  • Did you make sure to update the docs? We use Google format for all the methods and classes.
  • Did you check the docs with make check-docs?
  • Did you write any new necessary tests?
  • Did you add your new functionality to the docs?
  • Did you update the CHANGELOG?
  • You can use 'Login as guest' to see Teamcity build logs.

Description

Related Issue

Type of Change

  • Examples / docs / tutorials / contributors update
  • Bug fix (non-breaking change which fixes an issue)
  • Improvement (non-breaking change which improves an existing feature)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

@@ -195,6 +195,100 @@ def __iter__(self) -> Iterator[int]:
return iter(inds)


class DynamicBalanceClassSampler(Sampler):
Member

hi, @Dokholyan
Could you please provide a small example for this DynamicBalanceClassSampler usage?
for example, like here - https://github.com/catalyst-team/catalyst/blob/master/catalyst/data/sampler.py#L306L325

Member

btw, could you also please add this class to the docs? here - https://github.com/catalyst-team/catalyst/blob/master/docs/api/data.rst#samplers
but please keep the alphabetical order ;)

@Scitator
Member

Scitator commented Oct 7, 2020

And is there any way to write tests to ensure sampler correctness?
For example, something like https://github.com/catalyst-team/catalyst/blob/master/catalyst/data/tests/test_sampler.py#L67 ?

@Dokholyan
Contributor Author

I have added a usage example and a test.

I have a loop variable that is never used. I named it _epoch (alternatively _ or just i), but your code style check complains about it. What is the right way to handle this?)

@Scitator
Member

let's try _ :)
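
For context, a minimal sketch of the convention suggested here: a loop variable that is never read is commonly named `_`. The loop body and `n_epochs` are hypothetical and are not taken from the PR's test.

```python
# Hypothetical stand-in for a test loop that only needs to repeat N times.
n_epochs = 3
visited = 0

for _ in range(n_epochs):  # `_` signals that the loop index is intentionally unused
    visited += 1

print(visited)
```

Whether a given linter accepts `_` depends on its configuration, which this thread does not show.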

@Scitator
Member

@Dokholyan now it's your turn

@Dokholyan
Contributor Author

@Scitator the code style check complains about _

@Scitator
Member

@Dokholyan nope, there is an error during test
please check the Details for codestyle check

current_d = new_d


def test_dynamic_balance_class_sampler() -> None:
Member

Contributor Author

Ok )

@mergify mergify bot dismissed Scitator’s stale review October 15, 2020 17:03

Pull request has been modified.

@Dokholyan
Contributor Author

@Scitator
I am a little confused(
I updated the pull request to change the tests, and now I see failing checks in other parts of Catalyst (maybe master has changed). So I am not sure what I should do.

Contributor

@AlekseySh AlekseySh left a comment

Great job!
Please do the following:
  • Pick a dataset with class imbalance, or take MNIST and modify it to be imbalanced.
  • Run N epochs and save the class distribution of each epoch as an image.
  • Show us the N histograms: the first with imbalance, the second with less imbalance, and the last one uniform.
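
To make the requested progression concrete, here is a rough sketch of how per-class sizes could shrink toward a uniform distribution over epochs. It is not the PR's implementation. It assumes the scheduler has the form g(n_epoch) = exp_lambda ** n_epoch and that each class contributes min_class_size * D_i ** g(n_epoch) samples. The helper name `class_sizes_per_epoch` and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def class_sizes_per_epoch(
    labels: Sequence[Hashable], exp_lambda: float = 0.9, n_epochs: int = 5
) -> List[Dict[Hashable, int]]:
    """Return the assumed number of samples drawn per class for each epoch."""
    counts = Counter(labels)
    min_size = min(counts.values())
    # D_i = #C_i / #C_min, the imbalance ratio of class i
    ratios = {label: size / min_size for label, size in counts.items()}

    history = []
    for epoch in range(n_epochs):
        exponent = exp_lambda ** epoch  # assumed form of g(n_epoch)
        history.append(
            {label: round(min_size * d ** exponent) for label, d in ratios.items()}
        )
    return history


# Toy imbalanced dataset: 1000 / 100 / 10 samples for classes 0 / 1 / 2.
labels = [0] * 1000 + [1] * 100 + [2] * 10
history = class_sizes_per_epoch(labels, exp_lambda=0.5, n_epochs=6)
for epoch, sizes in enumerate(history):
    print(epoch, sizes)
```

Each dictionary in `history` could be drawn as one histogram. Under these assumptions the first epoch reproduces the original imbalance and later epochs approach the size of the rarest class.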

@mergify mergify bot dismissed AlekseySh’s stale review November 7, 2020 14:21

Pull request has been modified.

@@ -425,4 +566,5 @@ def __iter__(self):
"MiniEpochSampler",
"DistributedSamplerWrapper",
"DynamicLenBatchSampler",
"DynamicBalanceClassSampler",
Member

could you please add it to catalyst/data/__init__.py?

>>> import torch
>>> import numpy as np

>>> from catalyst.data.sampler import DynamicBalanceClassSampler
Member

Suggested change
>>> from catalyst.data.sampler import DynamicBalanceClassSampler
>>> from catalyst.data import DynamicBalanceClassSampler

epoch: start epoch number can be useful for many stage experiments
max_d: if not None, limit on the difference between the most
frequent and the rarest classes, heuristic
mode: if not None, it means the final class size in training.
Member

does this mean the number of samples per class in the end? why do we call it mode? or could we make it `Union[str, int]`, so it could take the values "upsampling", "downsampling", or some specified number of samples?

Contributor Author

I have added "downsampling". "Upsampling" doesn't work clearly
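
As an illustration of the `Union[str, int]` idea discussed above, the sketch below maps `mode` to a final per-class size. The helper `resolve_final_class_size` is hypothetical. It also assumes that "downsampling" means ending at the size of the rarest class, which the thread does not state outright.

```python
from typing import Union


def resolve_final_class_size(mode: Union[str, int], min_class_size: int) -> int:
    """Hypothetical helper: per-class size reached at the end of training."""
    if isinstance(mode, str):
        if mode == "downsampling":
            # Assumption: shrink every class to the size of the rarest one.
            return min_class_size
        raise ValueError(f"unsupported mode: {mode!r}")
    if isinstance(mode, int) and not isinstance(mode, bool) and mode > 0:
        # An explicit positive integer is taken as the target class size.
        return mode
    raise ValueError(f"mode must be 'downsampling' or a positive int, got {mode!r}")


print(resolve_final_class_size("downsampling", min_class_size=10))
print(resolve_final_class_size(50, min_class_size=10))
```

Rejecting unknown strings explicitly keeps an unsupported value such as "upsampling" from being silently ignored.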

Args:
labels: list of labels for each elem in the dataset
exp_lambda: exponent figure for schedule
epoch: start epoch number can be useful for many stage experiments
Member

Suggested change
epoch: start epoch number can be useful for many stage experiments
epoch: start epoch number can be useful for multi-stage experiments

Member

could we name it more conveniently? start_epoch? or something else? maybe @AlekseySh could also advise

Contributor Author

Yes, "start_epoch" is much better

@mergify mergify bot dismissed Scitator’s stale review November 8, 2020 11:09

Pull request has been modified.

bagxi
bagxi previously requested changes Nov 9, 2020
self.min_class_size = min(list(samples_per_class.values()))

if self.min_class_size < 100 and not ignore_warning:
warnings.warn(
Member

Let's define D_i = #C_i/ #C_min where #C_i is a size of class i and #C_min
is a size of the rarest class, so D_i define class distribution.
Also define g(n_epoch) is a exponential scheduler. On each epoch
current D_i calculated as current D_i = D_i ^ g(n_epoch),
Member

I'd appreciate it if you could use constructions like :math:D_1 instead of D_1

Contributor Author

@bagxi , please give me an example
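
For reference, a minimal sketch of the Sphinx `:math:` role in a docstring, using the definitions quoted in this thread. The function `describe_schedule` is a hypothetical placeholder. The raw-string prefix `r` is used so that LaTeX backslashes are not treated as Python escapes.

```python
def describe_schedule() -> None:
    r"""Hypothetical docstring showing the Sphinx ``:math:`` role.

    Let :math:`D_i = \#C_i / \#C_{min}`, where :math:`\#C_i` is the size of
    class :math:`i` and :math:`\#C_{min}` is the size of the rarest class,
    so :math:`D_i` defines the class distribution.
    On each epoch the current ratio is computed as
    :math:`\text{current } D_i = D_i^{g(n_{epoch})}`,
    where :math:`g(n_{epoch})` is an exponential scheduler.
    """


print(describe_schedule.__doc__)
```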

Comment on lines 210 to 212
Note: In the end of the training, epochs will contain only
min_size_class * n_classes examples. So, possible it will not necessary to
do validation on each epoch. For this reason use ControlFlowCallback.
Member

Suggested change
Note: In the end of the training, epochs will contain only
min_size_class * n_classes examples. So, possible it will not necessary to
do validation on each epoch. For this reason use ControlFlowCallback.
Notes:
In the end of the training, epochs will contain only
min_size_class * n_classes examples. So, possible it will not necessary to
do validation on each epoch. For this reason use ControlFlowCallback.

min_size_class * n_classes examples. So, possible it will not necessary to
do validation on each epoch. For this reason use ControlFlowCallback.

Usage example:
Member

Usage example: -> Examples:

>>> for batch in loader:
>>> b_features, b_labels = batch

Sampler was inspired by https://arxiv.org/pdf/1901.06783.pdf
Member

Could you please check sphinx docs and fix this link?

def __init__(
self,
labels: List[Union[int, str]],
exp_lambda=0.9,
Member

typing

labels: List[Union[int, str]],
exp_lambda=0.9,
start_epoch: int = 0,
max_d: int = None,
Member

Please use Optional
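
The two typing remarks on this signature can be sketched as follows. The class `SamplerSignatureSketch` is a hypothetical stub that only illustrates the annotations and is not the merged code. The parameter names and defaults are the ones shown in the diff context above.

```python
from typing import List, Optional, Union


class SamplerSignatureSketch:
    """Hypothetical stub illustrating the requested annotations only."""

    def __init__(
        self,
        labels: List[Union[int, str]],
        exp_lambda: float = 0.9,  # explicit type instead of a bare default
        start_epoch: int = 0,
        max_d: Optional[int] = None,  # Optional makes the None default explicit
    ) -> None:
        self.labels = labels
        self.exp_lambda = exp_lambda
        self.start_epoch = start_epoch
        self.max_d = max_d


sketch = SamplerSignatureSketch(labels=[0, 1, 1])
print(sketch.max_d)
```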

labels: list of labels for each elem in the dataset
exp_lambda: exponent figure for schedule
start_epoch: start epoch number, can be useful for multi-stage
experiments
Member

Could you please fix indentation?

self.epoch = start_epoch
labels = np.array(labels)
samples_per_class = Counter(labels)
self.min_class_size = min(list(samples_per_class.values()))
Member

Do we need to use list here?

Contributor Author

You are right, it is useless
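
A short illustration of the point agreed on here: `min` accepts any iterable, so the view returned by `Counter.values()` can be passed directly. The toy labels are made up for the example.

```python
from collections import Counter

labels = ["cat", "dog", "dog", "bird", "dog", "cat"]
samples_per_class = Counter(labels)

# No intermediate list is needed; min() iterates over the values view.
min_class_size = min(samples_per_class.values())
print(min_class_size)  # 1 ("bird" appears once)
```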

for key, value in samples_per_class.items()
}
self.label2idxes = {
label: np.arange(len(labels))[labels == label].tolist()
Member

Maybe it will be better to use pure python instead of numpy + conversion to list?

Contributor Author

@Dokholyan Dokholyan Nov 9, 2020

@bagxi This code is simply copied from BalanceClassSampler

self.lbl2idx = {
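
The pure-Python alternative raised in this thread could look roughly like the sketch below. The function name `build_label2idxes` is hypothetical, and this is not the code from either sampler. It builds the same label-to-indices mapping in a single pass, without NumPy and without a list conversion.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def build_label2idxes(labels: Sequence[Hashable]) -> Dict[Hashable, List[int]]:
    """Map each label to the sorted list of dataset indices carrying it."""
    label2idxes: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        label2idxes[label].append(idx)  # indices arrive in increasing order
    return dict(label2idxes)


print(build_label2idxes([0, 1, 0, 2, 1]))
# {0: [0, 2], 1: [1, 4], 2: [3]}
```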

@mergify mergify bot dismissed bagxi’s stale review November 9, 2020 14:25

Pull request has been modified.

@Dokholyan
Contributor Author

I am not sure about ":math:" in docstrings(

@Scitator Scitator merged commit 94987c7 into catalyst-team:master Nov 9, 2020
Contributor

@AlekseySh AlekseySh left a comment

Looks good

In the end of the training, epochs will contain only
min_size_class * n_classes examples. So, possible it will not
necessary to do validation on each epoch. For this reason use
ControlFlowCallback.
Contributor

can we also add the import path for this callback?

4 participants