add DynamicBalanceClassSampler #954
Conversation
@@ -195,6 +195,100 @@ def __iter__(self) -> Iterator[int]:
        return iter(inds)


class DynamicBalanceClassSampler(Sampler):
hi, @Dokholyan
Could you please provide a small example for this DynamicBalanceClassSampler
usage?
for example, like here - https://github.com/catalyst-team/catalyst/blob/master/catalyst/data/sampler.py#L306L325
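For readers outside the thread, the kind of usage example being requested might look like the following. This is a stdlib-only toy re-implementation of the sampler's idea (exponentially decaying class imbalance), not the actual catalyst code; class name and defaults are assumptions:

```python
import random
from collections import Counter, defaultdict


class ToyDynamicBalanceSampler:
    """Toy sketch of the idea behind DynamicBalanceClassSampler:
    majority classes are shrunk a little more on every epoch until
    all classes are sampled (near-)uniformly. Not the library code."""

    def __init__(self, labels, exp_lambda: float = 0.9, start_epoch: int = 0):
        self.exp_lambda = exp_lambda
        self.epoch = start_epoch
        counts = Counter(labels)
        self.min_class_size = min(counts.values())
        # D_i = #C_i / #C_min: relative size of each class
        self.d = {c: n / self.min_class_size for c, n in counts.items()}
        self.label2idxes = defaultdict(list)
        for idx, label in enumerate(labels):
            self.label2idxes[label].append(idx)

    def __iter__(self):
        g = self.exp_lambda ** self.epoch  # exponential schedule g(n_epoch)
        inds = []
        for label, d_i in self.d.items():
            # current D_i = D_i ** g; majority classes shrink as g -> 0
            size = int(self.min_class_size * d_i ** g)
            size = min(size, len(self.label2idxes[label]))
            inds.extend(random.sample(self.label2idxes[label], size))
        self.epoch += 1
        random.shuffle(inds)
        return iter(inds)


labels = [0] * 90 + [1] * 10  # imbalanced toy dataset
sampler = ToyDynamicBalanceSampler(labels)
epoch0 = list(sampler)  # first epoch: distribution still imbalanced
```

In the real PR the sampler would typically be passed to a `torch.utils.data.DataLoader` via its `sampler` argument.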
btw, could you also please add this class to the docs? here - https://github.com/catalyst-team/catalyst/blob/master/docs/api/data.rst#samplers
but please keep the alphabetical order ;)
And is there any way to write tests to ensure sampler correctness?
I added a usage example and a test. I have a loop variable that is not used; I named it _epoch (alternatively _ or just i), but your code style complains about it. What is the correct way to name it?
let's try
@Dokholyan now it's your turn
@Scitator the code style checker complains about _
@Dokholyan nope, there is an error during the test
catalyst/data/tests/test_sampler.py
Outdated
    current_d = new_d


def test_dynamic_balance_class_sampler() -> None:
please check how it should be done ;)
https://github.com/catalyst-team/catalyst/blob/master/catalyst/data/tests/test_sampler.py#L116#L124
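The linked test checks samplers by asserting simple invariants on the produced indices; a sketch of that style (hypothetical helper, not the catalyst test itself):

```python
from collections import Counter


def check_sampler_output(indices, labels, expected_per_class) -> None:
    """Invariants such a sampler test usually asserts (a sketch,
    not the catalyst test itself): every index is valid and the
    per-class counts are as expected."""
    assert all(0 <= i < len(labels) for i in indices)
    assert Counter(labels[i] for i in indices) == Counter(expected_per_class)


labels = [0, 0, 0, 0, 1, 1]
# a perfectly balanced "epoch" over this toy dataset:
check_sampler_output([0, 1, 4, 5], labels, {0: 2, 1: 2})
```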
Ok )
Pull request has been modified.
@Scitator
Great job,
Please, do the following:
Pick a dataset with imbalance, or take MNIST and modify it to be imbalanced
Run N epochs and save the class distribution as images
Show us N histograms: the first with the original imbalance, the middle ones with less imbalance, and the last one uniform
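The requested histograms can be computed from the schedule alone. A stdlib-only sketch of the per-epoch class sizes one would plot (the formula follows the D_i ^ g(n_epoch) scheme discussed in this PR; the plotting call in the comment is an assumption, matplotlib is not part of the PR):

```python
from collections import Counter


def class_distribution(counts, exp_lambda: float, epoch: int):
    """Per-epoch class sizes under the exponential schedule:
    size_i = min_size * (count_i / min_size) ** (exp_lambda ** epoch).
    These are the values one would draw as a histogram per epoch."""
    min_size = min(counts.values())
    g = exp_lambda ** epoch
    return {c: int(min_size * (n / min_size) ** g) for c, n in counts.items()}


counts = Counter({0: 900, 1: 90, 2: 10})  # strongly imbalanced toy labels
for epoch in (0, 5, 50):
    dist = class_distribution(counts, exp_lambda=0.9, epoch=epoch)
    # plotting (assumed, not part of the PR), e.g.:
    # plt.bar(dist.keys(), dist.values()); plt.savefig(f"epoch_{epoch}.png")
```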
Pull request has been modified.
@@ -425,4 +566,5 @@ def __iter__(self):
    "MiniEpochSampler",
    "DistributedSamplerWrapper",
    "DynamicLenBatchSampler",
    "DynamicBalanceClassSampler",
could you please add it to catalyst/data/__init__.py?
catalyst/data/sampler.py
Outdated
>>> import torch
>>> import numpy as np

>>> from catalyst.data.sampler import DynamicBalanceClassSampler
Suggested change:
- >>> from catalyst.data.sampler import DynamicBalanceClassSampler
+ >>> from catalyst.data import DynamicBalanceClassSampler
catalyst/data/sampler.py
Outdated
    epoch: start epoch number can be useful for many stage experiments
    max_d: if not None, limit on the difference between the most
        frequent and the rarest classes, heuristic
    mode: if not None, it means the final class size in training.
does this mean the number of samples per class in the end? Why do we call it mode then? Or could we make it Union[str, int], so it could take the values "upsampling", "downsampling", or some specified number of samples?
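A sketch of how the suggested Union[str, int] mode could be resolved into a per-class size (hypothetical helper, not catalyst code; per the reply below, the PR ultimately added only "downsampling", so raising on "upsampling" is an assumption here):

```python
from typing import Optional, Union


def resolve_class_size(
    mode: Optional[Union[str, int]], min_class_size: int
) -> Optional[int]:
    """Sketch of how a Union[str, int] ``mode`` could be resolved into a
    target per-class size (hypothetical helper, not catalyst code)."""
    if mode is None:
        return None  # keep the scheduled per-class sizes
    if mode == "downsampling":
        return min_class_size  # shrink every class to the rarest one
    if isinstance(mode, int):
        return mode  # explicit number of samples per class
    raise ValueError(f"unsupported mode: {mode!r}")
```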
I have added "downsampling". "Upsampling" doesn't work clearly
catalyst/data/sampler.py
Outdated
    Args:
        labels: list of labels for each elem in the dataset
        exp_lambda: exponent figure for schedule
        epoch: start epoch number can be useful for many stage experiments
Suggested change:
- epoch: start epoch number can be useful for many stage experiments
+ epoch: start epoch number can be useful for multi-stage experiments
could we name it more conveniently? start_epoch? Or something else? Maybe @AlekseySh could also advise
Yes, "start_epoch" is much better
Pull request has been modified.
catalyst/data/sampler.py
Outdated
    self.min_class_size = min(list(samples_per_class.values()))

    if self.min_class_size < 100 and not ignore_warning:
        warnings.warn(
Could you please use logger.warning?
e.g. https://github.com/catalyst-team/catalyst/blob/master/catalyst/data/cv/__init__.py#L15
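The requested pattern, roughly (a sketch of the module-level-logger idiom the linked file uses; helper name and message are assumptions, not the PR's exact code):

```python
import logging

logger = logging.getLogger(__name__)


def warn_if_tiny(min_class_size: int, ignore_warning: bool = False) -> None:
    """Sketch of the requested pattern: a module-level logger instead of
    warnings.warn (hypothetical helper, not the PR's exact code)."""
    if min_class_size < 100 and not ignore_warning:
        logger.warning(
            "the smallest class contains only %d samples; "
            "the sampler statistics may be unreliable",
            min_class_size,
        )
```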
catalyst/data/sampler.py
Outdated
    Let's define D_i = #C_i / #C_min, where #C_i is the size of class i and
    #C_min is the size of the rarest class, so D_i defines the class
    distribution. Also define g(n_epoch) as an exponential scheduler.
    On each epoch, the current D_i is calculated as current D_i = D_i ^ g(n_epoch),
I'd appreciate it if you could use constructions like :math:`D_1` instead of D_1
@bagxi , please give me an example
catalyst/data/sampler.py
Outdated
    Note: In the end of the training, epochs will contain only
    min_size_class * n_classes examples. So, possible it will not necessary to
    do validation on each epoch. For this reason use ControlFlowCallback.
Suggested change:
- Note: In the end of the training, epochs will contain only
- min_size_class * n_classes examples. So, possible it will not necessary to
- do validation on each epoch. For this reason use ControlFlowCallback.
+ Notes:
+     In the end of the training, epochs will contain only
+     min_size_class * n_classes examples. So, possible it will not necessary to
+     do validation on each epoch. For this reason use ControlFlowCallback.
catalyst/data/sampler.py
Outdated
    min_size_class * n_classes examples. So, possible it will not necessary to
    do validation on each epoch. For this reason use ControlFlowCallback.

    Usage example:
Usage example: -> Examples:
catalyst/data/sampler.py
Outdated
    >>> for batch in loader:
    >>>     b_features, b_labels = batch

    Sampler was inspired by https://arxiv.org/pdf/1901.06783.pdf
Could you please check sphinx docs and fix this link?
catalyst/data/sampler.py
Outdated
    def __init__(
        self,
        labels: List[Union[int, str]],
        exp_lambda=0.9,
Please add typing (type annotations)
catalyst/data/sampler.py
Outdated
        labels: List[Union[int, str]],
        exp_lambda=0.9,
        start_epoch: int = 0,
        max_d: int = None,
Please use Optional
catalyst/data/sampler.py
Outdated
        labels: list of labels for each elem in the dataset
        exp_lambda: exponent figure for schedule
        start_epoch: start epoch number, can be useful for multi-stage
            experiments
Could you please fix indentation?
catalyst/data/sampler.py
Outdated
    self.epoch = start_epoch
    labels = np.array(labels)
    samples_per_class = Counter(labels)
    self.min_class_size = min(list(samples_per_class.values()))
Do we need to use list here?
You are right, it is useless
        for key, value in samples_per_class.items()
    }
    self.label2idxes = {
        label: np.arange(len(labels))[labels == label].tolist()
Maybe it will be better to use pure python instead of numpy + conversion to list?
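The pure-python variant the reviewer suggests could look like this (a sketch; label2idxes matches the attribute name used in the PR, and the toy labels are assumptions):

```python
from collections import defaultdict

labels = ["cat", "dog", "cat", "bird", "dog", "cat"]

# numpy version from the PR, roughly:
#   {label: np.arange(len(labels))[labels == label].tolist() for label in ...}
# pure-python equivalent, as suggested:
label2idxes = defaultdict(list)
for idx, label in enumerate(labels):
    label2idxes[label].append(idx)
```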
@bagxi This code is simply copied from BalanceClassSampler:
catalyst/catalyst/data/sampler.py, line 36 in dfd21c5:
self.lbl2idx = {
I am not sure about ":math:" in docstrings :(
Looks good
    In the end of the training, epochs will contain only
    min_size_class * n_classes examples. So, possible it will not
    necessary to do validation on each epoch. For this reason use
    ControlFlowCallback.
can we also add import path for this callback?
Before submitting
- Did you run catalyst-make-codestyle && catalyst-check-codestyle (pip install -U catalyst-codestyle)?
- Did you run make check-docs?

Description
Related Issue
Type of Change
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.