Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

03 cifar pipeline #9

Merged
merged 16 commits into from Dec 15, 2022
Merged

03 cifar pipeline #9

merged 16 commits into from Dec 15, 2022

Conversation

philswatton
Copy link
Contributor

This pull request adds:

  • PyTorch, torchvision dependencies for macos (assumes M1). Raised issue System Dependencies for PyTorch (or TensorFlow) #8 to highlight this
  • scikit-learn dependency
  • a slightly adjusted CIFAR-10 datamodule. this adds some functionality to the core pl_bolts one:
    • 'drop' parameter in class init method
    • dataloader will use drop to drop that % from each class.
    • a function allowing the drop parameter to be easily modified.

Copy link
Contributor

@jack89roberts jack89roberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I've added a couple of comments, mostly minor Python-style things.

And a general ARC code-style thing to discuss separately: Maybe we should have recommended ARC styles for docstrings and type hints? E.g. I like type-hints in function definitions but personally tend not to go as far as things like *args: Any or -> None.

src/modsim2/data/loader.py Outdated Show resolved Hide resolved
src/modsim2/data/loader.py Outdated Show resolved Hide resolved
Comment on lines +77 to +82
index, _ = train_test_split(
self.dataset_train.indices,
test_size=self.drop,
stratify=labels,
random_state=self.seed,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you likely to need the indices of the dropped rows for anything later? You could figure it out from the remaining ones that are kept but might make your life easier to store them here somehow if you might need them.

Also, if this function will be called multiple times from the same script you'd need to be wary that the random seed will always have the same value (unless you change self.seed between calls).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we will need them later - we just need a certain % to be dropped via the dataloader in a consistent way

I did the seed bit intentionally - the idea is that given a certain seed, you can create the same datasets A and B but then set one to drop a % of the training data. My understanding in this case is we would want that stage to then be replicable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely want them to be replicable 🙂

It was more a general comment than a suggestion to change anything, just that doing the below will give identical A and B, so in a situation where you want A and B to have different indices dropped you'd need to create a new instance of the class (or change the implementation):

cifar = CIFAR10DataModulePlus(drop=0.2, seed=123)
A = cifar.train_dataloader()
B = cifar.train_dataloader()
# A and B are identical

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the next case to implement and I guess will require either a new class or refactoring of this one. Here though the idea is to have A and B be identical except for dropping some examples from one but not both of them

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm ok with A and B being separate instances of CIFAR10DataModulePlus rather than separate calls to the train_dataloder.

A = CIFAR10DataModulePlus(drop=0.0, seed=123) # could just use CIFAR10DataModule, of course!
B = CIFAR10DataModulePlus(drop=0.2, seed=123)

It wouldn't be the most efficient if you needed to hold A and B in memory together but you should only need one at a time for our experiments.

@jack89roberts
Copy link
Contributor

Also, maybe a good opportunity to try to add a test for this?

philswatton and others added 2 commits December 12, 2022 14:02
@philswatton
Copy link
Contributor Author

I've pushed a couple more commits that:

  • Change the class name to be more informative
  • Removed the superflous method to change the drop%

I've also accepted the suggested move of comment into docstring

@philswatton
Copy link
Contributor Author

I've further added a test folder with a simple test for the dataloader. I'll think about some more tests to add tomorrow

Copy link
Contributor

@jack89roberts jack89roberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! That test is probably enough for now IMO. Being picky, Python style convention would be to call the function test_drop_loader rather than testDropLoader.

Copy link
Contributor

@lannelin lannelin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, Phil, really clean and a nice size of PR to review. Great stuff on the typing too :D

A few minor points in comments

Also, not strictly this project but could you port over appropriate bits of the README from https://github.com/alan-turing-institute/ARC-project-template ?

Comment on lines +77 to +82
index, _ = train_test_split(
self.dataset_train.indices,
test_size=self.drop,
stratify=labels,
random_state=self.seed,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm ok with A and B being separate instances of CIFAR10DataModulePlus rather than separate calls to the train_dataloder.

A = CIFAR10DataModulePlus(drop=0.0, seed=123) # could just use CIFAR10DataModule, of course!
B = CIFAR10DataModulePlus(drop=0.2, seed=123)

It wouldn't be the most efficient if you needed to hold A and B in memory together but you should only need one at a time for our experiments.

pyproject.toml Show resolved Hide resolved
tests/test_data.py Show resolved Hide resolved
tests/test_data.py Outdated Show resolved Hide resolved
@lannelin
Copy link
Contributor

lannelin commented Dec 14, 2022

Nice! I've added a couple of comments, mostly minor Python-style things.

And a general ARC code-style thing to discuss separately: Maybe we should have recommended ARC styles for docstrings and type hints? E.g. I like type-hints in function definitions but personally tend not to go as far as things like *args: Any or -> None.

Agree @jack89roberts, we should have a recommendation for style here. I'm personally a big fan of typing and like the clarity of -> None etc. I'm more relaxed on docstrings if we have type hints and happy just to use them for anything that's not obvious from the hints or to just give a general explanation for a function with behavior isn't immediately obvious from the name.

Let's discuss at next team meeting how we do style guide stuff

Copy link
Contributor

@lannelin lannelin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. One very minor comment but happy to be merged in after that!

tests/test_data.py Show resolved Hide resolved
@philswatton philswatton merged commit 1765c68 into develop Dec 15, 2022
@philswatton philswatton deleted the 03_cifar_pipeline branch December 15, 2022 10:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants