Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document datasets #3060

Merged
merged 33 commits into from
Jun 4, 2024
Merged

Document datasets #3060

merged 33 commits into from
Jun 4, 2024

Conversation

flying-sheep
Copy link
Member

@flying-sheep flying-sheep commented May 14, 2024

  • Release notes not necessary because:

TODO:

  • release notes
  • some added text explaining things
  • run internet tests, implement caching for datasets

Optional:

  1. continue to not run the internet tests in CI. A side effect of this PR is that our tests get less flaky by not running the flaky ebi_expression_atlas doctest
  2. run internet tests in CI
    1. add caching to CI
    2. make sure the dataset functions don’t download already-downloaded data
    3. validate cached data instead
    4. run the internet tests (with caching) in CI

Rendered

Copy link

codecov bot commented May 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.35%. Comparing base (8d046ff) to head (0cb201f).

Current head 0cb201f differs from pull request most recent head 0caa293

Please upload reports for the commit 0caa293 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3060      +/-   ##
==========================================
+ Coverage   75.87%   76.35%   +0.47%     
==========================================
  Files         110      110              
  Lines       12536    12545       +9     
==========================================
+ Hits         9512     9579      +67     
+ Misses       3024     2966      -58     
Files Coverage Δ
scanpy/_utils/_doctests.py 94.73% <100.00%> (+0.98%) ⬆️
scanpy/datasets/_datasets.py 100.00% <100.00%> (+13.95%) ⬆️
scanpy/datasets/_ebi_expression_atlas.py 92.94% <100.00%> (+2.57%) ⬆️
scanpy/datasets/_utils.py 100.00% <100.00%> (ø)
scanpy/preprocessing/_deprecated/__init__.py 89.47% <ø> (ø)
...preprocessing/_deprecated/highly_variable_genes.py 95.40% <ø> (ø)

... and 1 file with indirect coverage changes

Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially important is if its .X is logarithmized, normalized, and/or filtered

Are we documenting here which of these have counts vs log vs normalized?

continue to not run the internet tests in CI. A side effect of this PR is that our tests get less flaky by not running the flaky ebi_expression_atlas doctest

What was stopping this before?

    run internet tests in CI
        add caching to CI
        make sure the dataset functions don’t download already-downloaded data
        validate cached data instead
        run the internet tests (with caching) in CI

why wouldn't we want to download the data everytime? I could see it slowing things down a bit but not so much. We should at least have some sort of cache timeout so that it forces re-download every so often to ensure that aspect of things still works

@flying-sheep
Copy link
Member Author

flying-sheep commented May 17, 2024

Are we documenting here which of these have counts vs log vs normalized?

yeah, I’d like to do that! It’s really not bad

hatch test --internet-tests scanpy/tests/test_datasets.py::test_doc_shape scanpy/datasets/
[...]du -a .pytest_cache/d/scanpy-data/ | reject directories files apparent
╭───┬──────────────────────────────────────────────────────────────────────┬──────────╮
│ # │                                 path                                 │ physical │
├───┼──────────────────────────────────────────────────────────────────────┼──────────┤
│ 0 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data │ 199.6 MB │
╰───┴──────────────────────────────────────────────────────────────────────┴──────────╯du -a .pytest_cache/d/scanpy-data/* | reject directories files apparent
╭───┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────╮
│ # │                                                        path                                                        │ physical │
├───┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────┤
│ 0 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/E-MTAB-4888                                   │  71.1 MB │
│ 1 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/Targeted_Visium_Human_Glioblastoma_Pan_Cancer │  19.7 MB │
│ 2 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/V1_Breast_Cancer_Block_A_Section_1            │  48.3 MB │
│ 3 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/burczynski06                                  │  16.3 MB │
│ 4 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/moignard15                                    │   3.4 MB │
│ 5 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/paul15                                        │  10.3 MB │
│ 6 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/pbmc3k_processed.h5ad                         │  24.7 MB │
│ 7 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/pbmc3k_raw.h5ad                               │   5.9 MB │
╰───┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────╯

What was stopping this before? […] why wouldn't we want to download the data everytime?

someone implementing the caching, so nothing much really

@flying-sheep flying-sheep marked this pull request as draft May 17, 2024 15:58
@flying-sheep flying-sheep marked this pull request as ready for review June 4, 2024 11:21
Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing test seems to be coming from #3068

No blockers

@pytest.mark.internet
def test_visium_datasets_dir_change(tmp_path: Path):
"""Test that changing the dataset dir doesn't break reading."""
with pytest.warns(UserWarning, match=r"Variable names are not unique"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use here (and elsewhere) the r prefix? It seems unnecessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify that match accepts regexes. If the r wasn’t there, it would be easy to accidentally add a backslash escape that’s intended for re and have Python do things with it instead.

scanpy/datasets/_datasets.py Outdated Show resolved Hide resolved
@flying-sheep flying-sheep merged commit 4f40d68 into main Jun 4, 2024
4 of 12 checks passed
@flying-sheep flying-sheep deleted the doc-datasets branch June 4, 2024 13:36
meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request Jun 4, 2024
flying-sheep added a commit that referenced this pull request Jun 4, 2024
Co-authored-by: Philipp A <flying-sheep@web.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More complete dataset documentation
2 participants