Document datasets #3060

flying-sheep · 2024-05-14T14:54:14Z

Closes #3051 (More complete dataset documentation)
Tests included or not required because:

Release notes not necessary because:

TODO:

release notes
some added text explaining things
run internet tests, implement caching for datasets

Optional:

continue to not run the internet tests in CI. A side effect of this PR is that our tests get less flaky by not running the flaky ebi_expression_atlas doctest
run internet tests in CI
1. add caching to CI
2. make sure the dataset functions don’t download already-downloaded data
3. validate cached data instead
4. run the internet tests (with caching) in CI
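Steps 2 and 3 of the list above could be sketched roughly as follows. This is a minimal illustration and not scanpy's implementation: `ensure_dataset`, `sha256_of`, and the `download` callback are hypothetical names, and checking a SHA-256 checksum is only one assumed way to "validate cached data".

```python
import hashlib
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def ensure_dataset(
    path: Path,
    expected_sha256: str,
    download: Callable[[Path], None],
) -> bool:
    """Make sure a valid copy of the dataset exists at `path`.

    Returns True if a download happened, False if the cache was reused.
    `download` is a hypothetical callback that writes the file to `path`.
    """
    # Reuse the cached file only if it exists and passes validation.
    if path.is_file() and sha256_of(path) == expected_sha256:
        return False
    # Missing or corrupted cache entry: fetch a fresh copy.
    path.parent.mkdir(parents=True, exist_ok=True)
    download(path)
    # Validate the fresh copy too, so a bad download fails loudly.
    if sha256_of(path) != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}")
    return True
```

With a helper of this shape, a CI cache hit costs one hash computation per file instead of a network request, while a corrupted cache entry is still replaced.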

Rendered

codecov · 2024-05-14T15:29:29Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.35%. Comparing base (8d046ff) to head (0cb201f).

❗ Current head 0cb201f differs from pull request most recent head 0caa293

Please upload reports for the commit 0caa293 to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3060      +/-   ##
==========================================
+ Coverage   75.87%   76.35%   +0.47%     
==========================================
  Files         110      110              
  Lines       12536    12545       +9     
==========================================
+ Hits         9512     9579      +67     
+ Misses       3024     2966      -58

Files	Coverage Δ
scanpy/_utils/_doctests.py	`94.73% <100.00%> (+0.98%)`	⬆️
scanpy/datasets/_datasets.py	`100.00% <100.00%> (+13.95%)`	⬆️
scanpy/datasets/_ebi_expression_atlas.py	`92.94% <100.00%> (+2.57%)`	⬆️
scanpy/datasets/_utils.py	`100.00% <100.00%> (ø)`
scanpy/preprocessing/_deprecated/__init__.py	`89.47% <ø> (ø)`
...preprocessing/_deprecated/highly_variable_genes.py	`95.40% <ø> (ø)`

... and 1 file with indirect coverage changes

ilan-gold

Especially important is if its .X is logarithmized, normalized, and/or filtered

Are we documenting here which of these have counts vs log vs normalized?

continue to not run the internet tests in CI. A side effect of this PR is that our tests get less flaky by not running the flaky ebi_expression_atlas doctest

What was stopping this before?

    run internet tests in CI
        add caching to CI
        make sure the dataset functions don’t download already-downloaded data
        validate cached data instead
        run the internet tests (with caching) in CI

Why wouldn't we want to download the data every time? I could see it slowing things down a bit, but not by much. We should at least have some sort of cache timeout that forces a re-download every so often, to ensure that the download path still works.
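A cache timeout of the kind suggested here might look like the following sketch. It assumes the cache entry's modification time is a good enough proxy for when it was downloaded; `is_stale` and its parameters are hypothetical names, not part of scanpy or pytest.

```python
import time
from pathlib import Path
from typing import Optional


def is_stale(path: Path, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """Return True if `path` is missing or older than `max_age_seconds`.

    `now` can be passed explicitly (seconds since the epoch) to make the
    check deterministic in tests; it defaults to the current time.
    """
    if not path.is_file():
        return True
    current = time.time() if now is None else now
    # st_mtime is the file's last modification time in seconds since the epoch.
    return current - path.stat().st_mtime > max_age_seconds
```

A test fixture could call such a check before reusing a cached dataset and trigger a fresh download whenever it returns True, so the download code is still exercised periodically.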

flying-sheep · 2024-05-17T10:30:59Z

Are we documenting here which of these have counts vs log vs normalized?

Yeah, I’d like to do that! As for the download size, it’s really not bad:

❯ hatch test --internet-tests scanpy/tests/test_datasets.py::test_doc_shape scanpy/datasets/
[...]

❯ du -a .pytest_cache/d/scanpy-data/ | reject directories files apparent
╭───┬──────────────────────────────────────────────────────────────────────┬──────────╮
│ # │                                 path                                 │ physical │
├───┼──────────────────────────────────────────────────────────────────────┼──────────┤
│ 0 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data │ 199.6 MB │
╰───┴──────────────────────────────────────────────────────────────────────┴──────────╯


❯ du -a .pytest_cache/d/scanpy-data/* | reject directories files apparent
╭───┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────╮
│ # │                                                        path                                                        │ physical │
├───┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────┤
│ 0 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/E-MTAB-4888                                   │  71.1 MB │
│ 1 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/Targeted_Visium_Human_Glioblastoma_Pan_Cancer │  19.7 MB │
│ 2 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/V1_Breast_Cancer_Block_A_Section_1            │  48.3 MB │
│ 3 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/burczynski06                                  │  16.3 MB │
│ 4 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/moignard15                                    │   3.4 MB │
│ 5 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/paul15                                        │  10.3 MB │
│ 6 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/pbmc3k_processed.h5ad                         │  24.7 MB │
│ 7 │ /home/phil/Dev/Python/Single Cell/scanpy/.pytest_cache/d/scanpy-data/pbmc3k_raw.h5ad                               │   5.9 MB │
╰───┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────╯

What was stopping this before? […] Why wouldn't we want to download the data every time?

Nobody had implemented the caching yet, so nothing much, really.

ilan-gold

Failing test seems to be coming from #3068

No blockers

ilan-gold · 2024-06-04T12:38:10Z

scanpy/tests/test_datasets.py

+@pytest.mark.internet
+def test_visium_datasets_dir_change(tmp_path: Path):
+    """Test that changing the dataset dir doesn't break reading."""
+    with pytest.warns(UserWarning, match=r"Variable names are not unique"):


Why use here (and elsewhere) the r prefix? It seems unnecessary.

To clarify that match accepts regexes. Without the r prefix, it would be easy to accidentally add a backslash escape intended for re and have Python interpret it as a string escape instead.
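The difference can be seen with a small standard-library sketch. It uses `re.search` directly as a stand-in for the regex matching that `pytest.warns(..., match=...)` performs; the `\b` patterns are made up for illustration and do not appear in the PR.

```python
import re

message = "Variable names are not unique"

# Raw string: the backslash reaches `re`, where \b means "word boundary".
raw_pattern = r"\bunique\b"
# Plain string: Python converts \b into a backspace character (U+0008)
# before `re` ever sees the pattern.
plain_pattern = "\bunique\b"

print(len(raw_pattern), len(plain_pattern))  # 10 8
print(re.search(raw_pattern, message) is not None)  # True
print(re.search(plain_pattern, message) is not None)  # False
```

For a pattern without backslashes, such as the one in the test above, both spellings produce the same string, so the prefix is a safeguard rather than a functional change.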

scanpy/datasets/_datasets.py

Co-authored-by: Philipp A <flying-sheep@web.de>

flying-sheep added 8 commits May 14, 2024 13:28

Simplify dataset code and docs

5253fe0

remove weird references

c541e15

Fix test warnings

6a02822

Add some failing tests

e0e13f3

Merge branch 'main' into doc-datasets

d85b718

Add non-internet tests

e40135f

warnings

489b062

add repr of all datasets

a948527

flying-sheep added the Area - Documentation 📒 label May 14, 2024

flying-sheep added this to the 1.10.2 milestone May 14, 2024

flying-sheep added 3 commits May 14, 2024 17:02

Fix up EBI expr atlas too

1c2ea16

better doctest

f9cef37

Link fixes

3fbead3

flying-sheep requested a review from ilan-gold May 14, 2024 15:22

ilan-gold reviewed May 15, 2024

flying-sheep added 3 commits May 17, 2024 12:26

Merge branch 'main' into doc-datasets

7bb0baa

relnotes

6876946

type-only

1d263bd

flying-sheep added 4 commits May 17, 2024 12:48

globally modify datasetdir

0f3c8ed

split visium tests to avoid too much downloading

1630f99

remove duplication of setup

c51da3b

fix mention of writedir

b8a8443

flying-sheep marked this pull request as draft May 17, 2024 15:58

flying-sheep added 5 commits May 31, 2024 14:49

Merge branch 'main' into doc-datasets

e8b288c

add cache

78c9e0f

activate internet tests

32c7951

add dependencies for internet tests

a66e12f

missed one

3c2487d

flying-sheep added 6 commits May 31, 2024 16:54

ugh

b1aa9b5

Merge branch 'main' into doc-datasets

e1461c8

Blob example

03f102f

moignard15, krumsiek11

4d2942c

more moignard

ccc7d9c

Merge branch 'main' into doc-datasets

f68c2de

flying-sheep force-pushed the doc-datasets branch from 4877994 to f68c2de Compare June 3, 2024 15:26

flying-sheep added 2 commits June 4, 2024 12:13

PBMC68k

ac6c614

3k

fda383e

flying-sheep requested a review from ilan-gold June 4, 2024 11:20

Merge branch 'main' into doc-datasets

0cb201f

flying-sheep marked this pull request as ready for review June 4, 2024 11:21

ilan-gold approved these changes Jun 4, 2024

nicer scale

0caa293

flying-sheep merged commit 4f40d68 into main Jun 4, 2024
4 of 12 checks passed

flying-sheep deleted the doc-datasets branch June 4, 2024 13:36

meeseeksmachine mentioned this pull request Jun 4, 2024

Backport PR #3060 on branch 1.10.x (Document datasets) #3094

Merged

meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request Jun 4, 2024

Backport PR scverse#3060: Document datasets

3168692

flying-sheep added a commit that referenced this pull request Jun 4, 2024

Backport PR #3060 on branch 1.10.x (Document datasets) (#3094)

d34e575

Co-authored-by: Philipp A <flying-sheep@web.de>
