
migrate from csv files to sqlite databases for downstream use in queries #120

Open

wants to merge 42 commits into base: staging-county-data

Conversation

@rfl-urbaniak rfl-urbaniak commented Mar 12, 2024

  1. All CSV data now live in two .db files for the two levels of analysis (counties and MSAs). Prior to deployment of the DBs to the polis server, these live locally. As the .db files are now too large to store on GitHub, the user needs to run csv_to_db_pipeline.py before first use to generate the DBs locally.

  2. The original DataGrabber classes have been refactored, renamed to ...CSV, and decoupled from downstream use.

  3. The new DataGrabberDB class has been introduced; via aliases, it now functions as the generic DataGrabber and MSADataGrabber.

  4. Additional tests for DataGrabberDB have been added in test_data_grabber_sql. Moreover, under its generic alias, DataGrabberDB passes all the tests that the original DataGrabber did.

  5. generate_sql.ipynb (docs/experimental) contains performance tests for both approaches. At least in the current setting, the original method is faster. The main culprit seems to be:

13    0.013    0.001    0.013    0.001 {method 'fetchall' of 'sqlite3.Cursor' objects}

This is not too surprising on reflection, as illustrated by this comment from ChatGPT:

(screenshot of the ChatGPT comment, 2024-03-12 08-44-32)

  1. As the ultimate test of the switch to the DB would involve data updates and model retraining, I am leaving the original .csv files and classes in place until those events. Keep in mind that they are no longer needed for queries to work correctly (they are still needed to generate the .db files and for some of the tests).

  2. The new pytest release leads to incompatibilities that might be worth investigating later. For now, the pytest version is pinned to 7.4.3 in setup.py.

  3. Some cleaning scripts have been moved to a subfolder, which required a small refactoring of the import statements in the generic data-cleaning pipeline scripts.

  4. Incorrect indentation in a DataGrabber test has been fixed.

  5. It turns out that isort with --profile black on the CI runner still behaves as if the profile were absent. I checked the versions between the local install and the runner; the version numbers are the same. Since more people have had similar issues, I suspended isort and decided to trust black, at least until a stable isort 6.0 gets out.

  6. Inference tests succeed with pyro-ppl==1.8.5 but fail with pyro-ppl==1.9. For now I pinned the version number in setup.py, but will think about investigating this more deeply.
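The two version pins above can be expressed in setup.py roughly as follows. This is an illustrative fragment: the package metadata is a placeholder, and whether the repo lists pytest under install_requires or a test extra is an assumption.

```python
# Illustrative setup.py fragment -- only the two pins reflect the notes above.
from setuptools import find_packages, setup

setup(
    name="example-project",  # placeholder, not the actual package name
    packages=find_packages(),
    install_requires=[
        "pyro-ppl==1.8.5",  # inference tests fail with pyro-ppl==1.9
    ],
    extras_require={
        "test": ["pytest==7.4.3"],  # the newest pytest release is incompatible
    },
)
```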

@rfl-urbaniak rfl-urbaniak changed the base branch from main to staging-county-data March 12, 2024 08:21
@rfl-urbaniak rfl-urbaniak changed the title migrate from csv files to sqllite databases for downstream use in queries migrate from csv files to sqlite databases for downstream use in queries Mar 12, 2024
@rfl-urbaniak
Contributor Author

I have apparently been running into this issue:

PyCQA/isort#1518

@rfl-urbaniak
Contributor Author

There are still discrepancies between what isort does on CI and locally, despite the version numbers being the same. Other people have faced this; I'm looking for a solution, but am slowly leaning toward using black without isort, at least until switching to isort 6.0.

PyCQA/isort#1889
