
migrate from csv files to sqlite databases for downstream use in queries #120

Open

wants to merge 42 commits into base: staging-county-data

Conversation

@rfl-urbaniak rfl-urbaniak commented Mar 12, 2024

  1. All CSV data now live in two .db files for the two levels of analysis (counties and MSAs). Prior to deployment of the DBs to the polis server, these live locally. As the .db files are now too large to store on GitHub, the user needs to run csv_to_db_pipeline.py before first use to generate the DBs locally.

  2. The original DataGrabber classes have been refactored, renamed to ...CSV, and decoupled from downstream use.

  3. The new DataGrabberDB class has been introduced; via aliases, it now functions as the generic DataGrabber and MSADataGrabber.

  4. Additional tests for DataGrabberDB have been added in test_data_grabber_sql. Moreover, under its generic alias, DataGrabberDB passes all the tests that the original DataGrabber did.

  5. generate_sql.ipynb (docs/experimental) contains performance tests for both approaches. At least in the current setting, the original method is faster. The main culprit seems to be:

13    0.013    0.001    0.013    0.001 {method 'fetchall' of 'sqlite3.Cursor' objects}

This is not too surprising on reflection, as illustrated by this comment from ChatGPT:

(screenshot of the ChatGPT comment, 2024-03-12 08-44-32)

  1. As the ultimate test of the switch to the DB would involve data updates and model retraining, I am leaving the original .csv files and classes in place until those events. Keep in mind that they are no longer needed for queries to work correctly (they are still needed to generate the .db files and for some of the tests).

  2. The new pytest release leads to incompatibilities that might be worth investigating later. For now, the pytest version is pinned to 7.4.3 in setup.py.

  3. Some cleaning scripts have been moved to a subfolder, which required a small refactoring of the import statements in the generic data-cleaning pipeline scripts.

  4. Incorrect indentation in a DataGrabber test has been fixed.

  5. It turns out that isort with --profile black on the CI runner still behaves as if the profile were absent. I checked the versions between the local install and the runner; the version numbers are the same. Since more people have had similar issues, I suspended isort and decided to trust black, at least until a stable isort 6.0 gets out.

  6. Inference tests succeed with pyro-ppl==1.8.5 but fail with pyro-ppl==1.9. For now I pinned the version number in setup.py, but will think about investigating this more deeply.
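The two version pins above can be expressed in setup.py roughly as follows. This is an illustrative fragment: the package metadata is a placeholder, and whether the repo lists pytest under install_requires or a test extra is an assumption.

```python
# Illustrative setup.py fragment -- only the two pins reflect the notes above.
from setuptools import find_packages, setup

setup(
    name="example-project",  # placeholder, not the actual package name
    packages=find_packages(),
    install_requires=[
        "pyro-ppl==1.8.5",  # inference tests fail with pyro-ppl==1.9
    ],
    extras_require={
        "test": ["pytest==7.4.3"],  # the newest pytest release is incompatible
    },
)
```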

@rfl-urbaniak rfl-urbaniak changed the base branch from main to staging-county-data March 12, 2024 08:21
@rfl-urbaniak rfl-urbaniak changed the title migrate from csv files to sqllite databases for downstream use in queries migrate from csv files to sqlite databases for downstream use in queries Mar 12, 2024
@rfl-urbaniak
Contributor Author

I have apparently been running into this issue:

PyCQA/isort#1518

@rfl-urbaniak
Contributor Author

There are still discrepancies between what isort does on CI and locally, despite the version numbers being the same. Other people have faced this; I'm looking for a solution, but am slowly leaning toward using black without isort, at least until switching to isort 6.0.

PyCQA/isort#1889
