migrate from `csv` files to `sqlite` databases for downstream use in queries #120
Open: rfl-urbaniak wants to merge 42 commits into staging-county-data from ru-sql
rfl-urbaniak changed the title from "migrate from `csv` files to `sqllite` databases for downstream use in queries" to "migrate from `csv` files to `sqlite` databases for downstream use in queries" on Mar 12, 2024
Apparently I have been running into this issue.

There are still differences between what isort does in CI and locally, despite the version numbers being the same. Other people have faced this. I'm looking for a solution, but I'm slowly leaning towards using black without isort, at least until switching to isort 6.0.
All csv data are now in two `.db` files for the two levels of analysis (counties and msa). Prior to deployment of the dbs to the polis server, these live locally. As the db files are now too large to store on GitHub, the user needs to run `csv_to_db_pipeline.py` before the first use to generate the db locally.

The original `DataGrabber` classes have been refactored, renamed to `...CSV`, and decoupled from downstream use.

The new `DataGrabberDB` class has been introduced and now serves as the generic `DataGrabber` and `MSADataGrabber`.

Additional tests for `DataGrabberDB` have been introduced in `test_data_grabber_sql`. Additionally, `DataGrabberDB` under the generic alias passes all the tests that the original `DataGrabber` did.

`generate_sql.ipynb` (`docs/experimental`) contains performance tests for both approaches. At least in the current setting, the original method is faster. The main culprit seems to be:

This is not too surprising, after some reflection, as illustrated by this comment from ChatGPT:
As the ultimate tests of the switch to DB would involve data updates and model retraining, I am keeping the original `.csv` files and classes until those events. Keep in mind that they are no longer needed for queries to work correctly (they are needed to generate the `.db` files and for some of the tests).

The new `pytest` release leads to incompatibilities that might be worth investigating later. For now, I pinned the `pytest` version to `7.4.3` in `setup.py`.

Some cleaning scripts have been moved to a subfolder, which required a small refactoring of import statements in the generic data cleaning pipeline scripts.
Incorrect indentation in a `DataGrabber` test has been fixed.

It turns out that `isort` with `--profile black` on the runner still behaves as if the profile were not set. I compared the local install with the one on the runner, and the version numbers are the same. More people have similar issues, so I suspended isort and decided to trust black, at least until a stable `isort 6.0` gets out.

Inference tests succeed with `pyro-ppl==1.8.5` and fail with `pyro-ppl==1.9`. For now I pinned the version number in `setup.py`, but will think about investigating this deeper.