migrate from csv files to sqlite databases for downstream use in queries #120

Open
wants to merge 42 commits into
base: staging-county-data
42 commits
268ee04
created sql db
rfl-urbaniak Mar 6, 2024
7df6f9e
started sql migration
rfl-urbaniak Mar 6, 2024
c84a537
conversion of csvs to db
rfl-urbaniak Mar 6, 2024
50e93ba
small speed test
rfl-urbaniak Mar 8, 2024
02dd6ce
data cleaning scripts migrated to a subfolder
rfl-urbaniak Mar 9, 2024
d8cced8
fixed pytest version at 7.4.3
rfl-urbaniak Mar 9, 2024
c127307
data export from csv to db with test
rfl-urbaniak Mar 9, 2024
9c8d80c
fix indentation dg
rfl-urbaniak Mar 10, 2024
d2cb7d2
WIP
rfl-urbaniak Mar 10, 2024
c2b731e
DataGrabberDB with tests
rfl-urbaniak Mar 10, 2024
e5f8a4b
refactored DataGrabberCSV
rfl-urbaniak Mar 10, 2024
5f942cf
passed DataGrabberDB downstream
rfl-urbaniak Mar 10, 2024
b46dc5a
performance tests
rfl-urbaniak Mar 12, 2024
648974e
removed vscode settings
rfl-urbaniak Mar 12, 2024
7c43efa
lint with new mypy and pyro, ignore Adam mypy complaint
rfl-urbaniak Mar 12, 2024
79f74d7
added staging-* to workflow
rfl-urbaniak Mar 12, 2024
1312ff6
force the most recent version of isort in setup
rfl-urbaniak Mar 12, 2024
2c5a8b1
Merge branch 'staging-county-data' of https://github.com/BasisResearc…
rfl-urbaniak Mar 12, 2024
e5919f1
typo in isort import
rfl-urbaniak Mar 12, 2024
b3664ce
isort modeling_utils.py
rfl-urbaniak Mar 12, 2024
7f47359
switch to --apply within isort in clean.sh
rfl-urbaniak Mar 12, 2024
41cda1f
removed --apply as redundant
rfl-urbaniak Mar 12, 2024
839a4ab
upgrade black
rfl-urbaniak Mar 12, 2024
88bc3ef
add black profile to scripts
rfl-urbaniak Mar 12, 2024
99ddeb0
removed black from nbqa
rfl-urbaniak Mar 12, 2024
0a37867
dealing with linter versions
rfl-urbaniak Mar 12, 2024
475c732
revising workflow
rfl-urbaniak Mar 12, 2024
a9bc78d
db pipeline to workflow
rfl-urbaniak Mar 12, 2024
d5062a1
suspend black to avoid linting version issues
rfl-urbaniak Mar 12, 2024
4764f3d
Merge branch 'staging-county-data' of https://github.com/BasisResearc…
rfl-urbaniak Mar 12, 2024
a0f97a5
decouple db pipeline from data grabber
rfl-urbaniak Mar 12, 2024
c13a431
lint
rfl-urbaniak Mar 12, 2024
1885390
runner isort recommendations by hand
rfl-urbaniak Mar 12, 2024
8e23fd3
suspend isort switch to black
rfl-urbaniak Mar 12, 2024
f7fb0ec
switch to dev install (as torch is required to test inference now)
rfl-urbaniak Mar 12, 2024
cfde8bc
suspend notebook tests
rfl-urbaniak Mar 12, 2024
5ca57e1
typo in test yaml
rfl-urbaniak Mar 12, 2024
be24f85
remove parallel testing as different tests are collected at different…
rfl-urbaniak Mar 12, 2024
71bc2b9
fixing test.yml
rfl-urbaniak Mar 12, 2024
a0c0526
fixed pyro version to 1.8.5
rfl-urbaniak Mar 12, 2024
9fe9130
removed redundant code from test_inference
rfl-urbaniak Mar 13, 2024
74cb94d
format lint
rfl-urbaniak Mar 13, 2024
39 changes: 39 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,39 @@
+name: Lint
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main, staging-* ]
+  workflow_dispatch:
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ['3.10']
+
+    steps:
+      - uses: actions/checkout@v2
+
+      - name: pip cache
+        uses: actions/cache@v1
+        with:
+          path: ~/.cache/pip
+          key: lint-pip-${{ hashFiles('**/pyproject.toml') }}
+          restore-keys: |
+            lint-pip-
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v1
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install .[test]
+
+      - name: Lint
+        run: ./scripts/lint.sh
35 changes: 0 additions & 35 deletions .github/workflows/python-app.yml

This file was deleted.

59 changes: 59 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,59 @@
+name: Test
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main, staging-* ]
+  workflow_dispatch:
+
+jobs:
+  build:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ['3.10']
+        os: [ubuntu-latest] # , macos-latest]
+
+    steps:
+      - uses: actions/checkout@v2
+      - name: Ubuntu cache
+        uses: actions/cache@v1
+        if: startsWith(matrix.os, 'ubuntu')
+        with:
+          path: ~/.cache/pip
+          key:
+            ${{ matrix.os }}-${{ matrix.python-version }}-${{ hashFiles('**/pyproject.toml') }}
+          restore-keys: |
+            ${{ matrix.os }}-${{ matrix.python-version }}-
+
+      - name: macOS cache
+        uses: actions/cache@v1
+        if: startsWith(matrix.os, 'macOS')
+        with:
+          path: ~/Library/Caches/pip
+          key:
+            ${{ matrix.os }}-${{ matrix.python-version }}-${{ hashFiles('**/pyproject.toml') }}
+          restore-keys: |
+            ${{ matrix.os }}-${{ matrix.python-version }}-
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v1
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e .[dev]
+
+      - name: Generate databases
+        run: python cities/utils/csv_to_db_pipeline.py
+
+      - name: Test
+        run: python -m pytest tests/
+
+      - name: Test Notebooks
+        run: |
+          ./scripts/test_notebooks.sh
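Reviewer note: the `Generate databases` step above runs `cities/utils/csv_to_db_pipeline.py`, whose contents are not shown in this part of the diff. Purely as an illustration of what a CSV-to-SQLite conversion can look like, here is a minimal standard-library sketch. The function name `csv_to_sqlite`, the file `gdp_wide.csv`, the table name, and the column names are all hypothetical and are not taken from the repository; the actual pipeline may work quite differently.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def csv_to_sqlite(csv_path: Path, conn: sqlite3.Connection, table: str) -> int:
    """Copy one CSV file into a SQLite table (hypothetical helper).

    The header row supplies the column names. Columns are created without
    declared types, so every value is stored as the text read from the file.
    Returns the number of data rows inserted.
    """
    with csv_path.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)

    # Quote identifiers so headers such as "2020" are valid column names.
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)

    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


def demo() -> list:
    """Round-trip a tiny made-up CSV through SQLite and query one row."""
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = Path(tmp) / "gdp_wide.csv"  # assumed file name
        csv_path.write_text(
            "GeoFIPS,GeoName,2020\n1001,Autauga,1.5\n1003,Baldwin,2.5\n"
        )
        conn = sqlite3.connect(Path(tmp) / "example.db")
        try:
            csv_to_sqlite(csv_path, conn, "gdp_wide")
            return conn.execute(
                'SELECT "GeoName" FROM gdp_wide WHERE "GeoFIPS" = ?', ("1001",)
            ).fetchall()
        finally:
            conn.close()
```

A query such as the one in `demo()` reads only the matching rows rather than loading a whole CSV file, which is presumably the motivation for the migration in this PR.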
16 changes: 0 additions & 16 deletions .vscode/launch.json

This file was deleted.

1 change: 1 addition & 0 deletions cities/__init__.py
@@ -2,4 +2,5 @@
 
 Project short description.
 """
+
 __version__ = "0.0.1"
6 changes: 3 additions & 3 deletions cities/modeling/model_interactions.py
@@ -3,10 +3,10 @@
 from typing import Optional
 
 import dill
+import pyro.distributions as dist
 import torch
 
 import pyro
-import pyro.distributions as dist
 from cities.modeling.modeling_utils import (
     prep_wide_data_for_inference,
     train_interactions_model,
@@ -50,12 +50,12 @@ def __init__(
 
         self.model_args = self.data["model_args"]
 
-        self.model_conditioned = pyro.condition(
+        self.model_conditioned = pyro.condition(  # type: ignore
             self.model,
             data={"T": self.data["t"], "Y": self.data["y"], "X": self.data["x"]},
         )
 
-        self.model_rendering = pyro.render_model(
+        self.model_rendering = pyro.render_model(  # type: ignore
             self.model, model_args=self.model_args, render_distributions=True
         )
 
8 changes: 4 additions & 4 deletions cities/modeling/modeling_utils.py
@@ -3,6 +3,9 @@
 import matplotlib.pyplot as plt
 import pandas as pd
 import torch
+from pyro.infer import SVI, Trace_ELBO
+from pyro.infer.autoguide import AutoNormal
+from pyro.optim import Adam  # type: ignore
 from scipy.stats import spearmanr
 
 import pyro
@@ -11,9 +14,6 @@
     list_available_features,
     list_tensed_features,
 )
-from pyro.infer import SVI, Trace_ELBO
-from pyro.infer.autoguide import AutoNormal
-from pyro.optim import Adam  # type: ignore
 
 
 def drop_high_correlation(df, threshold=0.85):
@@ -217,7 +217,7 @@ def train_interactions_model(
     lr: float = 0.01,
 ):
     guide = None
-    pyro.clear_param_store()
+    pyro.clear_param_store()  # type: ignore
 
     guide = AutoNormal(conditioned_model)
 
8 changes: 4 additions & 4 deletions cities/queries/causal_insight_slim.py
@@ -500,10 +500,10 @@ def get_fips_predictions(
         difference = (
             self.predictions_original["observed"] - self.observed_outcomes_original
         )
-        self.predictions_original[
-            ["observed", "mean", "low", "high"]
-        ] = self.predictions_original[["observed", "mean", "low", "high"]].sub(
-            difference, axis=0
+        self.predictions_original[["observed", "mean", "low", "high"]] = (
+            self.predictions_original[["observed", "mean", "low", "high"]].sub(
+                difference, axis=0
+            )
         )
 
     def plot_predictions(
12 changes: 6 additions & 6 deletions cities/queries/fips_query.py
@@ -467,9 +467,9 @@ def find_euclidean_kins(self):
                 if col.endswith(feature)
             ]
             if _selected:
-                atemporal_aggregated_dict[
-                    feature
-                ] = atemporal_featurewise_contributions_df[_selected].sum(axis=1)
+                atemporal_aggregated_dict[feature] = (
+                    atemporal_featurewise_contributions_df[_selected].sum(axis=1)
+                )
 
         aggregated_atemporal_featurewise_contributions_df = pd.DataFrame(
             atemporal_aggregated_dict
@@ -489,9 +489,9 @@ def find_euclidean_kins(self):
             axis=1,
         )
         columns_to_normalize = self.aggregated_featurewise_contributions.iloc[:, 3:]
-        self.aggregated_featurewise_contributions.iloc[
-            :, 3:
-        ] = columns_to_normalize.div(columns_to_normalize.sum(axis=1), axis=0)
+        self.aggregated_featurewise_contributions.iloc[:, 3:] = (
+            columns_to_normalize.div(columns_to_normalize.sum(axis=1), axis=0)
+        )
 
         # some sanity checks
         count = sum([1 for distance in distances if distance == 0])
2 changes: 1 addition & 1 deletion cities/utils/clean_variable.py
@@ -1,7 +1,7 @@
 import numpy as np
 import pandas as pd
 
-from cities.utils.clean_gdp import clean_gdp
+from cities.utils.cleaning_scripts.clean_gdp import clean_gdp
 from cities.utils.cleaning_utils import standardize_and_scale
 from cities.utils.data_grabber import DataGrabber, find_repo_root
 
58 changes: 35 additions & 23 deletions cities/utils/cleaning_pipeline.py
@@ -1,26 +1,38 @@
-from cities.utils.clean_age_composition import clean_age_composition
-from cities.utils.clean_burdens import clean_burdens
-from cities.utils.clean_ethnic_composition import clean_ethnic_composition
-from cities.utils.clean_ethnic_composition_ma import clean_ethnic_composition_ma
-from cities.utils.clean_gdp import clean_gdp
-from cities.utils.clean_gdp_ma import clean_gdp_ma
-from cities.utils.clean_hazard import clean_hazard
-from cities.utils.clean_homeownership import clean_homeownership
-from cities.utils.clean_income_distribution import clean_income_distribution
-from cities.utils.clean_industry import clean_industry
-from cities.utils.clean_industry_ma import clean_industry_ma
-from cities.utils.clean_industry_ts import clean_industry_ts
-from cities.utils.clean_population import clean_population
-from cities.utils.clean_population_density import clean_population_density
-from cities.utils.clean_population_ma import clean_population_ma
-from cities.utils.clean_spending_commerce import clean_spending_commerce
-from cities.utils.clean_spending_HHS import clean_spending_HHS
-from cities.utils.clean_spending_transportation import clean_spending_transportation
-from cities.utils.clean_transport import clean_transport
-from cities.utils.clean_unemployment import clean_unemployment
-from cities.utils.clean_urbanicity_ma import clean_urbanicity_ma
-from cities.utils.clean_urbanization import clean_urbanization
-from cities.utils.cleaning_poverty import clean_poverty
+from cities.utils.cleaning_scripts.clean_age_composition import clean_age_composition
+from cities.utils.cleaning_scripts.clean_burdens import clean_burdens
+from cities.utils.cleaning_scripts.clean_ethnic_composition import (
+    clean_ethnic_composition,
+)
+from cities.utils.cleaning_scripts.clean_ethnic_composition_ma import (
+    clean_ethnic_composition_ma,
+)
+from cities.utils.cleaning_scripts.clean_gdp import clean_gdp
+from cities.utils.cleaning_scripts.clean_gdp_ma import clean_gdp_ma
+from cities.utils.cleaning_scripts.clean_hazard import clean_hazard
+from cities.utils.cleaning_scripts.clean_homeownership import clean_homeownership
+from cities.utils.cleaning_scripts.clean_income_distribution import (
+    clean_income_distribution,
+)
+from cities.utils.cleaning_scripts.clean_industry import clean_industry
+from cities.utils.cleaning_scripts.clean_industry_ma import clean_industry_ma
+from cities.utils.cleaning_scripts.clean_industry_ts import clean_industry_ts
+from cities.utils.cleaning_scripts.clean_population import clean_population
+from cities.utils.cleaning_scripts.clean_population_density import (
+    clean_population_density,
+)
+from cities.utils.cleaning_scripts.clean_population_ma import clean_population_ma
+from cities.utils.cleaning_scripts.clean_spending_commerce import (
+    clean_spending_commerce,
+)
+from cities.utils.cleaning_scripts.clean_spending_HHS import clean_spending_HHS
+from cities.utils.cleaning_scripts.clean_spending_transportation import (
+    clean_spending_transportation,
+)
+from cities.utils.cleaning_scripts.clean_transport import clean_transport
+from cities.utils.cleaning_scripts.clean_unemployment import clean_unemployment
+from cities.utils.cleaning_scripts.clean_urbanicity_ma import clean_urbanicity_ma
+from cities.utils.cleaning_scripts.clean_urbanization import clean_urbanization
+from cities.utils.cleaning_scripts.cleaning_poverty import clean_poverty
 
 # from cities.utils.clean_health import clean_health
 
File renamed without changes.