
Issue using read_csv in a model file for a large CSV being converted to a parquet file #193

Open
jaanli opened this issue Jun 23, 2023 · 6 comments

jaanli commented Jun 23, 2023

Hi @jwills! Thank you for this amazing work!! I've used it to teach datathinking.org at the University of Tartu and @PrincetonUniversity.

However, I'm getting stuck switching from a local CSV file to a remote one; here is an example model file:

https://github.com/onefact/data_build_tool_for_datathinking.org/blob/main/datathinking/models/new_york_city_311_calls.sql

This duckdb code works in notebooks such as this one:

https://nbviewer.org/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb

The notebook takes a large CSV file of New York City 311 phone calls:

https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD

and, before visualization, exports it to a parquet file:

https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.parquet

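For reference, a minimal sketch of the kind of DuckDB statement that performs this conversion (the file names below are placeholders, not the notebook's exact code):

COPY (
    SELECT * FROM read_csv_auto('311_service_requests.csv', header = true)
) TO '311_service_requests.parquet' (FORMAT PARQUET);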

I added the output of dbt run using the above model here:

https://github.com/onefact/data_build_tool_for_datathinking.org/blob/main/dbt_output/new_york_city_311_calls.duckdb

This .duckdb database is empty:

 me@laptop   ~/dropbox/projects/data_build_tool_for_datathinking.org/dbt_output     main  duckdb new_york_city_311_calls.duckdb
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
D DESCRIBE;
┌──────────┬─────────┬────────────┬──────────────┬──────────────┬───────────┐
│ database │ schema  │ table_name │ column_names │ column_types │ temporary │
│ varchar  │ varchar │  varchar   │  varchar[]   │  varchar[]   │  boolean  │
├───────────────────────────────────────────────────────────────────────────┤
│                                  0 rows                                   │
└───────────────────────────────────────────────────────────────────────────┘
D

I'm not sure how to debug this -- any ideas?

Thank you so much!! (This course is free & open access for anyone to learn data thinking with the help of GPT and the latest AI tools; we work with a variety of community organizations as well, so any help goes a long way :)


jwills commented Jun 23, 2023

Hey @jaanli, thanks for the nice note and for the teaching you do! So I cloned and ran the project you linked to locally without any issues afaict:

jwills@Joshs-MBP dev % git clone git@github.com:onefact/data_build_tool_for_datathinking.org.git
Cloning into 'data_build_tool_for_datathinking.org'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 41 (delta 9), reused 33 (delta 5), pack-reused 0
Receiving objects: 100% (41/41), 17.93 KiB | 1.99 MiB/s, done.
Resolving deltas: 100% (9/9), done.
jwills@Joshs-MBP dev % cd data_build_tool_for_datathinking.org
jwills@Joshs-MBP data_build_tool_for_datathinking.org % ls
LICENSE		README.md	datathinking	dbt_output
jwills@Joshs-MBP data_build_tool_for_datathinking.org % cd datathinking
jwills@Joshs-MBP datathinking % ls
README.md	analyses	dbt_project.yml	macros		models		profiles.yml	seeds		snapshots	tests
jwills@Joshs-MBP datathinking % dbt debug
16:51:47  Running with dbt=1.5.0
16:51:47  dbt version: 1.5.0
16:51:47  python version: 3.11.3
16:51:47  python path: /opt/homebrew/opt/python@3.11/bin/python3.11
16:51:47  os info: macOS-13.4-arm64-arm-64bit
16:51:47  Using profiles.yml file at /Users/jwills/dev/data_build_tool_for_datathinking.org/datathinking/profiles.yml
16:51:47  Using dbt_project.yml file at /Users/jwills/dev/data_build_tool_for_datathinking.org/datathinking/dbt_project.yml
16:51:47  Configuration:
16:51:47    profiles.yml file [OK found and valid]
16:51:47    dbt_project.yml file [OK found and valid]
16:51:47  Required dependencies:
16:51:47   - git [OK found]

16:51:47  Connection:
16:51:47    database: new_york_city_311_calls
16:51:47    schema: main
16:51:47    path: /tmp/new_york_city_311_calls.duckdb
16:51:52    Connection test: [OK connection ok]

16:51:52  All checks passed!
jwills@Joshs-MBP datathinking % dbt run
16:51:57  Running with dbt=1.5.0
16:51:57  Unable to do partial parsing because saved manifest not found. Starting full parse.
16:51:57  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.datathinking.example
16:51:57  Found 1 model, 0 tests, 0 snapshots, 0 analyses, 315 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
16:51:57
16:51:57  Concurrency: 1 threads (target='dev')
16:51:57
16:51:57  1 of 1 START sql view model main.new_york_city_311_calls ....................... [RUN]
16:51:57  1 of 1 OK created sql view model main.new_york_city_311_calls .................. [OK in 0.04s]
16:51:57
16:51:57  Finished running 1 view model in 0 hours 0 minutes and 0.13 seconds (0.13s).
16:51:57
16:51:57  Completed successfully
16:51:57
16:51:57  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
jwills@Joshs-MBP datathinking % ls -ltr /tmp
lrwxr-xr-x@ 1 root  wheel  11 May 12 15:29 /tmp -> private/tmp
jwills@Joshs-MBP datathinking % ls -ltr /tmp/new_york_city_311_calls.duckdb
-rw-r--r--  1 jwills  wheel  274432 Jun 23 09:51 /tmp/new_york_city_311_calls.duckdb
jwills@Joshs-MBP datathinking % duckdb /tmp/new_york_city_311_calls.duckdb
v0.8.1 6536a77232
Enter ".help" for usage hints.
D show tables;
┌─────────────────────────┐
│          name           │
│         varchar         │
├─────────────────────────┤
│ new_york_city_311_calls │
└─────────────────────────┘

Note that the DuckDB database file I am loading is the one at /tmp/new_york_city_311_calls.duckdb, which is the path specified in profiles.yml right now on the master branch, so the first thing to check is whether the view of that CSV file you created ended up in that file instead. Assuming that isn't the issue (i.e., you have locally updated your profiles.yml to point at the Dropbox directory you showed in your message), I would want to know a couple of things:

  1. What version of DuckDB you're running (both for the CLI and for your Python installation, since they aren't necessarily the same; see the sketch after this list), and
  2. Whether the dbt run call works correctly if the path is set to /tmp/new_york_city_311_calls.duckdb instead of (what I have to assume is) a Dropbox-managed directory.
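
For reference, one way to check both is to run the same statement in the DuckDB CLI and from Python (e.g. duckdb.sql("PRAGMA version")) and compare the results:

PRAGMA version;  -- or: SELECT version();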

I'm not aware of any issues with writing to DuckDB database files in Dropbox-managed directories, but it's totally possible there is some conflict there that is messing things up for you.


jaanli commented Jun 24, 2023

Oh interesting, thank you so much @jwills!

Yup, I see what you mean; my bad for being unclear in the original description.

I think I got to that stage too.

Here's the specific issue, now that you've helped clarify what may be going on:

Very little of the data from https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD seems to end up in the /tmp/new_york_city_311_calls.duckdb database file that appears after executing dbt run.

For example, I see the same output you do for show tables;, but I can't select any rows: the query below has been hanging indefinitely (at least 10-20 minutes of waiting so far!):

 me@laptop   ~/dropbox/projects/data_build_tool_for_datathinking.org/datathinking     main  duckdb /tmp/new_york_city_311_calls.duckdb
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
D show tables;
┌─────────────────────────┐
│          name           │
│         varchar         │
├─────────────────────────┤
│ new_york_city_311_calls │
└─────────────────────────┘
D SELECT * FROM new_york_city_311_calls LIMIT 10;
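
In case it's useful, one way to inspect what the file actually stores is to dump the view definitions (duckdb_views() lists the underlying SQL for each view):

SELECT view_name, sql FROM duckdb_views();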

Further, https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD is 18 gigabytes, but /tmp/new_york_city_311_calls.duckdb is 268 kilobytes, which is why something seems off here (or, more likely, I'm not understanding how dbt or duckdb work, despite having taught them and related concepts for over a semester!! lol :S):

 me@laptop   ~/dropbox/projects/data_build_tool_for_datathinking.org/datathinking     main  ls -lh /tmp/new_york_city_311_calls.duckdb
-rw-r--r--  1 me  wheel   268K Jun 24 12:48 /tmp/new_york_city_311_calls.duckdb

Is that helpful for pinpointing the issue a little further? Appreciate your help!!! Feels like we're getting closer 🍡


jwills commented Jun 24, 2023

Yeah, I noticed that your default materialization in the project was set to the dbt default, which is a view. That explains the small size of the DuckDB file you're creating in the project, as it is only storing the query in the DB, not the actual data. If you want it to pull all of the data over to your local instance, you need to configure the model to be materialized as a table.
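
For example, a minimal sketch of doing that in the model file itself, which is equivalent to setting materialized: table for the model in dbt_project.yml (the select below is a placeholder, not your actual model):

{{ config(materialized='table') }}

select * from read_csv_auto('rows.csv')  -- placeholder for your read_csv(...) query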

The indefinite hang doesn't shock me either -- I just copy-pasted that NYC URL into a browser, and it kicked off a download that needs to pull down the full 18GB before the query can run against it. You'd almost certainly be better off pulling that data down once and storing it in another external source (like S3) that can start streaming the data to you immediately when a query runs.
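
For example, once the data is available as Parquet at a URL, something like this should be able to run without a full download first, since DuckDB's httpfs extension fetches only the byte ranges a query needs:

INSTALL httpfs;
LOAD httpfs;
SELECT *
FROM read_parquet('https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.parquet')
LIMIT 10;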

Hope that helps a bit!


jaanli commented Jul 7, 2023

Thank you so much @jwills! That helps and I'm getting closer.

Changing the line in dbt_project.yml from materialized: view to materialized: table makes the run take much longer:

 me@laptop   ~/dropbox/projects/data_build_tool_for_datathinking.org/datathinking     main    dbt run
15:55:41  Running with dbt=1.5.0
15:55:41  Found 1 model, 0 tests, 0 snapshots, 0 analyses, 313 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
15:55:41  
15:55:41  Concurrency: 4 threads (target='dev')
15:55:41  
15:55:41  1 of 1 START sql table model main.service_requests ............................. [RUN]
16:44:46  1 of 1 OK created sql table model main.service_requests ........................ [OK in 2944.05s]
16:44:46  
16:44:46  Finished running 1 table model in 0 hours 49 minutes and 4.12 seconds (2944.12s).
16:44:46  
16:44:46  Completed successfully
16:44:46  
16:44:46  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

However, over 90% of the rows are missing: there should be 33M, not the 2.5M shown here:

 me@laptop   ~/dropbox/projects/data_build_tool_for_datathinking.org/datathinking     main    duckdb /tmp/service_requests.duckdb
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
D SELECT COUNT(*) FROM service_requests;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      2538812 │
└──────────────┘
D 

The .duckdb file is also much smaller than expected:

me@laptop   ~/dropbox/projects/data_build_tool_for_datathinking.org/datathinking     main    ls -lh /tmp/service_requests.duckdb 
-rw-r--r--  1 me  wheel   189M Jul  7 12:44 /tmp/service_requests.duckdb

If I run wget https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.csv, it downloads the entire 18GB file, and if I then execute the DuckDB export query from the notebook, I get a 2.1GB parquet file instead of the 189M file above (https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.parquet, created with https://nbviewer.org/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb).
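
As a sanity check on the expected row count, counting the rows of the published parquet file directly should only need the file's metadata rather than the full 2.1GB (assuming the httpfs extension is installed and loaded):

SELECT count(*)
FROM read_parquet('https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.parquet');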

Any idea how to debug this? I might be missing something basic again -- really grateful for your help! (Context: we're working with 4 nonprofits, some national, all of whom will be using this stack, so this infra is already going a long way!!)


jwills commented Jul 7, 2023

Mmm, okay -- still guessing here, but I'd be suspicious of the ignore_errors=True line in the read_csv config in this model (assuming it's upstream of whatever the service_requests model is doing):

https://github.com/onefact/data_build_tool_for_datathinking.org/blob/main/datathinking/models/new_york_city_311_calls.sql

...like that makes me think that if DuckDB couldn't parse one of the lines of the CSV file for some reason (most likely a typing issue), then it would simply ignore it, leading to a smaller resulting data set.
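
If you want to quantify what ignore_errors is dropping, a comparison like this might help ('rows.csv' is a placeholder for your local copy of the file; all_varchar reads every column as text, so rows that fail only on type inference still parse):

SELECT count(*) FROM read_csv_auto('rows.csv', header = true, all_varchar = true);
SELECT count(*) FROM read_csv_auto('rows.csv', header = true, ignore_errors = true);
-- a large gap between these two counts points at type-parsing failures
-- rather than genuinely malformed lines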


jaanli commented Jul 7, 2023
