Issue using read_csv in a model file for a large CSV being converted to a parquet file #193
Comments
Hey @jaanli, thanks for the nice note and for the teaching you do! So I cloned and ran the project you linked to locally without any issues afaict:
Note that the DuckDB database file I am loading is the one that is at
I'm not aware of any issues on writing to DuckDB database files in Dropbox-managed directories, but it's totally possible there is some conflict there that is messing things up for you.
Oh interesting, thank you so much @jwills! Yup, I see what you mean; my bad on being unclear in the original description. I think I got to that stage too. The specific issue, now that you've helped clarify what may be going on: there doesn't seem to be much of the data from https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD in the
For example, I see the same output you do for
Further, https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD is 18 gigabytes, but
Is that helpful for pinpointing the issue a little further? Appreciate your help!!! Feels like we're getting closer 🍡
Yeah, I noticed that your default materialization in the project was set to the dbt default, which is a view. The indefinite hang doesn't shock me either -- I just copy-pasted that NYC URL into a browser, and it kicked off a download process that needs to pull down the full 18GB before the query can run against it. You'd almost certainly be better off pulling that data down once and storing it in another external source (like S3) that can start streaming the data to you immediately when a query is run. Hope that helps a bit!
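The switch away from the default view materialization can be sketched as a dbt-duckdb model file (a hypothetical model, assuming dbt-duckdb's `external` materialization; the CSV path is illustrative):

```sql
-- Hypothetical dbt model, e.g. models/service_requests.sql.
-- dbt-duckdb's "external" materialization writes the model's result
-- out as a file (Parquet here) instead of the default view, so the
-- source only has to be scanned once.
{{ config(materialized='external', format='parquet') }}

select *
from read_csv_auto('rows.csv')  -- illustrative local path
```

With a view, every downstream query re-runs the `read_csv` over the source, which is why a remote 18GB CSV hangs.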
Thank you so much @jwills! That helps and I'm getting closer. Changing the line in
However, it is missing ~90% of the rows: there should be 33M rows instead of 2.5M:
The
If I run
Any idea how to debug this? I might be missing something basic again -- really grateful for your help! (Context: we're working with 4 nonprofits, some national; all will be using this stack, so this infra is already going a long way!!)
Mmm, okay -- still guessing here, but I'd be suspicious of the ...
That makes me think that if DuckDB couldn't parse one of the lines of the CSV file for some reason (most likely a typing issue), it would simply ignore it, leading to a smaller resulting data set.
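One way to check for silently dropped rows is to compare a strict parse against a lenient one (a sketch, assuming DuckDB's `read_csv_auto` options `sample_size` and `ignore_errors`; the file path is illustrative):

```sql
-- Sketch: detect rows DuckDB is silently skipping.
-- 'rows.csv' is an illustrative local path.

-- Strict parse: sample_size=-1 sniffs types from the whole file,
-- and any line that still fails to parse raises an error loudly.
select count(*) from read_csv_auto('rows.csv', sample_size=-1);

-- Lenient parse: ignore_errors=true silently drops unparseable
-- lines, so a lower count here points at typing/parse failures.
select count(*) from read_csv_auto('rows.csv', ignore_errors=true);
```

If the two counts differ, the gap is exactly the rows being discarded by the lenient parse.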
Yay! That was it, thank you @jwills! Here's an end-to-end example: https://github.com/onefact/data_build_tool_for_datathinking.org/blob/main/datathinking/models/cityofnewyork.us/service_requests.sql leads to https://public.datathinking.org/cityofnewyork.us%2Fservice_requests.parquet
Code example of using this downstream for data analysis and visualization (and soon AI/ML :):
Really grateful for the help!!
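Querying the published Parquet file downstream can be sketched directly in DuckDB (a sketch, assuming DuckDB's httpfs extension for remote reads):

```sql
-- Sketch: query the exported Parquet file straight over HTTP.
-- httpfs lets DuckDB range-read the remote file without
-- downloading it in full first.
install httpfs;
load httpfs;

select count(*)
from read_parquet('https://public.datathinking.org/cityofnewyork.us%2Fservice_requests.parquet');
```

Because Parquet stores row counts and column statistics in its footer, a `count(*)` like this reads only metadata rather than the whole file.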
Hi @jwills! Thank you for this amazing work!! I've used it to teach datathinking.org at the University of Tartu and @PrincetonUniversity.
However, I'm getting stuck in switching from a local CSV file to a remote one; here is an example model file:
https://github.com/onefact/data_build_tool_for_datathinking.org/blob/main/datathinking/models/new_york_city_311_calls.sql
This duckdb code works in notebooks such as this one: https://nbviewer.org/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb
This takes a large CSV file of NYC phone calls:
https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD
and exports it to a Parquet file before visualization:
https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.parquet
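The CSV-to-Parquet step above can be sketched as a plain DuckDB `COPY` statement (a sketch with an illustrative output path, assuming the httpfs extension; this is not the project's actual model file):

```sql
-- Sketch: convert the remote 311 CSV into a local Parquet file.
-- httpfs is needed so read_csv_auto can pull the file over HTTP;
-- the output filename is illustrative.
install httpfs;
load httpfs;

copy (
    select *
    from read_csv_auto('https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD')
) to 'service_requests.parquet' (format parquet);
```

This is roughly what a dbt-duckdb model does under the hood when it materializes a `read_csv` select as an external Parquet file.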
I added the output of dbt run using the above model here: https://github.com/onefact/data_build_tool_for_datathinking.org/blob/main/dbt_output/new_york_city_311_calls.duckdb
This .duckdb database is empty. I'm not sure how to debug this -- any ideas?
Thank you so much!! (This course is free and open access for anyone to learn data thinking with the help of GPT and the latest AI tools; we also work with a variety of community organizations, so any help goes a long way :)