feature/reduce-and-fan-ngram-index #189
Conversation
Codecov Report
```diff
@@             Coverage Diff             @@
##             main     #189      +/-   ##
==========================================
- Coverage   94.60%   93.36%   -1.24%
==========================================
  Files          50       51       +1
  Lines        2632     2669      +37
==========================================
+ Hits         2490     2492       +2
- Misses        142      177      +35
==========================================
```
Continue to review full report at Codecov.
This is ready for review! There is a related cookiecutter-cdp-deployment PR here: CouncilDataProject/cookiecutter-cdp-deployment#108. You can see this pipeline in action here: https://github.com/JacksonMaxfield/cdp-dev/actions/runs/2491590958
Nice work!
Looks good to me! This is a really cool way of optimizing the index pipeline!
Description of Changes
This splits our indexing pipeline into two separate functions / bin scripts.
The first generates many small index parquet chunks. The second uploads a single chunk.
This is intended to be used with GitHub Actions: the first script runs and uploads all of the index chunks as artifact files, then a new GitHub Actions runner is spawned for each uploaded chunk, and each runner uploads just that single chunk.
Gather -> Process -> Store -> Fan -> Upload
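The fan-out shape above can be sketched roughly as below. The function names (`generate_index_chunks`, `upload_chunk`) are hypothetical stand-ins for the two bin scripts, and plain JSON files stand in for the parquet chunks so the sketch has no third-party dependencies; in the real pipeline each chunk file would be a parquet file handed between jobs as a GitHub Actions artifact.

```python
import json
import pathlib
import tempfile


def generate_index_chunks(records, chunk_size, out_dir):
    """First script (Gather -> Process -> Store -> Fan): split the full
    index into many small chunk files and return their paths.

    Hypothetical sketch: the real pipeline writes parquet chunks and
    uploads them as CI artifacts; JSON is used here for simplicity.
    """
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(records), chunk_size):
        path = out_dir / f"index-chunk-{i // chunk_size}.json"
        path.write_text(json.dumps(records[i : i + chunk_size]))
        paths.append(path)
    return paths


def upload_chunk(path):
    """Second script (Upload): handle exactly one chunk. A separate CI
    runner would invoke this once per chunk artifact; the actual upload
    to the index store is stubbed out as a read here.
    """
    return json.loads(pathlib.Path(path).read_text())


# Example: 10 index records fanned into chunks of 4 -> 3 chunk files
tmp = tempfile.mkdtemp()
records = [{"term": f"t{i}"} for i in range(10)]
chunks = generate_index_chunks(records, chunk_size=4, out_dir=tmp)
print(len(chunks))  # 3
print(sum(len(upload_chunk(p)) for p in chunks))  # 10
```

In CI, the fan step would typically emit the chunk list as a job output that a `strategy.matrix` in a downstream job consumes, giving one runner per chunk.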