feature/reduce-and-fan-ngram-index #189

Merged
18 commits merged on Jun 15, 2022

evamaxfield
Member

Description of Changes

Include a description of the proposed changes.

This splits our indexing pipeline into two separate functions / bin scripts.

The first generates many small index parquet chunks. The second uploads a single chunk.

This is intended for use with GitHub Actions: the first script runs and uploads all of the index chunks as artifact files, then a new GitHub Actions runner is spawned for each chunk to upload that single chunk.

Gather -> Process -> Store -> Fan -> Upload
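
As a rough illustration of the two-script split described above, here is a minimal sketch. It is not the code from this PR: the function names, the row fields (`event_id`, `ngram`, `value`), and the in-memory `FAKE_INDEX_STORE` are all hypothetical, and JSON files stand in for parquet so the example needs only the standard library.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical stand-in for the remote index store; the real pipeline
# uploads to a database, which is not modeled here.
FAKE_INDEX_STORE: Dict[str, dict] = {}


def generate_index_chunks(
    rows: List[dict], chunk_size: int, out_dir: Path
) -> List[Path]:
    """Step 1 (sketch): split the full n-gram index into small chunk files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    for start in range(0, len(rows), chunk_size):
        chunk_path = out_dir / f"index-chunk-{start // chunk_size}.json"
        # JSON stands in for parquet to keep the sketch dependency-free.
        chunk_path.write_text(json.dumps(rows[start : start + chunk_size]))
        chunk_paths.append(chunk_path)
    return chunk_paths


def process_index_chunk(chunk_path: Path) -> int:
    """Step 2 (sketch): upload the rows of exactly one chunk file."""
    rows = json.loads(chunk_path.read_text())
    for row in rows:
        FAKE_INDEX_STORE[f"{row['event_id']}:{row['ngram']}"] = row
    return len(rows)


def demo() -> List[int]:
    """Generate chunks once, then process each chunk independently."""
    rows = [
        {"event_id": f"event-{i % 2}", "ngram": f"gram-{i}", "value": i / 10}
        for i in range(5)
    ]
    with tempfile.TemporaryDirectory() as tmp:
        chunk_paths = generate_index_chunks(rows, chunk_size=2, out_dir=Path(tmp))
        # In the setup described above, each of these calls would run on
        # its own GitHub Actions runner instead of in a local loop.
        return [process_index_chunk(path) for path in chunk_paths]


if __name__ == "__main__":
    print(demo())
```

Because each chunk is processed without reference to any other chunk, the upload step can be fanned out across runners, which is the property the split relies on.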

@evamaxfield added the bug, enhancement, and event index pipeline labels Jun 2, 2022
@evamaxfield self-assigned this Jun 2, 2022
codecov bot commented Jun 2, 2022

Codecov Report

Merging #189 (3f8c9bf) into main (c8b6b57) will decrease coverage by 1.23%.
The diff coverage is 34.09%.

@@            Coverage Diff             @@
##             main     #189      +/-   ##
==========================================
- Coverage   94.60%   93.36%   -1.24%     
==========================================
  Files          50       51       +1     
  Lines        2632     2669      +37     
==========================================
+ Hits         2490     2492       +2     
- Misses        142      177      +35     
Impacted Files                                           Coverage Δ
...end/pipeline/process_event_index_chunk_pipeline.py    0.00% <0.00%> (ø)
cdp_backend/file_store/functions.py                      88.09% <25.00%> (-6.65%) ⬇️
..._backend/pipeline/generate_event_index_pipeline.py    97.33% <88.23%> (ø)
cdp_backend/pipeline/pipeline_config.py                  100.00% <100.00%> (ø)
...ackend/tests/pipeline/test_event_index_pipeline.py    100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@evamaxfield
Member Author

This is ready for review! There is a related cookiecutter-cdp-deployment PR here: CouncilDataProject/cookiecutter-cdp-deployment#108

You can see this pipeline in action here: https://github.com/JacksonMaxfield/cdp-dev/actions/runs/2491590958

Collaborator

@tohuynh left a comment

Nice work!

cdp_backend/pipeline/process_event_index_chunk_pipeline.py (outdated, resolved)
cdp_backend/pipeline/process_event_index_chunk_pipeline.py (outdated, resolved)
cdp_backend/pipeline/generate_event_index_pipeline.py (outdated, resolved)
Collaborator

@isaacna left a comment

Looks good to me! This is a really cool way of optimizing the index pipeline!

@evamaxfield merged commit e100f8c into main Jun 15, 2022
@evamaxfield deleted the feature/reduce-ngram-index branch June 15, 2022 17:57
3 participants