Allow ability to flag Events as "try to scrape again next time" and do so on the next CRON run #212

Open
smai-f opened this issue Sep 11, 2022 · 0 comments
Labels
enhancement New feature or request

Comments

smai-f commented Sep 11, 2022

Feature Description

A way to flag individual events as needing to be scraped again the next time the event gather CRON action runs, regardless of the datetime parameters passed in.

Use Case

We have a further need on top of processing only part of a video. Someone manually adds the timestamps that mark the relevant part of each video a few days after the hearings are uploaded to the legislature website. It is quite likely that when we scrape on our CRON schedule, the timestamps won't be there yet.

If I understand correctly, if the timestamps don't exist our options would be to:

  • fall back on scraping the whole video, which we don't want as they can be several hours long (dealbreaker)
  • skip the video and don't ingest it, but then it would not get picked up again on the next CRON run (dealbreaker)
  • ignore the datetime range passed in, and scrape everything every time

Solution

Since the scraper runs over a datetime range and doesn't really understand things in units of a single event, the MVP could be: if any event within a datetime range is flagged as needing a retry, get_events is called for that same datetime range again on the next run.
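
To make the MVP concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical: `run_gather`, the `scrape` wrapper, and the `(events, needs_retry)` return shape are assumptions for illustration, not the current get_events signature or anything that exists in the pipeline today.

```python
from datetime import datetime
from typing import Callable, List, Tuple

DateRange = Tuple[datetime, datetime]

# Hypothetical wrapper around get_events: returns the scraped events plus a
# flag saying whether any event in the range asked to be retried.
Scrape = Callable[[datetime, datetime], Tuple[list, bool]]


def run_gather(
    scrape: Scrape,
    scheduled: DateRange,
    pending: List[DateRange],
) -> Tuple[list, List[DateRange]]:
    """Scrape previously flagged ranges, then the scheduled range.

    Returns the gathered events and the ranges to retry on the next run.
    """
    events: list = []
    still_pending: List[DateRange] = []
    # dict.fromkeys drops duplicate ranges while preserving order.
    for begin, end in dict.fromkeys([*pending, scheduled]):
        found, needs_retry = scrape(begin, end)
        events.extend(found)
        if needs_retry:
            still_pending.append((begin, end))
    return events, still_pending
```

The list of pending ranges would have to be stored somewhere between CRON runs; where that lives is left open here.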

Ideally, though, it wouldn't have to re-scrape a whole datetime range. Instead, we could provide a function/lambda/callback that runs on the next run for just the event that needs to be revisited, and it would keep running on each subsequent run until the conditions were met for the video to be ingested.
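
A minimal sketch of the per-event version, again with made-up names: `process_retries` and the `RetryCallback` type are assumptions, and events are plain dicts here rather than the real ingestion model.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical: a callback returns the finished event once its timestamps
# exist, or None if the event is still not ready to be ingested.
RetryCallback = Callable[[], Optional[dict]]


def process_retries(retries: Dict[str, RetryCallback]) -> List[dict]:
    """Run every pending callback once and return the events that are ready.

    Callbacks that produce an event are removed from `retries`; the rest
    stay registered so they run again on the next call.
    """
    ready: List[dict] = []
    # Iterate over a copy of the keys because entries are deleted in the loop.
    for event_id in list(retries):
        event = retries[event_id]()
        if event is not None:
            ready.append(event)
            del retries[event_id]
    return ready
```

This keeps the callbacks in memory only. Since each CRON run is presumably a separate process, a real version would likely need to persist the flagged event identifiers and rebuild the callbacks at the start of each run.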

Alternatives

As mentioned above, we can work around this by ignoring the datetime parameters passed to get_events, scraping everything every time, and skipping videos that do not have timestamps yet. Scraping everything every time will probably not be ideal as the number of bills and hearings grows, particularly because our legislature site was literally built in the 90s and hits random errors a lot.

Another idea: programmatically scrape only the videos within the datetime range that already have timestamps, then come back a few days or weeks after the videos were added and manually run event gather for that datetime range to try again for the timestamps. Also not ideal, but it could be a backup.
