
(3/3) Create one-time trigger all historical landing page - fetch all stories #25

Closed
19 of 31 tasks
rivernews opened this issue Aug 21, 2022 · 4 comments

rivernews commented Aug 21, 2022

Better way to run them all

  • Stream-based seems the best, but SQS is hard to use.
    • Enqueue at the landing page level or the story page level? Landing page. We may optionally store stories too.
    • DynamoDB seems like a better option than SQS: it's stable, and our use case doesn't really leverage SQS's strengths.
    • Besides data storage, we still need a way to "drain" all TODO landing pages, plus a one-time operation to "enPool" all historical landing pages (for prod this means ALL).
      • Draining Option 1: DynamoDB/anyStoragePool + recursive/chained Sfn. Each Sfn execution uses the landing page timestamp as its name.
      • Draining Option 2: a cronjob triggers the Sfn every hour ... the nice thing: fetching is spread over a larger time span. Draining mechanism #33
        • S3 landing.html trigger - just write into DDB. Draft for put landing page; identified TODOs #34. Move the metadata-computing part into the new cronjob below.
          • Create a new lambda that writes to DDB; switch the S3 trigger to this lambda.
        • Create a cronjob lambda that reads from DDB - take out just one landing page URL at this point (but make it extendable to iterate over N landing pages), then do the metadata computing.
          • Point the new cronjob to the previous (landing) metadata lambda.
          • In the metadata lambda, switch the source of landingPageUrl from the S3 event to pulling from DDB (do a query; limit=1 serves our "slow start" purpose - see the sketch after this list). Completed tf surgery; Identify all TODOs in golang #35
          • Adding events. This did make it more complicated, as we need to figure out how to append to a list in DDB.
        • Metadata-computing lambda - add an update to isMetadataEverComputed.
        • Sfn: add a step after the map state - log a "FinishStoriesFetchingAll" event into the DDB landing page object's pipelineEvents.
        • Test the entire flow. Need to come up with a test plan first.
        • Ready in prod: one-time batch processing.
    • Other idea: stay purely in S3 and move processed landing pages under another directory?
  • We may use a Slack command as a manual trigger for event-driven processing, aside from S3.
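A minimal sketch of that cronjob's "slow start" query, assuming aws-sdk-go-v2 and a hypothetical sparse GSI keyed on the isDocTypeWaitingForMetadata flag; the table, index, and attribute names here are illustrative, not necessarily the repo's:

```go
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// nextLandingPage pulls a single TODO landing page per cronjob tick.
func nextLandingPage(ctx context.Context, table, index string) (map[string]types.AttributeValue, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return nil, err
	}
	client := dynamodb.NewFromConfig(cfg)

	// Limit=1 implements the "slow start"; raise it later to iterate N pages.
	out, err := client.Query(ctx, &dynamodb.QueryInput{
		TableName:              aws.String(table),
		IndexName:              aws.String(index),
		KeyConditionExpression: aws.String("isDocTypeWaitingForMetadata = :t"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":t": &types.AttributeValueMemberS{Value: "true"},
		},
		Limit: aws.Int32(1),
	})
	if err != nil || len(out.Items) == 0 {
		return nil, err
	}
	return out.Items[0], nil
}
```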

Reference

Proper Throttling

It'd be best to reuse the Sfn but limit the number of concurrent Sfn executions; overall we should aim at 5~100 concurrent lambdas, nothing more. Ideally we can throttle to under 1 request per 2 seconds.

But to truly keep a low profile, it's best to spread the work across hours, if not days.

Moving forward

The daily cronjob should automatically trigger our new S3-driven pipeline. Any other concerns?

  • Staging drill: switch all parameters to prod values, then copy all prod landing pages over to the dev S3 bucket to test.

    • Sfn is not scaling up to 200 - why? Because the landing page didn't have enough stories! Looking at the DB, there are only 88.
  • Improvements - only do low-hanging fruit at this point! Don't take on complicated tasks.

    • Unexpected buggy behavior
    • Better execution insights
    • Easier to debug
      • We might get banned by the Slack API, especially when an S3 batch copy fires S3 events all at once. Can we hide its logs once it's stable? Or at least not log to Slack (but still log to CloudWatch). Landing S3 trigger disable slack log & only log to cloudWatch #40
      • Add the lambda invocation ID (request ID) to the event description; it will help pin down the log.
      • Add an env tag in logs, so we can tell, especially in Slack, whether it's a prod or dev resource. Add env to lambda logger #42
      • Can we optimize our log messages?
      • Better way to query "metadata processed" landing items? Or even better, a lastEventName field to query (but then it could be similar to a scan)? Or just an opposite of isDocTypeWaitingForMetadata, like isDocTypeMetadataDone.
      • Better way to query all story items associated with a landing item?
      • Sfn map improvement: pre-determine the wait time and put it in the Sfn input so it's clearer (see the sketch after this list).
    • Stronger feature
      • Fast track - disable change detection for now - can we de-dup stories for now? If the story HTML is already in S3, skip it. That will significantly boost our first-time processing. Skip story fetch if S3 already exists #41
      • Detect change & censorship: a lot of stories are duplicates [because landing pages are fetched every 12h and haven't changed much in between]... do we still fetch them? I guess we'd better, but we want to store them all, not have them overwrite each other.
    • Save $$
      • Can we move the "random wait" logic into an Sfn Wait state? It seems Sfn Wait doesn't incur charges.
      • Can we lower the function memory to save? Currently 128 MB is provisioned, only 4x MB used.
  • Do the same as above for prod (ready!)
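For the pre-determined wait idea above, a minimal sketch, assuming the metadata lambda builds the Sfn map input; the StoryTask shape, field names, and stagger formula are illustrative. A Wait state reading waitSeconds via SecondsPath would do the sleeping, so the waiting cost lands on Sfn rather than on a running lambda:

```go
package main

import "math/rand"

// StoryTask is a hypothetical shape for one Sfn map item; a Wait state with
// "SecondsPath": "$.waitSeconds" would sleep before the story fetch runs.
type StoryTask struct {
	URL         string `json:"url"`
	WaitSeconds int    `json:"waitSeconds"`
}

// buildTasks staggers stories roughly 2s apart plus jitter, keeping the
// overall rate near 1 request / 2s; the wait is visible in the Sfn input.
func buildTasks(urls []string) []StoryTask {
	tasks := make([]StoryTask, 0, len(urls))
	for i, u := range urls {
		tasks = append(tasks, StoryTask{URL: u, WaitSeconds: i*2 + rand.Intn(3)})
	}
	return tasks
}
```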

rivernews changed the title Create one-time trigger all historical landing page - fetch all stories (3/3) Create one-time trigger all historical landing page - fetch all stories Aug 21, 2022

rivernews commented Sep 19, 2022

DynamoDB Modeling

Primary table: just UUID

Landing page table:

  • fetchedTimeStamp
  • newsSite alias, etc., or a reference to the newsSite object model
  • isMetadataEverComputed <-- this can be used for queries (see the struct sketch below)
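A minimal sketch of the landing page item as a Go struct, assuming aws-sdk-go-v2's attributevalue marshaling; only the attributes bulleted above come from this issue, the tags and extra field are illustrative assumptions:

```go
package main

import (
	"github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// LandingPage sketches the DDB item shape.
type LandingPage struct {
	UUID                   string   `dynamodbav:"uuid"` // primary table key: just UUID
	FetchedTimeStamp       string   `dynamodbav:"fetchedTimeStamp"`
	NewsSiteAlias          string   `dynamodbav:"newsSiteAlias"`          // or a reference to the newsSite object
	IsMetadataEverComputed bool     `dynamodbav:"isMetadataEverComputed"` // candidate GSI key for queries
	PipelineEvents         []string `dynamodbav:"pipelineEvents"`
}

// toItem marshals the struct into the attribute map PutItem expects.
func toItem(p LandingPage) (map[string]types.AttributeValue, error) {
	return attributevalue.MarshalMap(p)
}
```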

Action items

  • Use a separate Terraform module for separation of concerns: Refactor to allow individual tf modules #30
    • Determine a data store to bridge the two modules.
  • Terraform provision a dynamodb table.
    • We may create a GSI early on. Currently isMetadataEverComputed is the important one.


rivernews commented Sep 28, 2022

Test the entire pipeline

  • First of all, destroy any tf table resources.
  • Provision table
  • Provision media stack
  • Invoke the landing lambda manually in the AWS console, so it fetches one single landing HTML and creates the DDB item.
  • Invoke the landing_metadata_cronjob lambda manually. Observe the DDB query successfully getting the landing page item and outputting the metadata.
    • Confirm the DDB landing page item: isDocTypeWaitingForMetadata removed, event added
      • Add index permission to any lambda that uses DDB Query; per this SO answer, aside from the table ARN, the index ARN also has to be added to the resource list. 233ce60
      • S3 pull error
      • DDB list_append only works with two lists, not one list and one item (see the sketch after this list).
      • DDB UpdateItem must specify both the PK and the sort key.
    • Confirm metadata generated in S3
  • Observe the S3 trigger fired by metadata, launching the Sfn in stories
    • Confirm the DDB item event was added
  • Observe the Sfn map executing stories in parallel, and the finalizer lambda stories_finalizer executing
    • Confirm DDB story items are created, events added
    • Confirm the DDB landing page item event is properly added
    • Confirm S3 stories are stored under the proper key directory. Previously there were suspicious stories-... prefixes that shouldn't be there; check the dev bucket.
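A minimal sketch of the event-append update that tripped us up, assuming aws-sdk-go-v2; the key and attribute names (uuid, docType, pipelineEvents) are illustrative. list_append needs two LISTS, so the single event is wrapped in a one-element list, and Key carries both the PK and the sort key:

```go
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func appendEvent(ctx context.Context, client *dynamodb.Client, table, pk, sk, event string) error {
	_, err := client.UpdateItem(ctx, &dynamodb.UpdateItemInput{
		TableName: aws.String(table),
		Key: map[string]types.AttributeValue{ // both PK and sort key are required
			"uuid":    &types.AttributeValueMemberS{Value: pk},
			"docType": &types.AttributeValueMemberS{Value: sk},
		},
		UpdateExpression: aws.String(
			"SET pipelineEvents = list_append(if_not_exists(pipelineEvents, :empty), :e)"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":e": &types.AttributeValueMemberL{Value: []types.AttributeValue{ // wrap the item in a list
				&types.AttributeValueMemberS{Value: event},
			}},
			":empty": &types.AttributeValueMemberL{Value: []types.AttributeValue{}},
		},
	})
	return err
}
```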

rivernews mentioned this issue Sep 29, 2022
rivernews added a commit that referenced this issue Sep 29, 2022
* Ready to test

* Fix db field first char not lowercase
Tracked by #25 (comment)

* Fix permission of db index, S3 pull
Tracked by #25 (comment)

* All tests complete
Tracked by #25 (comment)
rivernews added a commit that referenced this issue Sep 29, 2022
* Complete tf surgery; Identify all TODOs in golang
For #25

* fix compile error; progress in metadata cronjob add query

* Ready to test (#36)

* Ready to test

* Fix db field first char not lowercase
Tracked by #25 (comment)

* Fix permission of db index, S3 pull
Tracked by #25 (comment)

* All tests complete
Tracked by #25 (comment)
rivernews added a commit that referenced this issue Sep 29, 2022
* Draft for put landing page; identified TODOs
Issue: #25

* Completed tf surgery; Identify all TODOs in golang (#35)

* Complete tf surgery; Identify all TODOs in golang
For #25

* fix compile error; progress in metadata cronjob add query

* Ready to test (#36)

* Ready to test

* Fix db field first char not lowercase
Tracked by #25 (comment)

* Fix permission of db index, S3 pull
Tracked by #25 (comment)

* All tests complete
Tracked by #25 (comment)
rivernews added a commit that referenced this issue Sep 29, 2022
* Draining mechanism draft - identify all TODOs
#25 (comment)

* Draft for put landing page; identified TODOs (#34)

* Draft for put landing page; identified TODOs
Issue: #25

* Completed tf surgery; Identify all TODOs in golang (#35)

* Complete tf surgery; Identify all TODOs in golang
For #25

* fix compile error; progress in metadata cronjob add query

* Ready to test (#36)

* Ready to test

* Fix db field first char not lowercase
Tracked by #25 (comment)

* Fix permission of db index, S3 pull
Tracked by #25 (comment)

* All tests complete
Tracked by #25 (comment)

rivernews commented Sep 29, 2022

One time batch processing

Better to build a tool that will remain useful in the future.

Basically: turn S3 object(s) into a brand new DDB item.

  • Generate a brand-new DDB item: the logic is already there in landing
  • Scan an S3 directory at scale. To be useful in the future, we'd better make it flexible
    • Because we store headlines as s3://media-literacy-dev-archives/{redacted}/daily-headlines/2021-08-21T00:11:53Z/..., we can probably scan by time range (see the sketch after this list).
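A minimal sketch of scanning by time range via key prefix, assuming aws-sdk-go-v2; the bucket layout follows the example key above (site stands in for the redacted path segment), and listByDay is a hypothetical helper. Since the timestamps are ISO 8601, they sort lexicographically, so narrowing the prefix narrows the time range:

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// listByDay lists every object under one day's daily-headlines prefix.
func listByDay(ctx context.Context, client *s3.Client, bucket, site, day string) ([]string, error) {
	prefix := fmt.Sprintf("%s/daily-headlines/%s", site, day) // e.g. day = "2021-08-21"
	var keys []string
	p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	})
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			return nil, err
		}
		for _, obj := range page.Contents {
			keys = append(keys, aws.ToString(obj.Key))
		}
	}
	return keys, nil
}
```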

To kick start,

  • How to design a pipeline that is useful in the future?
    • Slack command interface
    • Local CLI tool

Simplest way to do it?

Avoid writing unnecessary code. This one-time thing is going to be used very rarely after the first trigger. Leverage the S3 trigger plus the "move/copy" feature of the S3 bucket. The flow could be:

  • Move the DB PutItem from landing into another lambda; the landing.html S3 trigger invokes this lambda.
    • Fix the lambda package being too large (> 50 MB); have to fix the lambda build process. Look at the TF Lambda API doc.
  • TF Switch to s3:ObjectCreated:Put.
  • Move all landing pages out to a temp place
  • TF Switch to s3:ObjectCreated:Copy
  • Move all landing pages back to the original place -> this will fire the S3 trigger for all of them! But no fetching yet; it just puts them all in the DB. You may do this in dev first before going to prod, since no scraping has happened yet.
    • Check the DB: does the number of landing page entries look right?
  • We didn't manually invoke the metadata cronjob yet - why did it fire anyway as soon as we copied the landing files? Because we're copying over metadata.json as well. Now: just copy landing.html one by one; don't copy the entire directory!
    • When we test, we should just move one single landing dir?
  • Sfn finalizer failed - why?
🛑 ERROR: operation error DynamoDB: Query, https response error StatusCode: 400, 
RequestID: 653RRF4ESDBN7P72AER3M9DAURVV4KQNSO5AEMVJF66Q9ASUAAJG, api error 
ValidationException: One or more parameter values are not valid. 
A value specified for a secondary index key is not supported. 
The AttributeValue for a key attribute cannot contain an empty string value. 
IndexName: s3KeyIndex, IndexKey: s3Key | 2022/09/30 08:57:47
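The ValidationException above means the Query hit the s3KeyIndex GSI with an empty s3Key string. A minimal guard sketch, assuming aws-sdk-go-v2; the attribute and index names are taken from the error log:

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// queryByS3Key refuses to query the GSI with an empty key: DynamoDB rejects
// an empty string for any index key attribute with a ValidationException.
func queryByS3Key(ctx context.Context, client *dynamodb.Client, table, s3Key string) (*dynamodb.QueryOutput, error) {
	if s3Key == "" {
		return nil, fmt.Errorf("queryByS3Key: s3Key is empty; index s3KeyIndex would reject it")
	}
	return client.Query(ctx, &dynamodb.QueryInput{
		TableName:              aws.String(table),
		IndexName:              aws.String("s3KeyIndex"),
		KeyConditionExpression: aws.String("s3Key = :k"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":k": &types.AttributeValueMemberS{Value: s3Key},
		},
	})
}
```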

rivernews commented

There are quite big cost implications, though we don't know the exact amount of $$ we need to pay yet. Moving forward, it's time to think about the fast-track and cost-saving issues. We should have another issue to address these, since they are out of scope and no longer about achieving one-time batch processing.

For now, we will disable the cronjob and pause the pipeline. Next time, we may copy the stories over to prod for reuse. Once we have the fast-track feature #41, those will be skipped and we won't lose the computation done during these days.

rivernews added a commit that referenced this issue Oct 2, 2022
* temp store all

* remove go_poc

* upgrade so project runs on M1

* Try S3 notification

* Fix prefix to include newssite alias

* Fix aws lambda PathError issue

* Save to metadata.json complete

* add untitled stories in metadata.json

* rename stories function to landing_metadata

* rename batch stories fetch tf to metadata

* Improved metadata access s3 event

* Metadata.json trigger computing env

* read parse metadata.json

* fetch a story POC
#24

* Sfn map parallism POC
#24

* randomize requests

* Refactor to allow individual tf modules
address #25 (comment)

* scaffold table

* draft table design

* create table

* Draining mechanism draft - identify all TODOs
#25 (comment)

* Draft for put landing page; identified TODOs
Issue: #25

* Complete tf surgery; Identify all TODOs in golang
For #25

* fix compile error; progress in metadata cronjob add query

* Ready to test

* Fix db field first char not lowercase
Tracked by #25 (comment)

* Fix permission of db index, S3 pull
Tracked by #25 (comment)

* All tests complete
Tracked by #25 (comment)

* Move landing PutItem out to s3 trigger lambda; ready for S3 batch move

* create reusable lambda module; optimize package size
#25 (comment)

* Fix golang build path

* Refactor to use our custom lambda module

* add landing s3 trigger

* rm golang module stories that are renamed

* Fix env var

* Fix permission for PutItem move from landing to s3 trigger

* Fix metadata s3 trigger not fired

* Fix s3 trigger not working - S3 notification can only have one resource

* Make it easier to test

* prod grade setting enabled

* In Sfn pin lambda version, so rolling deploy works better for lambda

* Display sfn map result / target stories count info in finalizer

* stop landing s3 trigger from sending slack logs
Fixes #40

* Let Sfn pin lambda version
Fixes #39

* improve log for metadata trigger

* improve cronjob log

* log cronjob event for better understanding of how it get triggered

* Disable cronjob to better debug
Fixes #43

* workaround to scale up our Sfn pipeline
Fix #44

* improve log for landing S3 trigger

* re-enable prod config plus cronjob