
(3/3) Create one-time trigger all historical landing page - fetch all stories #25

Closed
19 of 31 tasks
rivernews opened this issue Aug 21, 2022 · 4 comments

rivernews commented Aug 21, 2022

Better way to run them all

  • Stream-based seems the best, but SQS is hard to use.
    • Enqueue at the landing page level or the story page level? Landing page. We may optionally store stories too.
    • DynamoDB seems like a better option than SQS: it's stable, and our use case doesn't really leverage SQS's strengths.
    • Besides data storage, we still need a way to "drain" all TODO landing pages, plus a one-time operation to "enPool" all historical landing pages (for prod this means ALL).
      • Draining Option 1: DynamoDB/anyStoragePool + recursive/chained Sfn. Each Sfn execution uses the landing page timestamp as its name.
      • Draining Option 2: a cronjob triggers the Sfn every hour ... the nice thing: fetching is spread over a larger time span. Draining mechanism #33
        • S3 landing.html trigger - just write into DDB. Draft for put landing page; identified TODOs #34. Move the metadata-computing part into the new cronjob below.
          • Create a new lambda that writes to DDB; switch the S3 trigger to this lambda.
        • Create a cronjob lambda that reads from DDB - take out just one landing page URL at this point (but make it extendable to iterate over N landing pages), then do the metadata computing.
          • Point the new cronjob to the previous (landing) metadata lambda.
          • In the metadata lambda, switch the source of landingPageUrl from the S3 event to pulling from DDB (do a query; limit=1 serves our "slow start" purpose - see the sketch after this list). Completed tf surgery; Identify all TODOs in golang #35
          • Adding events. This did make it more complicated, as we need to figure out how to append to a list in DDB.
        • Metadata-computing lambda - add an update to isMetadataEverComputed.
        • Sfn: add a step after the map state - log a "FinishStoriesFetchingAll" event into the DDB landing page object's pipelineEvents.
        • Test the entire flow. Need to come up with a test plan first.
        • Ready in prod: one-time batch processing.
    • Other idea: stay purely in S3 and move processed landing pages under another directory?
  • We may use a Slack command as a manual trigger for event-driven processing, aside from S3.
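A minimal sketch of that cronjob's "slow start" query, assuming aws-sdk-go-v2 and a hypothetical sparse GSI keyed on the isDocTypeWaitingForMetadata flag; the table, index, and attribute names here are illustrative, not necessarily the repo's:

```go
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// nextLandingPage pulls a single TODO landing page per cronjob tick.
func nextLandingPage(ctx context.Context, table, index string) (map[string]types.AttributeValue, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return nil, err
	}
	client := dynamodb.NewFromConfig(cfg)

	// Limit=1 implements the "slow start"; raise it later to iterate N pages.
	out, err := client.Query(ctx, &dynamodb.QueryInput{
		TableName:              aws.String(table),
		IndexName:              aws.String(index),
		KeyConditionExpression: aws.String("isDocTypeWaitingForMetadata = :t"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":t": &types.AttributeValueMemberS{Value: "true"},
		},
		Limit: aws.Int32(1),
	})
	if err != nil || len(out.Items) == 0 {
		return nil, err
	}
	return out.Items[0], nil
}
```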

Reference

Proper Throttling

It'd be best to reuse the Sfn but limit the number of concurrent Sfn executions; overall we should aim at 5~100 concurrent lambdas, nothing more. Ideally we can throttle to under 1 request per 2 seconds.

But to truly keep a low profile, it's best to spread the work across hours, if not days.

Moving forward

The daily cronjob should automatically trigger our new S3-driven pipeline. Any other concerns?

  • Staging drill: switch all parameters to prod values, then copy all prod landing pages over to the dev S3 bucket to test.

    • Sfn is not scaling up to 200 - why? Because the landing page didn't have enough stories! Looking at the DB, there are only 88.
  • Improvements - only do low-hanging fruit at this point! Don't take on complicated tasks.

    • Unexpected buggy behavior
    • Better execution insights
    • Easier to debug
      • We might get banned by the Slack API, especially when an S3 batch copy fires S3 events all at once. Can we hide its logs once it's stable? Or at least not log to Slack (but still log to CloudWatch). Landing S3 trigger disable slack log & only log to cloudWatch #40
      • Add the lambda invocation ID (request ID) to the event description; it will help pin down the log.
      • Add an env tag in logs, so we can tell, especially in Slack, whether it's a prod or dev resource. Add env to lambda logger #42
      • Can we optimize our log messages?
      • Better way to query "metadata processed" landing items? Or even better, a lastEventName field to query (but then it could be similar to a scan)? Or just an opposite of isDocTypeWaitingForMetadata, like isDocTypeMetadataDone.
      • Better way to query all story items associated with a landing item?
      • Sfn map improvement: pre-determine the wait time and put it in the Sfn input so it's clearer (see the sketch after this list).
    • Stronger feature
      • Fast track - disable change detection for now - can we de-dup stories for now? If the story HTML is already in S3, skip it. That will significantly boost our first-time processing. Skip story fetch if S3 already exists #41
      • Detect change & censorship: a lot of stories are duplicates [because landing pages are fetched every 12h and haven't changed much in between]... do we still fetch them? I guess we'd better, but we want to store them all, not have them overwrite each other.
    • Save $$
      • Can we move the "random wait" logic into an Sfn Wait state? It seems Sfn Wait doesn't incur charges.
      • Can we lower the function memory to save? Currently 128 MB is provisioned, only 4x MB used.
  • Do the same as above for prod (ready!)
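For the pre-determined wait idea above, a minimal sketch, assuming the metadata lambda builds the Sfn map input; the StoryTask shape, field names, and stagger formula are illustrative. A Wait state reading waitSeconds via SecondsPath would do the sleeping, so the waiting cost lands on Sfn rather than on a running lambda:

```go
package main

import "math/rand"

// StoryTask is a hypothetical shape for one Sfn map item; a Wait state with
// "SecondsPath": "$.waitSeconds" would sleep before the story fetch runs.
type StoryTask struct {
	URL         string `json:"url"`
	WaitSeconds int    `json:"waitSeconds"`
}

// buildTasks staggers stories roughly 2s apart plus jitter, keeping the
// overall rate near 1 request / 2s; the wait is visible in the Sfn input.
func buildTasks(urls []string) []StoryTask {
	tasks := make([]StoryTask, 0, len(urls))
	for i, u := range urls {
		tasks = append(tasks, StoryTask{URL: u, WaitSeconds: i*2 + rand.Intn(3)})
	}
	return tasks
}
```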

rivernews changed the title Create one-time trigger all historical landing page - fetch all stories (3/3) Create one-time trigger all historical landing page - fetch all stories Aug 21, 2022

rivernews commented Sep 19, 2022

DynamoDB Modeling

Primary table: just UUID

Landing page table:

  • fetchedTimeStamp
  • newsSite alias, etc., or a reference to the newsSite object model
  • isMetadataEverComputed <-- this can be used for queries (see the struct sketch below)
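A minimal sketch of the landing page item as a Go struct, assuming aws-sdk-go-v2's attributevalue marshaling; only the attributes bulleted above come from this issue, the tags and extra field are illustrative assumptions:

```go
package main

import (
	"github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// LandingPage sketches the DDB item shape.
type LandingPage struct {
	UUID                   string   `dynamodbav:"uuid"` // primary table key: just UUID
	FetchedTimeStamp       string   `dynamodbav:"fetchedTimeStamp"`
	NewsSiteAlias          string   `dynamodbav:"newsSiteAlias"`          // or a reference to the newsSite object
	IsMetadataEverComputed bool     `dynamodbav:"isMetadataEverComputed"` // candidate GSI key for queries
	PipelineEvents         []string `dynamodbav:"pipelineEvents"`
}

// toItem marshals the struct into the attribute map PutItem expects.
func toItem(p LandingPage) (map[string]types.AttributeValue, error) {
	return attributevalue.MarshalMap(p)
}
```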

Action items

  • Use a separate Terraform module for separation of concerns: Refactor to allow individual tf modules #30
    • Determine a data store to bridge the two modules.
  • Terraform provision a dynamodb table.
    • We may create a GSI early on. Currently isMetadataEverComputed is the important one.


rivernews commented Sep 28, 2022

Test the entire pipeline

  • First of all, destroy any tf table resources.
  • Provision table
  • Provision media stack
  • Invoke the landing lambda manually in the AWS console, so it fetches one single landing HTML and creates the DDB item.
  • Invoke the landing_metadata_cronjob lambda manually. Observe the DDB query successfully getting the landing page item and outputting the metadata.
    • Confirm the DDB landing page item: isDocTypeWaitingForMetadata removed, event added
      • Add index permission to any lambda that uses DDB Query; per this SO answer, aside from the table ARN, the index ARN also has to be added to the resource list. 233ce60
      • S3 pull error
      • DDB list_append only works with two lists, not one list and one item (see the sketch after this list).
      • DDB UpdateItem must specify both the PK and the sort key.
    • Confirm metadata generated in S3
  • Observe the S3 trigger fired by metadata, launching the Sfn in stories
    • Confirm the DDB item event was added
  • Observe the Sfn map executing stories in parallel, and the finalizer lambda stories_finalizer executing
    • Confirm DDB story items are created, events added
    • Confirm the DDB landing page item event is properly added
    • Confirm S3 stories are stored under the proper key directory. Previously there were suspicious stories-... prefixes that shouldn't be there; check the dev bucket.
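A minimal sketch of the event-append update that tripped us up, assuming aws-sdk-go-v2; the key and attribute names (uuid, docType, pipelineEvents) are illustrative. list_append needs two LISTS, so the single event is wrapped in a one-element list, and Key carries both the PK and the sort key:

```go
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func appendEvent(ctx context.Context, client *dynamodb.Client, table, pk, sk, event string) error {
	_, err := client.UpdateItem(ctx, &dynamodb.UpdateItemInput{
		TableName: aws.String(table),
		Key: map[string]types.AttributeValue{ // both PK and sort key are required
			"uuid":    &types.AttributeValueMemberS{Value: pk},
			"docType": &types.AttributeValueMemberS{Value: sk},
		},
		UpdateExpression: aws.String(
			"SET pipelineEvents = list_append(if_not_exists(pipelineEvents, :empty), :e)"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":e": &types.AttributeValueMemberL{Value: []types.AttributeValue{ // wrap the item in a list
				&types.AttributeValueMemberS{Value: event},
			}},
			":empty": &types.AttributeValueMemberL{Value: []types.AttributeValue{}},
		},
	})
	return err
}
```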

rivernews mentioned this issue Sep 29, 2022
rivernews added a commit that referenced this issue Sep 29, 2022
* Ready to test

* Fix db field first char not lowercase
Tracked by #25 (comment)

* Fix permission of db index, S3 pull
Tracked by #25 (comment)

* All tests complete
Tracked by #25 (comment)
rivernews added a commit that referenced this issue Sep 29, 2022
* Complete tf surgery; Identify all TODOs in golang
For #25

* fix compile error; progress in metadata cronjob add query

* Ready to test (#36)

* Ready to test

* Fix db field first char not lowercase
Tracked by #25 (comment)

* Fix permission of db index, S3 pull
Tracked by #25 (comment)

* All tests complete
Tracked by #25 (comment)
rivernews added a commit that referenced this issue Sep 29, 2022
* Draft for put landing page; identified TODOs
Issue: #25

* Completed tf surgery; Identify all TODOs in golang (#35)

* Complete tf surgery; Identify all TODOs in golang
For #25

* fix compile error; progress in metadata cronjob add query

* Ready to test (#36)

* Ready to test

* Fix db field first char not lowercase
Tracked by #25 (comment)

* Fix permission of db index, S3 pull
Tracked by #25 (comment)

* All tests complete
Tracked by #25 (comment)
rivernews added a commit that referenced this issue Sep 29, 2022
* Draining mechanism draft - identify all TODOs
#25 (comment)

* Draft for put landing page; identified TODOs (#34)

* Draft for put landing page; identified TODOs
Issue: #25

* Completed tf surgery; Identify all TODOs in golang (#35)

* Complete tf surgery; Identify all TODOs in golang
For #25

* fix compile error; progress in metadata cronjob add query

* Ready to test (#36)

* Ready to test

* Fix db field first char not lowercase
Tracked by #25 (comment)

* Fix permission of db index, S3 pull
Tracked by #25 (comment)

* All tests complete
Tracked by #25 (comment)

rivernews commented Sep 29, 2022

One time batch processing

Better to build a tool that will remain useful in the future.

Basically: turn S3 object(s) into a brand new DDB item.

  • Generate a brand-new DDB item: the logic is already there in landing
  • Scan an S3 directory at scale. To be useful in the future, we'd better make it flexible
    • Because we store headlines as s3://media-literacy-dev-archives/{redacted}/daily-headlines/2021-08-21T00:11:53Z/..., we can probably scan by time range (see the sketch after this list).
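A minimal sketch of scanning by time range via key prefix, assuming aws-sdk-go-v2; the bucket layout follows the example key above (site stands in for the redacted path segment), and listByDay is a hypothetical helper. Since the timestamps are ISO 8601, they sort lexicographically, so narrowing the prefix narrows the time range:

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// listByDay lists every object under one day's daily-headlines prefix.
func listByDay(ctx context.Context, client *s3.Client, bucket, site, day string) ([]string, error) {
	prefix := fmt.Sprintf("%s/daily-headlines/%s", site, day) // e.g. day = "2021-08-21"
	var keys []string
	p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	})
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			return nil, err
		}
		for _, obj := range page.Contents {
			keys = append(keys, aws.ToString(obj.Key))
		}
	}
	return keys, nil
}
```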

To kick start,

  • How to design a pipeline that is useful in the future?
    • Slack command interface
    • Local CLI tool

Simplest way to do it?

Avoid writing unnecessary code. This one-time thing is going to be used very rarely after the first trigger. Leverage the S3 trigger plus the "move/copy" feature of the S3 bucket. The flow could be:

  • Move the DB PutItem from landing into another lambda; the landing.html S3 trigger invokes this lambda.
    • Fix the lambda package being too large (> 50 MB); have to fix the lambda build process. Look at the TF Lambda API doc.
  • TF Switch to s3:ObjectCreated:Put.
  • Move all landing pages out to a temp place
  • TF Switch to s3:ObjectCreated:Copy
  • Move all landing pages back to the original place -> this will fire the S3 trigger for all of them! But no fetching yet; it just puts them all in the DB. You may do this in dev first before going to prod, since no scraping has happened yet.
    • Check the DB: does the number of landing page entries look right?
  • We didn't manually invoke the metadata cronjob yet - why did it fire anyway as soon as we copied the landing files? Because we're copying over metadata.json as well. Now: just copy landing.html one by one; don't copy the entire directory!
    • When we test, we should just move one single landing dir?
  • Sfn finalizer failed - why?
🛑 ERROR: operation error DynamoDB: Query, https response error StatusCode: 400, 
RequestID: 653RRF4ESDBN7P72AER3M9DAURVV4KQNSO5AEMVJF66Q9ASUAAJG, api error 
ValidationException: One or more parameter values are not valid. 
A value specified for a secondary index key is not supported. 
The AttributeValue for a key attribute cannot contain an empty string value. 
IndexName: s3KeyIndex, IndexKey: s3Key | 2022/09/30 08:57:47
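The ValidationException above means the Query hit the s3KeyIndex GSI with an empty s3Key string. A minimal guard sketch, assuming aws-sdk-go-v2; the attribute and index names are taken from the error log:

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// queryByS3Key refuses to query the GSI with an empty key: DynamoDB rejects
// an empty string for any index key attribute with a ValidationException.
func queryByS3Key(ctx context.Context, client *dynamodb.Client, table, s3Key string) (*dynamodb.QueryOutput, error) {
	if s3Key == "" {
		return nil, fmt.Errorf("queryByS3Key: s3Key is empty; index s3KeyIndex would reject it")
	}
	return client.Query(ctx, &dynamodb.QueryInput{
		TableName:              aws.String(table),
		IndexName:              aws.String("s3KeyIndex"),
		KeyConditionExpression: aws.String("s3Key = :k"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":k": &types.AttributeValueMemberS{Value: s3Key},
		},
	})
}
```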

rivernews commented

There are quite big cost implications, though we don't know the exact amount of $$ we need to pay yet. Moving forward, it's time to think about the fast-track and cost-saving issues. We should have another issue to address these, since they are out of scope and no longer about achieving one-time batch processing.

For now, we will disable the cronjob and pause the pipeline. Next time, we may copy the stories over to prod for reuse. Once we have the fast-track feature #41, those will be skipped and we won't lose the computation done during these days.

rivernews added a commit that referenced this issue Oct 2, 2022
* temp store all

* remove go_poc

* upgrade so project runs on M1

* Try S3 notification

* Fix prefix to include newssite alias

* Fix aws lambda PathError issue

* Save to metadata.json complete

* add untitled stories in metadata.json

* rename stories function to landing_metadata

* rename batch stories fetch tf to metadata

* Improved metadata access s3 event

* Metadata.json trigger computing env

* read parse metadata.json

* fetch a story POC
#24

* Sfn map parallism POC
#24

* randomize requests

* Refactor to allow individual tf modules
address #25 (comment)

* scaffold table

* draft table design

* create table

* Draining mechanism draft - identify all TODOs
#25 (comment)

* Draft for put landing page; identified TODOs
Issue: #25

* Complete tf surgery; Identify all TODOs in golang
For #25

* fix compile error; progress in metadata cronjob add query

* Ready to test

* Fix db field first char not lowercase
Tracked by #25 (comment)

* Fix permission of db index, S3 pull
Tracked by #25 (comment)

* All tests complete
Tracked by #25 (comment)

* Move landing PutItem out to s3 trigger lambda; ready for S3 batch move

* create reusable lambda module; optimize package size
#25 (comment)

* Fix golang build path

* Refactor to use our custom lambda module

* add landing s3 trigger

* rm golang module stories that are renamed

* Fix env var

* Fix permission for PutItem move from landing to s3 trigger

* Fix metadata s3 trigger not fired

* Fix s3 trigger not working - S3 notification can only have one resource

* Make it easier to test

* prod grade setting enabled

* In Sfn pin lambda version, so rolling deploy works better for lambda

* Display sfn map result / target stories count info in finalizer

* stop landing s3 trigger from sending slack logs
Fixes #40

* Let Sfn pin lambda version
Fixes #39

* improve log for metadata trigger

* improve cronjob log

* log cronjob event for better understanding of how it get triggered

* Disable cronjob to better debug
Fixes #43

* workaround to scale up our Sfn pipeline
Fix #44

* improve log for landing S3 trigger

* re-enable prod config plus cronjob