Scraping indie game subreddits for developers and games #2

Closed
37 tasks done
andygello555 opened this issue Jan 13, 2023 · 0 comments · Fixed by #33
Labels
enhancement New feature or request

andygello555 commented Jan 13, 2023

Proposition

This would be a process that is very similar to the current Twitter scraping process.
We will need to check the following subreddits daily:

  • r/GameDevelopment
  • r/SoloDevelopment
  • r/gamedesign
  • r/IndieDev
  • r/truegamedev
  • r/gamedev
  • r/IndieGaming
  • r/gamedevscreens

What will need to change

  • New tables
    • RedditDeveloper table to track individual developers found on Reddit
    • RedditDeveloperSnapshot table to save snapshots of RedditDevelopers
  • Need to alter the Games table to include references to RedditDevelopers (probably in a new array column)
  • Need to create a whole new scraping process for Reddit
    • Either add more Phases to the Scout procedure
    • Or add a completely new set of Phase.Runs for the Reddit scrape
  • Would also need to add a bunch more cached fields to the ScoutState object

How will scraping work?

  • Discovery
    • Fetch the top X posts divided equally between the subreddits mentioned above + max Y comments for each post
    • For each Post we will see if we can extract any game URLs from the:
      • Post title
      • Post body
      • Any of the comments from the OP
    • Aggregate post metrics + comment engagement + author's karma
    • Run any Game URLs through the storefront scrapers
  • Update (this will probably need to run after the Twitter updates have ended because they need to be synced up in entirely different ways).
    • For any Developers marked as Reddit devs, we will fetch the posts that they have made to the subreddits mentioned above in the last 7 days
    • Aggregate post metrics + comment engagement + author's karma
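
As a rough illustration of the Discovery step above, here is a minimal Go sketch of pulling game URLs out of a post's title, body, and OP comments. The `Post` and `Comment` types, the `ExtractGameURLs` name, and the URL patterns are all assumptions made for this example; they are not the project's reddit types or storefront scrapers.

```go
package main

import (
	"fmt"
	"regexp"
)

// Post is a hypothetical, pared-down view of a Reddit post. The field names
// are assumptions and do not mirror the project's reddit package.
type Post struct {
	Author   string
	Title    string
	Body     string
	Comments []Comment
}

// Comment is a hypothetical comment left on a Post.
type Comment struct {
	Author string
	Body   string
}

// storefrontURL matches Steam app pages and Itch.IO game pages. These patterns
// are illustrative assumptions only.
var storefrontURL = regexp.MustCompile(
	`https?://(?:store\.steampowered\.com/app/\d+|[a-z0-9-]+\.itch\.io/[a-z0-9-]+)`,
)

// ExtractGameURLs collects unique storefront URLs from the post title, the
// post body, and any comments written by the original poster, in that order.
func ExtractGameURLs(p Post) []string {
	texts := []string{p.Title, p.Body}
	for _, c := range p.Comments {
		// Only comments from the OP are considered, as in the proposal.
		if c.Author == p.Author {
			texts = append(texts, c.Body)
		}
	}

	seen := make(map[string]struct{})
	var urls []string
	for _, text := range texts {
		for _, u := range storefrontURL.FindAllString(text, -1) {
			if _, ok := seen[u]; !ok {
				seen[u] = struct{}{}
				urls = append(urls, u)
			}
		}
	}
	return urls
}

func main() {
	post := Post{
		Author: "op",
		Title:  "My game is out: https://store.steampowered.com/app/12345/",
		Body:   "Also on https://dev.itch.io/my-game",
		Comments: []Comment{
			{Author: "op", Body: "Link again: https://store.steampowered.com/app/12345"},
			{Author: "other", Body: "Try https://store.steampowered.com/app/999"},
		},
	}
	for _, u := range ExtractGameURLs(post) {
		fmt.Println(u)
	}
}
```

The URLs returned here would then be the ones handed to the storefront scrapers; links posted by other commenters are deliberately ignored.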

TODO implementation:

  • These two updated columns now take usernames of both Reddit and Twitter users. Twitter usernames will be prefixed with "T" and Reddit usernames will be prefixed with "R". We need to double check the usages of these fields, as well as their DB column counterparts, to ensure that we are setting and using these fields properly after these changes
    • Check all usages of Game.Developers as well as its column: games.developers
    • Check all usages of Game.VerifiedDeveloperUsernames as well as its column: games.verified_developer_usernames
  • We need to add new versions of the UserTweetTimes and DeveloperSnapshots cached fields for Reddit developers
  • Update Discovery phase
    • Create TestDiscoveryPhase test
  • Update Update phase
    • Update TestUpdatePhase test
  • Update Snapshot phase
    • Update TestSnapshotPhase test
  • Update Disable phase
    • Update TestDisablePhase test
  • Update Enable phase
    • Update TestEnablePhase test
  • Update Delete phase. Wasn't that hard to update as we don't need to distinguish between the different Developer types here
    • Update TestDeletePhase test
  • Update Measure phase
    • Change wording in Measure emails regarding Twitter developers
      • Need to use RedditPostIDs sometimes instead of TweetIDs
    • New Reddit "From" label needed in linked Monday board + need to change Monday config to find where a game was found by looking at its developers
    • Add a test for the Measure phase
      • Need a new mapping config for tests + new Monday board
  • Add the new Reddit metric fields in DeveloperSnapshot as weighted fields
    • RedditPublicMetrics
    • PostPublicMetrics
  • SubredditFetch procedure should check rate limits for non-paginateable bindings
  • Add new constants to README
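
To make the "T"/"R" prefix convention from the first TODO item concrete, here is a small Go sketch of writing and reading prefixed usernames. The `DeveloperType` string representation and the helper names `Prefix`, `ParsePrefixed`, and `FilterByType` are assumptions for illustration; the real models package may represent these differently.

```go
package main

import "fmt"

// DeveloperType is a hypothetical stand-in for the project's enum; only the
// "T" and "R" prefix characters come from the issue text.
type DeveloperType string

const (
	TwitterDeveloperType DeveloperType = "T"
	RedditDeveloperType  DeveloperType = "R"
)

// Prefix returns the username in the form assumed to be stored in
// games.developers and games.verified_developer_usernames.
func Prefix(t DeveloperType, username string) string {
	return string(t) + username
}

// ParsePrefixed splits a stored username into its type and bare username.
func ParsePrefixed(stored string) (DeveloperType, string, error) {
	if len(stored) < 2 {
		return "", "", fmt.Errorf("stored username %q is too short", stored)
	}
	switch t := DeveloperType(stored[:1]); t {
	case TwitterDeveloperType, RedditDeveloperType:
		return t, stored[1:], nil
	default:
		return "", "", fmt.Errorf("stored username %q has unknown prefix %q", stored, stored[:1])
	}
}

// FilterByType returns the bare usernames of the wanted type, skipping any
// entries that fail to parse.
func FilterByType(stored []string, want DeveloperType) []string {
	var out []string
	for _, s := range stored {
		if t, name, err := ParsePrefixed(s); err == nil && t == want {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	developers := []string{
		Prefix(TwitterDeveloperType, "Ralph"),
		Prefix(RedditDeveloperType, "some_dev"),
	}
	fmt.Println(developers)
	fmt.Println(FilterByType(developers, RedditDeveloperType))
}
```

Note that a Twitter user called "Ralph" is stored as "TRalph". Any code path that still reads these columns as bare usernames would misinterpret such values, which is why every usage needs checking.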

TODO after implementation:

  • Set developers.type column to "T" (for TwitterDeveloperType): update developers set type = 'T';
  • Prefix each developer username in games.developers with "T" (for TwitterDeveloperType) for each Game in games: update games set (developers, verified_developer_usernames) = (array(select 'T' || unnest from unnest(developers)), array(select 'T' || unnest from unnest(verified_developer_usernames)));
  • Same thing as the above for the games.verified_developer_usernames column
  • Set developers.current_reddit_* to 0? Might not need to, because we might check for nil anyway for RedditPublicMetrics. Gorm always returns the empty embedded struct even if all columns in the struct are null in the DB
  • Set developer_snapshots.reddit_user_* to 0? Might not need to, because we might check for nil anyway for RedditPublicMetrics. Gorm always returns the empty embedded struct even if all columns in the struct are null in the DB
  • Set developer_snapshots.total_post_* to zero values? Might not need to, because we might check for nil anyway for PostPublicMetrics. Gorm always returns the empty embedded struct even if all columns in the struct are null in the DB
  • Might need to set developer_snapshots.reddit_post_ids to its default
  • Make sure the new config.json constants and Monday values are changed to their appropriate values
  • Might need to increase constants.scout_timeout in config.json
  • Need to run the update computed fields procedure because of the changes to the weights of the weighted fields

To reset DB to state without Reddit developers:

Delete developers

delete from developers where id in (select r.id from (select id, ('R' || username) username from developers where type = 'R') r, games where r.username = ANY(games.developers) and cardinality(developers) = 1);

Remove Reddit usernames from games

update games set developers = array_remove(developers, r2.unnest) from (select * from (select unnest(developers) from games where 'R' = ANY(select left(unnest, 1) from unnest(developers))) r where left(r.unnest, 1) = 'R') r2 where r2.unnest = any(developers);

Delete any games with no verified/non-verified developers

delete from games where cardinality(developers) = 0 and cardinality(verified_developer_usernames) = 0;

Monday.com Item ID: #3806777653

@andygello555 andygello555 added the enhancement New feature or request label Jan 13, 2023
@andygello555 andygello555 self-assigned this Jan 13, 2023
@andygello555 andygello555 added enhancement New feature or request and removed enhancement New feature or request labels Jan 13, 2023
andygello555 added a commit that referenced this issue Mar 8, 2023
- Refactored all the Binding + Paginator + API types and interfaces to the api package. This works similarly to the Binding interface that existed within the monday package except the request produced by Binding.Request is an interface (api.Request), and the client taken by Binding.Execute is also an interface (api.Client) (07/03/2023 - 16:28:13)
- This allows us to create entire schemas of bindings that we can then add to an instance of the API type which acts as a wrapper for a set of Bindings (07/03/2023 - 16:29:14)
- Removed the definitions of types that are now defined in api from the monday package (07/03/2023 - 16:29:44)
- Updated all the bindings in the monday and models packages to use the new function signatures (07/03/2023 - 16:30:07)
- Updated the monday subcommand in the CLI and the Measure phase to use the new paginator type (07/03/2023 - 16:30:38)
- Upgrade gotils to v2.1.2 (08/03/2023 - 12:59:02)
- Refactored all the NewBinding logic into the api package. This means that Bindings for the entire project can be created through the api.NewBinding method. Due to this, I have removed the now unnecessary code from monday/bindings.go as well as changed the Bindings in the monday and models packages to use this new API (08/03/2023 - 15:17:54)
- Added the reddit/types.go file to hold all the return and response types for the Reddit API bindings (08/03/2023 - 16:23:19)
- Paginator instance has now been replaced by the Paginator interface and two new Paginator implementations: typedPaginator and paginator. NewTypedPaginator returns a Paginator that is type aware. This can only be used by Bindings that are set to their own global variables. NewPaginator returns a Paginator that is not type aware, instead it returns a Paginator[[]any, []any] (08/03/2023 - 18:27:25)
- Added a refresh to reddit.Client.Run (08/03/2023 - 18:50:40)
andygello555 added a commit that referenced this issue Mar 10, 2023
- Made the Binding interface chainable and started the argument type checking with the new BindingParam type and Params() interface method #24 (09/03/2023 - 11:52:58)
- Added type checks to bindingProto.Execute #24 (09/03/2023 - 13:39:05)
- Added a way of defining BindingParams for interface values that will also be type checked by bindingProto.TypeCheckArgs #24 (10/03/2023 - 16:28:50)
- Added the api.Params function that can construct a list of BindingParams from a list of argument groupings (10/03/2023 - 16:43:24)
- Updated all the Bindings created via derivatives of the NewBinding method to set the Params method of their respective Bindings (10/03/2023 - 16:44:37)
- Added an example for the Params function (10/03/2023 - 16:44:56)
- Added a test for the Params function (10/03/2023 - 16:45:22)
- Added a test for the bindingProto.TypeCheckArgs (10/03/2023 - 16:45:39)
- Added some more fields to the BindingProto struct to aid with interface checking (10/03/2023 - 16:46:11)
- I may have gone a bit overboard with the api package... (10/03/2023 - 16:48:39)
- Moved everything to do with paginators, bindings, and params in api to their own files to tidy things up a bit (10/03/2023 - 16:56:31)
- Updated top binding for the reddit API (10/03/2023 - 18:50:55)
- Added a lot more types to the reddit API types.go file. The listingWrapper probably needs more fleshing out #2 (10/03/2023 - 18:51:38)
- Added the Binding.ArgsFromStrings interface method and implementation (10/03/2023 - 18:53:12)
andygello555 added a commit that referenced this issue Mar 14, 2023
- Added the reddit.RateLimit type for tracking rate limits in the reddit API, as well as the RateLimitsConfig interface for configuring rates (13/03/2023 - 15:16:26)
- A lot more reddit.Types taken mostly from https://github.com/vartanbeno/go-reddit/reddit/things.go (13/03/2023 - 15:23:14)
- Paginators are now aware of multiple different types of pagination parameters thanks to the Binding.Params method (14/03/2023 - 14:14:10)
- The untyped paginator is now the type Paginator[any, any] which makes more sense and allows us to actually use the Afterable interface (14/03/2023 - 16:22:05)
andygello555 added a commit that referenced this issue Mar 16, 2023
- Paginators now handle paginator params that are out of order (15/03/2023 - 14:37:32)
- Fully implemented the top, comments, and user_where bindings in the reddit.API (16/03/2023 - 15:07:18)
andygello555 added a commit that referenced this issue Mar 20, 2023
- Added the Paginator.Until method that can be supplied a predicate function (20/03/2023 - 14:36:47)
- Added the start of the RedditDiscoveryPhase (20/03/2023 - 14:37:04)
- Added the SubredditFetch function and a test for it in main_test (20/03/2023 - 14:37:17)
- Added a few constants to ScrapeConstants for the new Reddit discovery coroutine (20/03/2023 - 14:37:36)
andygello555 added a commit that referenced this issue Mar 20, 2023
andygello555 added a commit that referenced this issue Mar 21, 2023
- Added the DeveloperType enum type (21/03/2023 - 15:56:35)
- Added the Developer.Type field of type DeveloperType (21/03/2023 - 15:56:55)
- Added the Developer.RedditPublicMetrics field of type RedditUserMetrics which stores karma for RedditDeveloperTypes (21/03/2023 - 15:57:31)
- Changed DB types of Game.Developers and Game.VerifiedDeveloperUsernames to varchar(20)[]. 20 chars now instead of 18 to suit Reddit usernames (21/03/2023 - 15:58:27)
- Added the DeveloperSnapshot.RedditPostIDs pq.StringArray field to store the names of subreddits and the ID of a post as pairs of strings (21/03/2023 - 16:07:15)
- Added the samples/sampleRedditPostsCommentsAndUsers.json file which might come in handy for testing (21/03/2023 - 19:03:26)
- Registered the DeveloperType enum in db.init() (21/03/2023 - 19:03:42)
- Updated DiscoveryBatch to take a PostIterable interface instead of a dictionary of tweets so that I can pass an array of PostCommentsAndUser instances to it (21/03/2023 - 19:04:49)
- General refactoring of processes related to the DiscoveryBatch procedure (21/03/2023 - 19:05:44)
andygello555 added a commit that referenced this issue Mar 23, 2023
- Modified Developers and VerifiedDeveloperUsernames fields in Game to take a list of DeveloperType prefixed Usernames so we can keep both Reddit and Twitter usernames in the same arrays (23/03/2023 - 10:06:47)
- The above change meant that I had to modify some other bits of code, most notably: Developer.Games, Game scrape procedures, DeveloperSnapshot.calculateGameField, MeasurePhase, DeletePhase (23/03/2023 - 10:08:33)
- Added the RedditUserPostTimes and the RedditDeveloperSnapshots cached fields (23/03/2023 - 13:19:06)
- Added a goroutine for RedditDiscoveryPhase into the DiscoveryPhase procedure (23/03/2023 - 14:16:48)
- Paginators now try multiple times to get the latest rate limit (23/03/2023 - 15:43:40)
- Added write mutexes to the bindingProto for the attrs map and the attrFuncs slice (23/03/2023 - 16:35:06)
- paginatorCheckRateLimit will now only return an error if the rate limit cannot be found rather than also if the latest rate limit is before the current time (23/03/2023 - 17:27:33)
- Added the Log method to the RateLimitedClient interface so I could see what is going on with rate limit errors being thrown (23/03/2023 - 17:29:48)
andygello555 added a commit that referenced this issue Mar 28, 2023
- Added the PostsConsumed field to DiscoveryUpdateSnapshotStats (24/03/2023 - 11:03:21)
- Replaced all occurrences of time.Now() with time.Now().UTC() because heck timezones (24/03/2023 - 11:58:04)
- Paginator.Until's predicate function now takes the currently collected pages (27/03/2023 - 11:33:27)
- Updated UpdateDeveloper to deal with Reddit Developers. It does this by making paginated requests to the user_overview binding until it can find 11 (or fewer) total posts and comments that exist after the start time (27/03/2023 - 13:26:06)
- Waiting for Scout procedure running on VM to finish before updating the TestUpdatePhase test (27/03/2023 - 13:26:52)
- Turned CachedFieldIterator into an interface then made the old implementation into an implementation of this interface. I then created the mergedCachedFieldIterators implementation that can iterate over many of the CachedFieldIterators (27/03/2023 - 13:51:33)
- Added the CachedField.Type method which returns the CachedFieldType of a CachedField which is useful for the MergedCachedFieldIterator (27/03/2023 - 14:02:59)
- Added the CachedFieldIterator.Field method which returns the cached field that the iterator is for (27/03/2023 - 14:03:32)
- Modified the SnapshotsPhase to handle Reddit developers (27/03/2023 - 14:50:15)
- Added the sampleRedditIDsUsernames.csv sample as well as updating sampleUserIDs.txt to be 50/50 Twitter and Reddit users (27/03/2023 - 15:27:57)
- Think SnapshotPhase and TestSnapshotPhase have now been updated (28/03/2023 - 16:34:55)
- Updated gotils dependency to newest version with shiny new slices methods (28/03/2023 - 16:35:32)
andygello555 added a commit that referenced this issue Apr 5, 2023
- Added the Error Reddit Type for returning more descriptive errors from Client.Run (29/03/2023 - 18:02:27)
- Updated the TestDisablePhase test to check the disabled developers for both Twitter and Reddit developers (30/03/2023 - 12:48:58)
- Updated the DisablePhase to iterate over each models.DeveloperType and disable the required amount for each (30/03/2023 - 12:49:23)
- Updated the State.UpdatedDevelopers/DisabledDevelopers/EnabledDevelopers fields to be an array of models.DeveloperMinimal which contains a models.DeveloperType field that can be used to filter these arrays (30/03/2023 - 14:35:50)
- State.DevelopersToEnable is now also a map of models.DeveloperType to integer counts after the refactor of the EnablePhase (30/03/2023 - 14:36:37)
- Refactored EnablePhase as well as the test for it (30/03/2023 - 14:37:39)
- Changed createFakeDevelopersWithSnaps to be zero-indexed rather than one-indexed because it was getting really annoying (30/03/2023 - 14:56:15)
andygello555 added a commit that referenced this issue Apr 5, 2023
- Starting the new TestMeasurePhase test (03/04/2023 - 16:08:58)
- Added the MondayConfig.TestMapping field which is a duplicate type of Mapping that is the mapping that will be used in testing (03/04/2023 - 16:16:23)
- createFakeDevelopersWithSnaps now also creates Itch.IO games (03/04/2023 - 16:25:22)
- Updated the whereZeroVerified query in the DeletePhase (04/04/2023 - 11:27:59)
- Added the models.Game.Advocates method that returns the models.Developers that have tweeted/posted about a given Game (04/04/2023 - 11:29:03)
- Added the postIDPairs methods to MeasureContext.Funcs that will construct an array of structs containing the subreddit and post pairs within the DeveloperSnapshot.RedditPostIDs array (04/04/2023 - 11:30:15)
- Updated the measure email template to display the correct post IDs depending on the Developer's type + username is prefixed by either @ or u/ depending on the Developer's type (04/04/2023 - 11:31:44)
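
The commits above describe DeveloperSnapshot.RedditPostIDs as a string array holding subreddit names and post IDs as pairs, and a postIDPairs function that turns it into structs. A minimal Go sketch of that idea follows. It assumes the array alternates subreddit, post ID, subreddit, post ID; the actual layout, the `PostIDPair` type, and the function signature are assumptions, not the project's code.

```go
package main

import "fmt"

// PostIDPair is a hypothetical struct pairing a subreddit with a post ID.
type PostIDPair struct {
	Subreddit string
	PostID    string
}

// PostIDPairs groups a flat array into pairs, assuming the layout
// [subreddit, postID, subreddit, postID, ...].
func PostIDPairs(flat []string) ([]PostIDPair, error) {
	if len(flat)%2 != 0 {
		return nil, fmt.Errorf("expected an even number of elements, got %d", len(flat))
	}
	pairs := make([]PostIDPair, 0, len(flat)/2)
	for i := 0; i < len(flat); i += 2 {
		pairs = append(pairs, PostIDPair{Subreddit: flat[i], PostID: flat[i+1]})
	}
	return pairs, nil
}

func main() {
	// The IDs below are made-up placeholders.
	pairs, err := PostIDPairs([]string{"gamedev", "abc123", "IndieDev", "def456"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range pairs {
		fmt.Printf("r/%s post %s\n", p.Subreddit, p.PostID)
	}
}
```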
andygello555 added a commit that referenced this issue Apr 5, 2023
- createFakeDevelopersWithSnaps now creates snaps with CreatedAt set to be a linear range (04/04/2023 - 14:20:05)
- createFakeDevelopersWithSnaps now creates a unique ID for Steam Games from the current developer index and the current game index which can be reversed (04/04/2023 - 14:20:52)
- Added the cat function to MeasureContext.Funcs (04/04/2023 - 15:16:58)
- Fixed some issues with the measure email template not being descriptive enough about the types of Developers (04/04/2023 - 15:42:44)
- Added the models.DeveloperType.EnumValue which returns the character representing a DeveloperType (04/04/2023 - 15:44:51)
- Added the DeleteItem Monday API client binding (04/04/2023 - 16:12:53)
- Added the monday.Me binding, for use in testing (05/04/2023 - 12:33:35)
- Added the additionalColumnValues to the models.AddGameToMonday binding (05/04/2023 - 13:01:51)
- Added the MustTypePaginate and MustPaginate functions to paginator.go that let me create paginators without so much error checking (05/04/2023 - 13:21:48)
- Finally implemented the TestMeasurePhase test (05/04/2023 - 13:43:36)
@andygello555 andygello555 linked a pull request Apr 7, 2023 that will close this issue
andygello555 added a commit that referenced this issue Apr 19, 2023
- Added the CachedFieldTypeFromName function which returns the CachedFieldType from the given name (19/04/2023 - 11:15:26)
- Added the ScoutState.GetIterableCachedFieldFromName + ScoutState.GetCachedFieldFromName methods to ScoutState that utilise the CachedFieldTypeFromName function (19/04/2023 - 11:16:09)
- Updated the started and error email templates to use these new methods rather than passing in the underlying integer representing a CachedFieldType (19/04/2023 - 11:16:56)