WIP: Parse archived Parler pages into structured JSON

tarqd/parler-parse

parler-parse

Parler HTML goes in (stdin), structured JSON comes out (stdout)

This might be useful for feeding into Elasticsearch or for cross-referencing with the video/images dump.

Usage

You will need a Rust compiler (the easiest way to get one is via rustup) to build from source. Then run the following commands in your terminal:

# clone the repo
git clone https://github.com/ilsken/parler-parse.git && cd parler-parse

# run the example
cargo run < examples/echo--parent-no-comment.html

CLI options

USAGE:
    parler-indexer [FLAGS] [OPTIONS] [--] [path]...

FLAGS:
    -c, --compact      Output compact (single line) JSON. Defaults to true if stdin is not a terminal
    -h, --help         Prints help information
    -r, --recursive    Recursively search directories
    -V, --version      Prints version information

OPTIONS:
        --fail-log <fail file>              Write failed paths to a file
        --paths-from-file <path file>...    Read paths from a file
        --success-log <success file>        Write successfully processed paths to a file

ARGS:
    <path>...    HTML File(s) or directory of HTML File(s) to parse
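
To illustrate what the `[path]...` arguments and the `-r, --recursive` flag imply, here is a minimal sketch of collecting HTML files from a file or directory using only the standard library. The function name `collect_html_files` and the rule of matching on the `.html` extension are assumptions for illustration, not the project's actual implementation (the roadmap indicates the real tool uses crossbeam and rayon).

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect HTML files under `root` (hypothetical helper, not taken from parler-parse).
/// When `recursive` is false, only the direct children of `root` are considered.
fn collect_html_files(root: &Path, recursive: bool) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();

    // A plain file is returned as-is, since the CLI accepts files or directories.
    if root.is_file() {
        found.push(root.to_path_buf());
        return Ok(found);
    }

    // Iterative walk with an explicit stack of directories still to visit.
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                if recursive {
                    pending.push(path);
                }
            } else if path
                .extension()
                .map_or(false, |ext| ext.eq_ignore_ascii_case("html"))
            {
                // Assumption: archive pages are identified by an `.html` extension.
                found.push(path);
            }
        }
    }

    // Sort so that the output order is deterministic.
    found.sort();
    Ok(found)
}
```

Each returned path could then be parsed independently, which is what makes the work easy to spread across threads.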

Where do I get the archives?

This project was developed against the "partial parler post text" archive that is available from Distributed Denial of Secrets.

Currently parses:

  • OG Meta
  • Posts + Echos
    • Author (username + name + avatar + badge)
    • Body
    • Media Attachments (URL, Title, Excerpt, Type, ID (numeric and base62/hex encoded))
  • Comments + Replies + Engagements
  • Metrics (impressions, echoes, comment count, etc)
  • All mentioned usernames in the post
  • Profile pages + all posts
  • Estimated timestamp offset ("3 days ago" -> -259200, i.e. minus 3 days in seconds)
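
The estimated timestamp offset in the last item can be sketched as follows. This is a minimal illustration, not the project's code: the function name `approx_ts_offset`, the set of accepted units, and the 30-day month and 365-day year are all assumptions.

```rust
/// Convert a relative timestamp such as "2 days ago" into a signed offset in
/// seconds. Hypothetical helper: the accepted units and the fixed month/year
/// lengths below are assumptions, not taken from parler-parse.
fn approx_ts_offset(rel_ts: &str) -> Option<i64> {
    let mut parts = rel_ts.split_whitespace();

    // Expected shape: "<amount> <unit> ago".
    let amount: u32 = parts.next()?.parse().ok()?;
    let unit = parts.next()?.to_ascii_lowercase();
    if parts.next()? != "ago" {
        return None;
    }

    // Strip a plural "s" so that "day" and "days" share one match arm.
    let unit_secs: i64 = match unit.trim_end_matches('s') {
        "second" => 1,
        "minute" => 60,
        "hour" => 3_600,
        "day" => 86_400,
        "week" => 604_800,
        "month" => 2_592_000, // assumed 30 days
        "year" => 31_536_000, // assumed 365 days
        _ => return None,
    };

    // The offset is negative because the post lies in the past relative to
    // the time the page was archived.
    Some(-(i64::from(amount) * unit_secs))
}
```

For the example output in this README, `approx_ts_offset("2 days ago")` yields `Some(-172800)`, matching the `approx_ts_offset` field. Adding the offset to the archive's capture time gives only an approximate post time, since the relative label is coarse.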

Roadmap

  • ✅ Bug: Author field will be null if a user just echoed a post (only has the author of the echoed post). We can populate it with the OG meta title field
  • ✅ Multi-threaded, recursive directory processing (crossbeam + rayon)
  • ✅ Allow bulk / multi-threaded processing for all files in a directory for quickly importing into Elasticsearch/Meilisearch/Tantivy
  • [TODO] Add file metadata (create/modified date/path)
  • [TODO] WARC support + metadata
  • [TODO] Fix up timestamps based on metadata

Example output

{
  "opengraph_meta": {
    "title": "@AnthonyDaubs - AnthonyDaubs -",
    "owner": {
      "name": "AnthonyDaubs",
      "username": "@AnthonyDaubs"
    },
    "url": "/post/8c36602d9568482dacfc55d9b63d5a07",
    "image_url": "https://images.parler.com/af00acf47ba74651998fb9676aabd117_256"
  },
  "posts": [
    {
      "echo_by": null,
      "cards": [
        {
          "kind": "Post",
          "author": {
            "name": "AnthonyDaubs",
            "username": "@AnthonyDaubs",
            "avatar": {
              "url_raw": "https://images.parler.com/af00acf47ba74651998fb9676aabd117_256",
              "url": "https://images.parler.com/af00acf47ba74651998fb9676aabd117_256",
              "host": "images.parler.com",
              "is_external": false,
              "id": "af00acf47ba74651998fb9676aabd117"
            }
          },
          "rel_ts": "2 days ago",
          "approx_ts_offset": -172800,
          "body": "",
          "impression_count": 3,
          "is_sensitive_content": true,
          "media_items": [
            {
              "kind": "Video",
              "title": "",
              "link": {
                "label": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "url_raw": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "url": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "host": "video.parler.com",
                "is_external": false,
                "id": "Q2s5oVN1pfgk",
                "id_b62_dec": 1355361448748163000000
              },
              "excerpt": "",
              "source": {
                "label": "",
                "url_raw": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "url": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "host": "video.parler.com",
                "is_external": false,
                "id": "Q2s5oVN1pfgk",
                "id_b62_dec": 1355361448748163000000
              },
              "numeric_id": null
            }
          ]
        }
      ],
      "comments": [],
      "post_id": null,
      "mentions": [],
      "engagements": {
        "comment_count": 0,
        "echo_count": 0,
        "upvote_count": 0
      }
    }
  ]
}
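
The `id` and `id_b62_dec` fields above pair an encoded identifier with a decoded number. The sketch below shows one way such IDs could be decoded. It is an illustration under stated assumptions: the alphabet order (digits, then uppercase, then lowercase) and the function names `decode_base62` and `decode_hex_id` are hypothetical, and the project may use a different alphabet or numeric type.

```rust
/// Assumed base62 alphabet: digits, then uppercase, then lowercase letters.
/// The ordering actually used by parler-parse is not confirmed here.
const ALPHABET: &[u8] = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

/// Decode a base62 string into an integer (hypothetical helper).
/// Returns None for empty input, unknown characters, or overflow.
fn decode_base62(id: &str) -> Option<u128> {
    if id.is_empty() {
        return None;
    }
    let mut value: u128 = 0;
    for byte in id.bytes() {
        // The position in the alphabet is the digit value.
        let digit = ALPHABET.iter().position(|&c| c == byte)? as u128;
        value = value.checked_mul(62)?.checked_add(digit)?;
    }
    Some(value)
}

/// Decode a hex ID, such as a 32-character image ID, into an integer
/// (hypothetical helper). 32 hex characters fit exactly into 128 bits.
fn decode_hex_id(id: &str) -> Option<u128> {
    u128::from_str_radix(id, 16).ok()
}
```

A `u128` is used so that a 12-character ID such as `Q2s5oVN1pfgk` decodes without overflow or loss of precision.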

License

MIT licensed, feel free to use it. If you want to use it for research, I'd love to hear about it and help if I can. Shoot me an email or message me on Twitter (@chris_tarquini).
