PageGraph

pes10k edited this page Apr 9, 2024 · 51 revisions

PageGraph is a research project developed by Brave that instruments the Brave browser, Blink, and V8 to allow for complete attribution of document modifications, network requests, script execution, and privacy-relevant Web API accesses.

PageGraph is included in all builds of Brave starting with version 1.46, though enabling it requires passing some command line arguments. The easiest way to run PageGraph is to use the pagegraph-crawl tools and scripts, which automate enabling PageGraph in Brave, launching Brave, and recording the resulting graph files.

Note that recording built-in JS APIs requires nightly Brave builds. This choice is made to slightly improve the performance of the Beta and Stable builds that most Brave users run. The list of instrumented Web APIs can be found in the Python interface.py file, which controls, at build time, which APIs are instrumented. You can instrument additional APIs by adding them to this file and rebuilding Brave.

PageGraph's name comes from the graph-based representation of the document's execution it builds in memory. Every relevant event in the document (a node being created, a network request being triggered, a script being executed, etc.) is recorded in the graph, noting both the relevant event, and the event's cause.

The resulting graph allows for the analysis of the proximate and upstream causes of every modification in the document. We expect this information will be useful for a variety of purposes, including filter list generation, privacy-preserving feature restrictions, and web compatibility analysis.
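
As a rough illustration of the "event plus cause" idea described above, the following Python sketch models a tiny cause graph and walks it to find proximate and upstream causes. It is a toy model under stated assumptions, not PageGraph's actual data structures: the class name CauseGraph, its methods, and all node and event labels are hypothetical.

```python
from collections import deque


class CauseGraph:
    """Toy directed graph (hypothetical): each recorded event links the
    actor that caused it to the target it affected."""

    def __init__(self):
        self.labels = {}   # node id -> human-readable label
        self.causes = {}   # target id -> list of (actor id, event name)

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def record(self, actor, event, target):
        # Store the event together with its cause (the actor).
        self.causes.setdefault(target, []).append((actor, event))

    def proximate_causes(self, node_id):
        # Direct causes only: one step back in the graph.
        return list(self.causes.get(node_id, []))

    def upstream_causes(self, node_id):
        # Breadth-first walk backwards through causes, nearest first.
        seen = {node_id}
        order = []
        queue = deque([node_id])
        while queue:
            current = queue.popleft()
            for actor, _event in self.causes.get(current, []):
                if actor not in seen:
                    seen.add(actor)
                    order.append(actor)
                    queue.append(actor)
        return order


# Hypothetical page: the parser inserts a script, which injects a second
# script, which creates an image element, which triggers a request.
graph = CauseGraph()
for node_id, label in [
    ("parser", "HTML parser"),
    ("script1", "remote script"),
    ("script2", "injected script"),
    ("img", "image element"),
    ("req", "network request"),
]:
    graph.add_node(node_id, label)

graph.record("parser", "insert node", "script1")
graph.record("script1", "create node", "script2")
graph.record("script2", "create node", "img")
graph.record("img", "request start", "req")

proximate = graph.proximate_causes("req")
upstream = graph.upstream_causes("req")
```

In this sketch, the proximate cause of the request is the image element, while the upstream walk leads back through both scripts to the parser. That chain is the kind of attribution that makes it possible to ask which script was ultimately responsible for a request.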

PageGraph is the next, production-ish version of AdGraph, an earlier system for representing page execution as a directed graph. PageGraph advances AdGraph in a large number of ways: increasing the breadth and accuracy of attributions, enabling GraphML export, and staying up to date with Brave and Chromium, among many other improvements.

Tools

You can find tools for querying data from PageGraph archives for either Python (pagegraph-query) or Rust (pagegraph-rust).

Questions / Clarifications

If you have any questions about PageGraph, or would like to use / extend it for your research purposes, feel free to ping pes@brave.com / @pes10k.

Features

PageGraph Currently Does the Following

  • Track the execution and results of all scripts (inline, remote, attributes, eval, etc.) and how each script got on the page
  • Track most resource requests (see below for exceptions, but currently covers AJAX, fetch, images, scripts and CSS), including response headers and size
  • Track all DOM node creations, deletions and modifications (note small exceptions below)
  • Track interactions between shields and network requests (including keeping track of which filter rules resulted in the behavior)
  • Track when scripts call common Web API fingerprinting methods
  • Puppeteer integration
  • GraphML export for offline / after the fact analysis
  • Chromium devtools integration, for in-browser analysis of the graph. Currently allows for observing the full history of any node in the document.
  • Timestamp all events
  • Module scripts
  • Handling remote frames
  • Instrumenting what parts of JS are responsible for which API calls

Not Done But Will Be Done Shortly

  • Scripts in SVG documents

Would Be Nice / Someday / Known Limitations / Not Important for Current Needs

  • Recording the body of large, streamed, requests (<audio> and <video>)
  • Dealing with WebSockets in any way
  • Better handling of JS urls (currently these are treated as originating from the document node, instead of the node with the relevant URL, b/c of how blink is structured)
  • Tracking css @imports
  • Tracking style= modifications
  • Including request headers (currently only response headers are recorded)
  • Better handling of HSTS and URL tracking. Currently requests are tracked w/o URL fragments (b/c they're ignored by the blink cache) or protocol (so that the same response can be matched with its request, even if it was HSTS upgraded). This works "well enough", but it's not 100% precise, and will lose information in some cases.
  • Any kind of tracking of worker behaviors
  • Resources fetched b/c of CSS rules are currently tracked in the graph, but not attributed to either the relevant CSS document / rule, or the DOM element that triggered the request. Would be nice to tie the request to either or both.
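
To make the HSTS / URL tracking limitation in the list above concrete, here is a minimal Python sketch of matching requests to responses on a key that omits the protocol and the fragment. It only illustrates the idea and is not Brave's implementation; the function name request_key and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def request_key(url: str) -> str:
    """Hypothetical matching key: the URL without its scheme (protocol)
    and without its fragment."""
    parts = urlsplit(url)
    key = parts.netloc + parts.path
    if parts.query:
        key += "?" + parts.query
    return key


# An HSTS-upgraded request still matches its response...
original = request_key("http://example.com/app.js?v=1")
upgraded = request_key("https://example.com/app.js?v=1")

# ...but URLs that differ only in their fragment become indistinguishable,
# which is the kind of information loss the item above mentions.
section_a = request_key("https://example.com/page#a")
section_b = request_key("https://example.com/page#b")
```

Here original and upgraded produce the same key, which is the desired behavior, while section_a and section_b also collide even though the page requested two distinct URLs.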

Building

As of version 1.46, PageGraph is included in all brave-core (i.e., desktop and Android) versions of Brave. There is no need for a special, customized build process.

Running

The PageGraph-enabled browser can be run manually like a normal build of Brave, and can still generate graph files when run this way. However, we recommend running it with an automated crawling tool like pagegraph-crawl.

Publications Using PageGraph

PageGraph Documentation

The best place to find documentation on the structure and types in the resulting GraphML-format graphs is in the documentation for the rust library we maintain for parsing and querying these graphs. In particular, the types documentation may be helpful.

This documentation is incomplete and being built up as we go. If you have questions about the document format that aren't answered by the existing documentation, please do not hesitate to open an issue.
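
For a quick look at a graph file without the dedicated libraries, generic GraphML can be read with Python's standard library. The sketch below parses a small inline sample; the attribute names ("node type", "edge type") and their values are illustrative assumptions, not a confirmed description of the PageGraph schema, and the function name load_graphml is hypothetical. Only the GraphML elements themselves (key, graph, node, edge, data) come from the GraphML format.

```python
import xml.etree.ElementTree as ET

# Namespace used by GraphML documents.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Illustrative sample only; real PageGraph files are much larger and
# their attribute names may differ.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="node type" attr.type="string"/>
  <key id="d1" for="edge" attr.name="edge type" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <node id="n1"><data key="d0">script</data></node>
    <node id="n2"><data key="d0">HTML element</data></node>
    <edge id="e1" source="n1" target="n2"><data key="d1">create node</data></edge>
  </graph>
</graphml>
"""


def load_graphml(text):
    """Return (nodes, edges) from a GraphML string.

    nodes maps node id -> {attribute name: value};
    edges is a list of (source id, target id, {attribute name: value}).
    """
    root = ET.fromstring(text)
    # Map key ids such as "d0" to their declared attribute names.
    key_names = {
        key.get("id"): key.get("attr.name")
        for key in root.findall("g:key", GRAPHML_NS)
    }

    def attributes(element):
        return {
            key_names.get(data.get("key"), data.get("key")): data.text
            for data in element.findall("g:data", GRAPHML_NS)
        }

    graph = root.find("g:graph", GRAPHML_NS)
    nodes = {n.get("id"): attributes(n) for n in graph.findall("g:node", GRAPHML_NS)}
    edges = [
        (e.get("source"), e.get("target"), attributes(e))
        for e in graph.findall("g:edge", GRAPHML_NS)
    ]
    return nodes, edges


nodes, edges = load_graphml(SAMPLE)
```

For a file on disk, the same function could be given the file's contents as a string. For anything beyond quick inspection, the query libraries listed under Tools are likely the better choice, since they understand the actual node and edge types.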

Note that PageGraph currently does not track Puppeteer / automation scripts, so modifying or interacting with the document through devtools / Puppeteer while recording a PageGraph file will fail.
