Cleaning a dataset

Matti Schneider edited this page Nov 15, 2022 · 1 revision

Goals

Publishing a manually reviewed and cleaned dataset enables:

  1. improving its quality;
  2. documenting its limitations;
  3. reusing it easily;
  4. if desired, ceasing to maintain the instance that produces it.

Problem definition

Over the life of an instance, unsatisfactory versions of documents might be extracted from snapshots. For example, they might contain changes unrelated to the terms, be empty, or switch language, among other problems. Such unsatisfactory versions decrease the value of the dataset: for example, it becomes impossible to measure the actual number of changes.

Reviewing and cleaning the dataset entails correcting the history of declarations, identifying snapshots to skip, and extracting new versions from the snapshots based on this information. In the end, the entire version history will be rewritten and overwritten, and the declarations will be completed. All original snapshots are left unchanged and the previous state of the versions remains available, which allows auditing.

Process

  • Iterate on every service
  • Iterate on every document type
  • Iterate on every snapshot
  • Extract version
    • This will automatically erase refilters. Indeed, refilters are only historical artifacts: they correct a version that should not have been recorded as it was in the first place.
  • If the version cannot be generated:
    • If the snapshot is unexploitable, skip it. A snapshot is unexploitable if it does not contain the tracked document. We have encountered so far:
      • Empty content
      • Botwall
      • Loginwall
      • Cookiewall
      • Server error
      • Exception: if the provider is unable to provide the document to its expected audience in general, and not only to Open Terms Archive (e.g. because it is undergoing maintenance), this should be tracked instead of skipped
    • If the snapshot is exploitable, correct the declaration. Potential reasons are:
      • A selector is wrong. Usually, this means that the history date from which that selector applies is wrong (otherwise, the declaration was wrong from the beginning). Use the fetchDate of the last snapshot that successfully generates a version as validUntil
  • If the generated version markup differs significantly, remove changes that do not reflect a change in the document content itself.
    • We have encountered so far:
      • Switching list styles (ordered to unordered list)
      • Switching between mobile and desktop pages
      • Switching between geographic region-optimised layouts
      • Switching between languages (934bddb9cdf40e7c53b5c43d0db3dc393e2a2eb4)
      • Switching between different browser-optimised layouts
      • Note: these should happen less and less as:
        • The Core is optimised to minimise such changes (single user agent)
        • Deployment is optimised to minimise such changes (single well-known IP)
        • Operations are optimised to minimise such changes (single process instead of parallel, decreasing the number of requests)
    • Known tactics, in order of preference:
      1. Declare both layouts in the same declaration
        • By using mutually exclusive selectors where each is applicable only in one case, yet the combination covers all cases
      2. Unify markup with filters (e.g. unwrap final destination URL of a link from a query parameter, replace some tags by others…)
      3. Skip the snapshot entirely (e.g. alternating between mobile and desktop pages). When snapshots alternate, choose which ones to skip under the following constraints, by decreasing priority:
        1. Maximise version quality (more markup, better readability)
        2. Maximise frequency (at least one version a day)
        3. Minimise changes to declaration
        4. Minimise declaration complexity
  • Review versions and apply some sanity checks
    • Add filters
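
The extraction part of the process above, for one service and one document type, can be sketched as follows. This is a minimal illustration and not Open Terms Archive code: `Snapshot`, `Review`, `ExtractionError`, `review_snapshots`, and the `extract` and `diagnose` callbacks are all hypothetical names introduced here, and the label strings simply mirror the cases listed above. The exception for provider-wide unavailability (e.g. maintenance) is not modelled.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Iterable, List, Optional, Tuple

# Illustrative labels for the unexploitable cases listed in the process.
UNEXPLOITABLE = {"empty content", "botwall", "loginwall", "cookiewall", "server error"}


class ExtractionError(Exception):
    """Hypothetical error: no version could be generated from a snapshot."""


@dataclass
class Snapshot:
    id: str
    fetch_date: datetime
    content: str


@dataclass
class Review:
    versions: List[Tuple[str, str]] = field(default_factory=list)  # (snapshot id, version)
    skipped: List[str] = field(default_factory=list)  # ids of unexploitable snapshots
    failing: Optional[str] = None  # first exploitable snapshot that fails to extract
    valid_until: Optional[datetime] = None  # candidate validUntil for the declaration history


def review_snapshots(
    snapshots: Iterable[Snapshot],
    extract: Callable[[Snapshot], str],
    diagnose: Callable[[Snapshot], Optional[str]],
) -> Review:
    """Walk snapshots chronologically, skipping unexploitable ones.

    Stops at the first exploitable snapshot that fails, since the declaration
    must be corrected before going further. `extract` and `diagnose` are
    assumed to be supplied by the reviewer; they are not a real API.
    """
    review = Review()
    last_success: Optional[datetime] = None
    for snapshot in sorted(snapshots, key=lambda s: s.fetch_date):
        try:
            version = extract(snapshot)
        except ExtractionError:
            if diagnose(snapshot) in UNEXPLOITABLE:
                review.skipped.append(snapshot.id)
                continue
            # Exploitable snapshot: the declaration is at fault. Report the
            # fetch date of the last snapshot that produced a version.
            review.failing = snapshot.id
            review.valid_until = last_success
            return review
        review.versions.append((snapshot.id, version))
        last_success = snapshot.fetch_date
    return review
```

In this sketch, a returned `failing` identifier signals that the declaration needs a new history entry ending at `valid_until`; the sketch assumes the review is then simply run again with the corrected declaration. Markup unification and the final sanity checks remain manual steps.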