
# Potential Data Reuses


## Readability analysis

See, e.g., this paper.

Time of implementation: ⭐ ▶️ The tool in the stats repo already allows computing the Flesch reading ease and Flesch–Kincaid grade level metrics.

💡: Track how these two measures change whenever an update to a document is made; a minimal sketch follows.
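
As a minimal sketch, assuming two plain-text versions of the same terms document are at hand (the file names below are hypothetical placeholders), the widely used `textstat` package computes both metrics:

```python
# A minimal sketch: compare readability metrics across two versions of a
# document. File names are hypothetical placeholders.
import textstat

def readability(text: str) -> dict:
    """Compute the two readability metrics mentioned above."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
    }

with open("terms_v1.txt") as f:  # hypothetical: version before the update
    before = readability(f.read())
with open("terms_v2.txt") as f:  # hypothetical: version after the update
    after = readability(f.read())

for metric, value in before.items():
    print(f"{metric}: {value:.1f} -> {after[metric]:.1f} "
          f"(Δ {after[metric] - value:+.1f})")
```

Run over every recorded version of a document, this would yield a readability time series per service.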

## Document embedding analysis

See this article for a primer on the topic.

Time of implementation: ⭐ ⭐ ▶️ Some models such as doc2vec have Python implementations that are easy to integrate into our repo. Plotting these embeddings is a bit more involved, but still relatively simple.

💡: Projecting documents into a 2-dimensional embedding space could allow us to produce "maps" of documents, grouping service providers together by proximity and making interesting comparisons. For example, are Facebook's terms semantically closer to Instagram's terms than to Twitter's? See the sketch below.
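
As a rough illustration, assuming the terms texts are available as plain strings keyed by service name (the corpus below is a hypothetical placeholder), a gensim doc2vec model plus a PCA projection yields such a map:

```python
# A rough sketch: embed documents with doc2vec, then project to 2-D.
# The corpus is a hypothetical placeholder.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

corpus = {
    "Facebook": "…full text of Facebook terms…",
    "Instagram": "…full text of Instagram terms…",
    "Twitter": "…full text of Twitter terms…",
}

# Train a small doc2vec model on the naively tokenised documents.
tagged = [TaggedDocument(words=text.lower().split(), tags=[name])
          for name, text in corpus.items()]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Project the learned document vectors down to 2 dimensions.
vectors = [model.dv[name] for name in corpus]
points = PCA(n_components=2).fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1])
for (x, y), name in zip(points, corpus):
    plt.annotate(name, (x, y))
plt.title("Document map: 2-D projection of doc2vec embeddings")
plt.show()
```

For a larger corpus, t-SNE or UMAP could replace PCA; PCA is used here only because it requires no tuning.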

## Correlate terms changes over time

Time of implementation: ❓ ▶️ This would be much more exploratory.

💡: Do certain service providers tend to perform the same kind of updates at the same time? Can we correlate some changes with well-identified exogenous events (e.g. the COVID-19 crisis, new legislation) and measure the average time it takes each service provider to update its terms in response to these events? One exploratory starting point is sketched below.
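
As one exploratory starting point, assuming a table of update events with one row per service and date (the CSV file and column names are hypothetical), a simple co-occurrence count highlights providers that update near-simultaneously:

```python
# An exploratory sketch: count near-simultaneous updates per service pair.
# The input file and its column names are hypothetical placeholders.
import pandas as pd
from itertools import combinations

updates = pd.read_csv("updates.csv", parse_dates=["date"])  # columns: service, date

WINDOW = pd.Timedelta(days=3)  # arbitrary co-occurrence window

# For each pair of services, count updates of the first that fall within
# WINDOW of any update of the second.
pairs = {}
for a, b in combinations(sorted(updates["service"].unique()), 2):
    dates_a = updates.loc[updates["service"] == a, "date"]
    dates_b = updates.loc[updates["service"] == b, "date"]
    hits = sum((abs(dates_b - d) <= WINDOW).any() for d in dates_a)
    pairs[(a, b)] = hits

for (a, b), hits in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(f"{a} / {b}: {hits} near-simultaneous updates")
```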

## Detect unofficial vs. official changes

💡: Some changes are applied “officially” and update the “last updated” date. Other changes are applied silently and are not detectable by end users unless they use Open Terms Archive. How often are terms updated without users' knowledge, and in what proportion? Do “official” changes always correlate with “significant” changes? A detection sketch follows.
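
As a minimal detection sketch, assuming snapshots of the same document before and after a recorded change (the “last updated” pattern below is a hypothetical example and the actual wording varies per service):

```python
# A minimal sketch: classify a change as official (date bumped) or silent.
# The "last updated" pattern is a hypothetical example.
import re

LAST_UPDATED = re.compile(r"last\s+updated", re.IGNORECASE)

def classify_change(old: str, new: str) -> str:
    """Label a recorded change as official or silent."""
    if old == new:
        return "no change"
    old_dated = [line for line in old.splitlines() if LAST_UPDATED.search(line)]
    new_dated = [line for line in new.splitlines() if LAST_UPDATED.search(line)]
    if old_dated != new_dated:
        return "official: the “last updated” line changed"
    return "silent: content changed but “last updated” did not"

print(classify_change(
    "Last updated: May 1.\nNo spam.",
    "Last updated: May 1.\nNo spam. We may sell your data.",
))
```

Aggregating these labels over all recorded versions would give the proportion of silent updates per service.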