runtime+storage: integrate disk storage
With this change, the disk backend (badger) becomes properly
available for use with the OPA runtime:

It can be configured using the `storage.disk` key in OPA's config
(see included documentation).
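
For illustration, a minimal configuration sketch (the directory path
is just an example):

    storage:
      disk:
        directory: /var/lib/opa
        auto_create: true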

When enabled,
- any data or policies stored with OPA will persist over restarts
- per-query metrics related to disk usage are reported
- Prometheus metrics per storage operation are exported

The main intention behind this feature is to optimize memory usage:
OPA can now operate on more data than fits into the allotted memory
resources. It is NOT meant to be used as a primary source of truth:
there are no backup/restore or disaster recovery procedures -- you
MUST secure the means to restore the data stored with OPA's disk
storage yourself.

See also open-policy-agent#4014. Future improvements around bundle loading are
planned.

Some notes on details:

storage/disk: impose same locking regime used with inmem

With this setup, we'll ensure:

- there is only one open write txn at a time
- there can be any number of open read txns at a time
- writes are blocked when reads are inflight
- during a commit (and triggers being run), no read txns can be created

This is to ensure the same atomic policy update semantics when using
"disk" as we have with "inmem". We're basically opting out of badger's
concurrency control and transactionality guarantees, because we
cannot piggyback on them to ensure the atomic update we want.

There might be other ways -- using subscribers, and blocking in some
other place -- but this one seems preferable since it mirrors inmem.
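
In Go terms, the regime is roughly the following -- a sketch with
illustrative names, not the actual implementation:

    package store

    import "sync"

    type lockedStore struct {
        wmu sync.Mutex   // serializes write transactions
        rmu sync.RWMutex // shared by reads, exclusive during commit
    }

    func (s *lockedStore) BeginWrite() { s.wmu.Lock() }  // one open write txn at a time
    func (s *lockedStore) BeginRead()  { s.rmu.RLock() } // any number of read txns
    func (s *lockedStore) EndRead()    { s.rmu.RUnlock() }

    func (s *lockedStore) Commit() {
        s.rmu.Lock() // wait for in-flight reads; no new reads during commit/triggers
        // ... apply writes to badger, run triggers ...
        s.rmu.Unlock()
        s.wmu.Unlock() // the next write txn may start (an abort would unlock wmu too)
    }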

Part of the problem is ErrTxnTooLarge: committing and renewing txns
when it occurs is the prescribed remedy for txns growing too big, but
it also means that reads can see half of the "logical" transaction as
committed, while the rest is still being processed.

Another approach would have been using `WriteBatch`, but that won't
let us read from the batch, only apply Set and Delete operations.
We currently need to read (via an iterator) to figure out if we
need to delete keys to replace something in the store.  There is
no DropPrefix operation on the badger txn, or the WriteBatch API.

storage/disk: remove commit-and-renew-txn code for txn-too-big errors

This would break transactional guarantees we care about: while there
can be only one write transaction at a time, read transactions may
happen while a write txn is underway -- with the commit-and-renew
logic, those would read partial data.

Now, the error is returned to the caller. The maximum txn size
depends on the size of the memtables, and could be tweaked manually.
In general, the caller should push the data in multiple smaller
increments.
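
A caller-side sketch of that approach (the path and the chunking
strategy are illustrative):

    import (
        "context"

        "github.com/open-policy-agent/opa/storage"
    )

    // writeInChunks stores each top-level key in its own transaction,
    // so that no single write txn outgrows badger's limits.
    func writeInChunks(ctx context.Context, store storage.Store, doc map[string]interface{}) error {
        for k, v := range doc {
            err := storage.Txn(ctx, store, storage.WriteParams, func(txn storage.Transaction) error {
                return store.Write(ctx, txn, storage.AddOp, storage.Path{"users", k}, v)
            })
            if err != nil {
                return err
            }
        }
        return nil
    }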

storage/disk: implement noop MakeDir

The MakeDir operation as implemented in the backend-agnostic storage
code has become an issue with the disk store: to write /foo/bar/baz,
we'd have to read /foo (among other subdirs), and that can be a _lot_
of work for the disk backend. With inmem, it's cheap, so this wasn't
problematic before.

Some of the storage/disk/txn.go logic had to be adjusted to properly
do the MakeDir steps implicitly.

The index argument added to patch() in storage/disk/txn.go was
necessary to keep the error messages conforming to the previous
code path: previously, conflicts (arrays indexed as objects) would
be surfaced in the MakeDir step; now they are entangled with the
patch calculation.

storage/disk: check ctx.Err() in List/Get operations

This won't abort reading a single key, but it will abort iterations.
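
Sketched (not the exact code):

    // inside a List operation: abort the iteration when the caller's
    // context is cancelled; a single-key Get is not interrupted
    for it.Rewind(); it.Valid(); it.Next() {
        if err := ctx.Err(); err != nil {
            return err
        }
        // ... read it.Item() ...
    }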

storage/disk: support patterns in partitions

There is a potential clash here: "*", the path wildcard, is also
a valid path section. However, this only affects the case where a
user wants a partition at

    /foo/*/bar

and really means a literal "*", not the wildcard.

Storing data at /foo/*/bar with a literal "*" won't be treated
differently than storing something at /foo/xyz/bar.
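
As a config sketch, a wildcard partition looks like this (the
directory path is illustrative):

    storage:
      disk:
        directory: /var/lib/opa
        partitions:
        - /tenants/*/bindings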

storage/disk: keep per-txn-type histograms of stats

This is done by reading off the metrics on commit, and shovelling
their numbers into the prometheus collector.

NOTE: if you were to share a metrics object among multiple transactions,
the results would be skewed, as it's not reset. However, our server
handlers don't do that.
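
Sketched (hypothetical names; the actual collector wiring differs):

    import (
        "github.com/open-policy-agent/opa/metrics"
        "github.com/prometheus/client_golang/prometheus"
    )

    var readKeys = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{Name: "disk_read_keys"},
        []string{"txn_type"},
    )

    // on commit, read off the txn's counters and feed them into the
    // per-txn-type histograms
    func record(m metrics.Metrics, txnType string) {
        if n, ok := m.Counter("disk_read_keys").Value().(uint64); ok {
            readKeys.WithLabelValues(txnType).Observe(float64(n))
        }
    }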

storage/disk: opt out of badger's conflict detection

With only one write transaction in flight at any time, the situation
that badger guards against cannot happen:

A transaction has written to a key after the current, to-be-committed
transaction has last read that key from the store.

Since it can't happen, we can ignore the bookkeeping involved. This
improves the time it takes to overwrite existing keys.
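
In badger terms, a sketch:

    import badger "github.com/dgraph-io/badger/v3"

    func open(dir string) (*badger.DB, error) {
        // conflict detection is bookkeeping we don't need: OPA's
        // locking permits only one write transaction at a time
        return badger.Open(badger.DefaultOptions(dir).WithDetectConflicts(false))
    }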

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
srenatus authored and rokkiter committed Apr 18, 2022
1 parent ef31026 commit 0a8a933
Showing 39 changed files with 2,476 additions and 506 deletions.
3 changes: 3 additions & 0 deletions config/config.go
@@ -34,6 +34,9 @@ type Config struct {
	Caching              json.RawMessage `json:"caching,omitempty"`
	PersistenceDirectory *string         `json:"persistence_directory,omitempty"`
	DistributedTracing   json.RawMessage `json:"distributed_tracing,omitempty"`
+	Storage *struct {
+		Disk json.RawMessage `json:"disk,omitempty"`
+	} `json:"storage,omitempty"`
}

// ParseConfig returns a valid Config object with defaults injected. The id
17 changes: 17 additions & 0 deletions docs/content/configuration.md
@@ -837,3 +837,20 @@ The following encryption methods are supported:
| `off` | Disable TLS |
| `tls` | Enable TLS |
| `mtls` | Enable mutual TLS |

### Disk Storage

The `storage` configuration key enables, and configures, the
persistent on-disk storage of an OPA instance.

If `disk` is set, the server enables the on-disk store, with data
kept in the configured `directory`.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `storage.disk.directory` | `string` | Yes | This is the directory to use for storing the persistent database. |
| `storage.disk.auto_create` | `bool` | No (default: `false`) | If set to true, the configured directory will be created if it does not exist. |
| `storage.disk.partitions` | `array[string]` | No | Non-overlapping `data` prefixes used for partitioning the data on disk. |
| `storage.disk.badger` | `string` | No (default: empty) | "Superflags" passed to Badger, allowing advanced options to be modified. |

See [the docs on disk storage](../misc-disk/) for details about the settings.
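
For example (the directory path is illustrative):

```yaml
storage:
  disk:
    directory: /var/lib/opa
    auto_create: true
    partitions:
    - /users/*
```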
177 changes: 177 additions & 0 deletions docs/content/misc-disk.md
@@ -0,0 +1,177 @@
---
title: Disk Storage
kind: misc
weight: 10
---

This page outlines configuration options relevant to using the disk storage
feature of OPA.
Configuration options can be found in [the configuration docs](../configuration/#disk-storage).

{{< info >}}
The persistent disk storage enables OPA to work with data that does not fit
into the memory resources granted to the OPA server.
It is **not** supposed to be used as the primary source of truth for that data.

The on-disk storage should be considered ephemeral: you need to secure the
means to restore that data.
Backup and restore, or repair procedures for data corruption, are not
provided at this time.
{{< /info >}}

## Partitions

Partitions determine how the JSON data is split up when stored in the
underlying key-value store.
For example, this table shows how an example document would be stored given
different configured partitions:

```json
{
  "users": {
    "alice": { "roles": ["admin"] },
    "bob": { "roles": ["viewer"] }
  }
}
```

| Partitions | Keys | Values |
| --- | --- | --- |
| (1) none | `/users` | `{"alice": {"roles": ["admin"]}, "bob": {"roles": ["viewer"]}}` |
| --- | --- | --- |
| (2) `/users` | `/users/alice` | `{"roles": ["admin"]}` |
| | `/users/bob` | `{"roles": ["viewer"]}` |
| --- | --- | --- |
| (3) `/users/*` | `/users/alice/roles` | `["admin"]` |
| | `/users/bob/roles` | `["viewer"]` |

Partitioning has consequences for performance: in the example above, the
number of keys to retrieve from the database (and the amount of data in
their values) varies.

| Query | Partitions | Number of keys read |
| --- | --- | --- |
| `data.users` | (1) | 1 |
| | (2) | 2 |
| | (3) | 2 |
| --- | --- | --- |
| `data.users.alice` | (1) | 1 (with `bob`'s data thrown away) |
| | (2) | 2 |
| | (3) | 2 |


For example, retrieving the full extent of `data.users` from the disk store
requires a single key fetch with partitioning (1).
With (2), the storage engine will fetch two keys and their values.

Retrieving a single user's data, e.g. `data.users.alice`, requires reading
a single key with (1) -- but that key holds all users' data, and everything
not belonging to `alice` is thrown away.

There is no one-size-fits-all setting for partitions: good settings depend
on the actual usage, and that comes down to the policies that are used with
OPA.
Commonly, you would optimize the partition settings for those queries that
are performance critical.

To spot suboptimal partitioning, have a look at the exposed metrics.

## Metrics

Using the [REST API](../rest-api/), you can include the `?metrics` query string
to gain insight into the disk storage accesses related to a given OPA query.

```
$ curl 'http://localhost:8181/v1/data/tenants/acme1/bindings/user1?metrics' | opa eval -I 'input.metrics' -fpretty
{
  "counter_disk_read_bytes": 339,
  "counter_disk_read_keys": 3,
  "counter_server_query_cache_hit": 1,
  "timer_disk_read_ns": 40736,
  "timer_rego_external_resolve_ns": 251,
  "timer_rego_input_parse_ns": 656,
  "timer_rego_query_eval_ns": 66616,
  "timer_server_handler_ns": 117539
}
```

The `timer_disk_*_ns` timers give an indication of how much time
was spent on the different disk operations.

Available timers are:
- `timer_disk_read_ns`
- `timer_disk_write_ns`
- `timer_disk_commit_ns`

Also note the `counter_disk_*` counters in the metrics:

- `counter_disk_read_keys`: number of keys retrieved
- `counter_disk_written_keys`: number of keys written
- `counter_disk_deleted_keys`: number of keys deleted
- `counter_disk_read_bytes`: bytes retrieved

Suboptimal partition settings can be spotted when the number of
keys and bytes retrieved for a query is disproportionate to the
actual data returned: the query likely had to retrieve a giant
JSON object, most of which was then thrown away.

## Debug Logging

Pass `--log-level debug` to `opa run` to see all the underlying storage
engine's logs.

When debug logging is _enabled_, the service will output statistics
about the configured disk partitions: their key counts and estimated
sizes.

```
[DEBUG] partition /tenants/acme3/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme4/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme8/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme9/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme0/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme2/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme6/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
```

Note that this process iterates over all database keys.
It only happens on startup, and only when debug logging is enabled.

## Fine-tuning Badger settings (superflags)

While partitioning should be the first thing to look into when tuning the memory
usage and performance of the on-disk storage engine, this configurable gives you the
means to change many internal aspects of how Badger uses memory and disk storage.

{{< danger >}}
To be used with care!

Any of the Badger settings used by OPA can be overridden using this feature.
There is no validation of the configurables set via this flag.

When the embedded Badger version changes, these configurables could change,
too.
{{< /danger >}}

The configurables correspond to Badger options that can be set on [the library's Options
struct](https://pkg.go.dev/github.com/dgraph-io/badger/v3#Options).

The following configurables can *not* be overridden:
- `dir`
- `valuedir`
- `detectconflicts`

Aside from conflict detection, Badger in OPA uses the default options [you can find here](https://github.com/dgraph-io/badger/blob/v3.2103.2/options.go#L128-L187).

Conflict detection is disabled because the locking scheme used within OPA does not
allow multiple concurrent writes.

### Example

```yaml
storage:
  disk:
    directory: /tmp/disk
    badger: nummemtables=1; numgoroutines=2; maxlevels=3
```
8 changes: 8 additions & 0 deletions docs/content/rest-api.md
@@ -1002,6 +1002,10 @@ If the path does not refer to an existing document, the server will attempt to c

The server will respect the `If-None-Match` header if it is set to `*`. In this case, the server will not overwrite an existing document located at the path.

#### Query Parameters

- **metrics** - Return performance metrics in addition to result. See [Performance Metrics](#performance-metrics) for more detail.

#### Status Codes

- **204** - no content (success)
@@ -1094,6 +1098,10 @@ Delete a document.

The server processes the DELETE method as if the client had sent a PATCH request containing a single remove operation.

#### Query Parameters

- **metrics** - Return performance metrics in addition to result. See [Performance Metrics](#performance-metrics) for more detail.

#### Status Codes

- **204** - no content (success)
6 changes: 3 additions & 3 deletions internal/config/config.go
@@ -83,7 +83,7 @@ func Load(configFile string, overrides []string, overrideFiles []string) ([]byte
	processedConf := subEnvVars(string(bytes))

	if err := yaml.Unmarshal([]byte(processedConf), &baseConf); err != nil {
-		return []byte{}, fmt.Errorf("failed to parse %s: %s", configFile, err)
+		return nil, fmt.Errorf("failed to parse %s: %s", configFile, err)
	}
}

@@ -93,7 +93,7 @@ func Load(configFile string, overrides []string, overrideFiles []string) ([]byte
	for _, override := range overrides {
		processedOverride := subEnvVars(override)
		if err := strvals.ParseInto(processedOverride, overrideConf); err != nil {
-			return []byte{}, fmt.Errorf("failed parsing --set data: %s", err)
+			return nil, fmt.Errorf("failed parsing --set data: %s", err)
		}
	}

@@ -105,7 +105,7 @@ func Load(configFile string, overrides []string, overrideFiles []string) ([]byte
			return value, err
		}
		if err := strvals.ParseIntoFile(override, overrideConf, reader); err != nil {
-			return []byte{}, fmt.Errorf("failed parsing --set-file data: %s", err)
+			return nil, fmt.Errorf("failed parsing --set-file data: %s", err)
		}
	}

4 changes: 2 additions & 2 deletions plugins/bundle/plugin.go
@@ -188,7 +188,7 @@ func (p *Plugin) Reconfigure(ctx context.Context, config interface{}) {

	// Deactivate the bundles that were removed
	params := storage.WriteParams
-	params.Context = storage.NewContext()
+	params.Context = storage.NewContext() // TODO(sr): metrics?
	err := storage.Txn(ctx, p.manager.Store, params, func(txn storage.Transaction) error {
		opts := &bundle.DeactivateOpts{
			Ctx: ctx,
@@ -513,7 +513,7 @@ func (p *Plugin) activate(ctx context.Context, name string, b *bundle.Bundle) er
	p.log(name).Debug("Bundle activation in progress. Opening storage transaction.")

	params := storage.WriteParams
-	params.Context = storage.NewContext()
+	params.Context = storage.NewContext().WithMetrics(p.status[name].Metrics)

	err := storage.Txn(ctx, p.manager.Store, params, func(txn storage.Transaction) error {
		p.log(name).Debug("Opened storage transaction (%v).", txn.ID())
