runtime+storage: integrate disk storage
With this change, the disk backend (badger) becomes properly
available for use with the OPA runtime:

It can be configured using the `storage.disk` key in OPA's config
(see included documentation).
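
For illustration, a minimal configuration sketch (the directory path
is just an example):

    storage:
      disk:
        directory: /var/lib/opa
        auto_create: true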

When enabled,
- any data or policies stored with OPA will persist over restarts
- per-query metrics related to disk usage are reported
- Prometheus metrics per storage operation are exported

The main intention behind this feature is to optimize memory usage:
OPA can now operate on more data than fits into the allotted memory
resources. It is NOT meant to be used as a primary source of truth:
there are no backup/restore or disaster recovery procedures -- you
MUST secure the means to restore the data stored with OPA's disk
storage yourself.

See also open-policy-agent#4014. Future improvements around bundle loading are
planned.

Some notes on details:

storage/disk: impose same locking regime used with inmem

With this setup, we'll ensure:

- there is only one open write txn at a time
- there can be any number of open read txns at a time
- writes are blocked when reads are inflight
- during a commit (and triggers being run), no read txns can be created

This is to ensure the same atomic policy update semantics when using
"disk" as we have with "inmem". We're basically opting out of badger's
concurrency control and transactionality guarantees, because we
cannot piggyback on them to ensure the atomic update we want.

There might be other ways -- using subscribers, and blocking in some
other place -- but this one seems preferable since it mirrors inmem.
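
In Go terms, the regime is roughly the following -- a sketch with
illustrative names, not the actual implementation:

    package store

    import "sync"

    type lockedStore struct {
        wmu sync.Mutex   // serializes write transactions
        rmu sync.RWMutex // shared by reads, exclusive during commit
    }

    func (s *lockedStore) BeginWrite() { s.wmu.Lock() }  // one open write txn at a time
    func (s *lockedStore) BeginRead()  { s.rmu.RLock() } // any number of read txns
    func (s *lockedStore) EndRead()    { s.rmu.RUnlock() }

    func (s *lockedStore) Commit() {
        s.rmu.Lock() // wait for in-flight reads; no new reads during commit/triggers
        // ... apply writes to badger, run triggers ...
        s.rmu.Unlock()
        s.wmu.Unlock() // the next write txn may start (an abort would unlock wmu too)
    }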

Part of the problem is ErrTxnTooLarge: committing and renewing txns
when it occurs is the prescribed remedy for txns growing too big, but
it also means that reads can see half of the "logical" transaction as
committed, while the rest is still being processed.

Another approach would have been using `WriteBatch`, but that won't
let us read from the batch, only apply Set and Delete operations.
We currently need to read (via an iterator) to figure out if we
need to delete keys to replace something in the store.  There is
no DropPrefix operation on the badger txn, or the WriteBatch API.

storage/disk: remove commit-and-renew-txn code for txn-too-big errors

This would break transactional guarantees we care about: while there
can be only one write transaction at a time, read transactions may
happen while a write txn is underway -- with the commit-and-renew
logic, those would read partial data.

Now, the error is returned to the caller. The maximum txn size
depends on the size of the memtables, and could be tweaked manually.
In general, the caller should push the data in multiple smaller
increments.
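
A caller-side sketch of that approach (the path and the chunking
strategy are illustrative):

    import (
        "context"

        "github.com/open-policy-agent/opa/storage"
    )

    // writeInChunks stores each top-level key in its own transaction,
    // so that no single write txn outgrows badger's limits.
    func writeInChunks(ctx context.Context, store storage.Store, doc map[string]interface{}) error {
        for k, v := range doc {
            err := storage.Txn(ctx, store, storage.WriteParams, func(txn storage.Transaction) error {
                return store.Write(ctx, txn, storage.AddOp, storage.Path{"users", k}, v)
            })
            if err != nil {
                return err
            }
        }
        return nil
    }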

storage/disk: implement noop MakeDir

The MakeDir operation as implemented in the backend-agnostic storage
code has become an issue with the disk store: to write /foo/bar/baz,
we'd have to read /foo (among other subdirs), and that can be a _lot_
of work for the disk backend. With inmem, it's cheap, so this wasn't
problematic before.

Some of the storage/disk/txn.go logic had to be adjusted to properly
do the MakeDir steps implicitly.

The index argument added to patch() in storage/disk/txn.go was
necessary to keep the error messages conforming to the previous
code path: previously, conflicts (arrays indexed as objects) would
be surfaced in the MakeDir step; now they are entangled with the
patch calculation.

storage/disk: check ctx.Err() in List/Get operations

This won't abort reading a single key, but it will abort iterations.
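
Sketched (not the exact code):

    // inside a List operation: abort the iteration when the caller's
    // context is cancelled; a single-key Get is not interrupted
    for it.Rewind(); it.Valid(); it.Next() {
        if err := ctx.Err(); err != nil {
            return err
        }
        // ... read it.Item() ...
    }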

storage/disk: support patterns in partitions

There is a potential clash here: "*", the path wildcard, is also
a valid path section. However, this only affects the case where a
user wants a partition at

    /foo/*/bar

and really means a literal "*", not the wildcard.

Storing data at /foo/*/bar with a literal "*" won't be treated
differently than storing something at /foo/xyz/bar.
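
As a config sketch, a wildcard partition looks like this (the
directory path is illustrative):

    storage:
      disk:
        directory: /var/lib/opa
        partitions:
        - /tenants/*/bindings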

storage/disk: keep per-txn-type histograms of stats

This is done by reading off the metrics on commit, and shovelling
their numbers into the prometheus collector.

NOTE: if you were to share a metrics object among multiple transactions,
the results would be skewed, as it's not reset. However, our server
handlers don't do that.
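
Sketched (hypothetical names; the actual collector wiring differs):

    import (
        "github.com/open-policy-agent/opa/metrics"
        "github.com/prometheus/client_golang/prometheus"
    )

    var readKeys = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{Name: "disk_read_keys"},
        []string{"txn_type"},
    )

    // on commit, read off the txn's counters and feed them into the
    // per-txn-type histograms
    func record(m metrics.Metrics, txnType string) {
        if n, ok := m.Counter("disk_read_keys").Value().(uint64); ok {
            readKeys.WithLabelValues(txnType).Observe(float64(n))
        }
    }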

storage/disk: opt out of badger's conflict detection

With only one write transaction in flight at any time, the situation
that badger guards against cannot happen:

A transaction has written to a key after the current, to-be-committed
transaction has last read that key from the store.

Since it can't happen, we can ignore the bookkeeping involved. This
improves the time it takes to overwrite existing keys.
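
In badger terms, a sketch:

    import badger "github.com/dgraph-io/badger/v3"

    func open(dir string) (*badger.DB, error) {
        // conflict detection is bookkeeping we don't need: OPA's
        // locking permits only one write transaction at a time
        return badger.Open(badger.DefaultOptions(dir).WithDetectConflicts(false))
    }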

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
srenatus authored and rokkiter committed Apr 18, 2022
1 parent ef31026 commit 0a8a933
Showing 39 changed files with 2,476 additions and 506 deletions.
3 changes: 3 additions & 0 deletions config/config.go
@@ -34,6 +34,9 @@ type Config struct {
	Caching              json.RawMessage `json:"caching,omitempty"`
	PersistenceDirectory *string         `json:"persistence_directory,omitempty"`
	DistributedTracing   json.RawMessage `json:"distributed_tracing,omitempty"`
+	Storage *struct {
+		Disk json.RawMessage `json:"disk,omitempty"`
+	} `json:"storage,omitempty"`
}

// ParseConfig returns a valid Config object with defaults injected. The id
17 changes: 17 additions & 0 deletions docs/content/configuration.md
@@ -837,3 +837,20 @@ The following encryption methods are supported:
| `off` | Disable TLS |
| `tls` | Enable TLS |
| `mtls` | Enable mutual TLS |

### Disk Storage

The `storage` configuration key enables, and configures, the
persistent on-disk storage of an OPA instance.

If `disk` is set, the server enables the on-disk store, with data
kept in the configured `directory`.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `storage.disk.directory` | `string` | Yes | This is the directory to use for storing the persistent database. |
| `storage.disk.auto_create` | `bool` | No (default: `false`) | If set to true, the configured directory will be created if it does not exist. |
| `storage.disk.partitions` | `array[string]` | No | Non-overlapping `data` prefixes used for partitioning the data on disk. |
| `storage.disk.badger` | `string` | No (default: empty) | "Superflags" passed to Badger, allowing advanced options to be modified. |

See [the docs on disk storage](../misc-disk/) for details about the settings.
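
For example (the directory path is illustrative):

```yaml
storage:
  disk:
    directory: /var/lib/opa
    auto_create: true
    partitions:
    - /users/*
```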
177 changes: 177 additions & 0 deletions docs/content/misc-disk.md
@@ -0,0 +1,177 @@
---
title: Disk Storage
kind: misc
weight: 10
---

This page outlines configuration options relevant to using the disk storage
feature of OPA.
Configuration options can be found in [the configuration docs](../configuration/#disk-storage).

{{< info >}}
The persistent disk storage enables OPA to work with data that does not fit
into the memory resources granted to the OPA server.
It is **not** supposed to be used as the primary source of truth for that data.

The on-disk storage should be considered ephemeral: you need to secure the
means to restore that data.
Backup and restore, or repair procedures for data corruption, are not
provided at this time.
{{< /info >}}

## Partitions

Partitions determine how the JSON data is split up when stored in the
underlying key-value store.
For example, this table shows how an example document would be stored given
different configured partitions:

```json
{
  "users": {
    "alice": { "roles": ["admin"] },
    "bob": { "roles": ["viewer"] }
  }
}
```

| Partitions | Keys | Values |
| --- | --- | --- |
| (1) none | `/users` | `{"alice": {"roles": ["admin"]}, "bob": {"roles": ["viewer"]}}` |
| --- | --- | --- |
| (2) `/users` | `/users/alice` | `{"roles": ["admin"]}` |
| | `/users/bob` | `{"roles": ["viewer"]}` |
| --- | --- | --- |
| (3) `/users/*` | `/users/alice/roles` | `["admin"]` |
| | `/users/bob/roles` | `["viewer"]` |

Partitioning has consequences for performance: in the example above, the
number of keys to retrieve from the database (and the amount of data in
their values) varies.

| Query | Partitions | Number of keys read |
| --- | --- | --- |
| `data.users` | (1) | 1 |
| | (2) | 2 |
| | (3) | 2 |
| --- | --- | --- |
| `data.users.alice` | (1) | 1 (with `bob`'s data thrown away) |
| | (2) | 2 |
| | (3) | 2 |


For example, retrieving the full extent of `data.users` from the disk store
requires a single key fetch with partitioning (1).
With (2), the storage engine will fetch two keys and their values.

Retrieving a single user's data, e.g. `data.users.alice`, requires reading
a single key with (1) -- but that key holds all users' data, and everything
not belonging to `alice` is thrown away.

There is no one-size-fits-all setting for partitions: good settings depend
on the actual usage, and that comes down to the policies that are used with
OPA.
Commonly, you would optimize the partition settings for those queries that
are performance critical.

To spot suboptimal partitioning, have a look at the exposed metrics.

## Metrics

Using the [REST API](../rest-api/), you can include the `?metrics` query string
to gain insight into the disk storage accesses related to a given OPA query.

```
$ curl 'http://localhost:8181/v1/data/tenants/acme1/bindings/user1?metrics' | opa eval -I 'input.metrics' -fpretty
{
  "counter_disk_read_bytes": 339,
  "counter_disk_read_keys": 3,
  "counter_server_query_cache_hit": 1,
  "timer_disk_read_ns": 40736,
  "timer_rego_external_resolve_ns": 251,
  "timer_rego_input_parse_ns": 656,
  "timer_rego_query_eval_ns": 66616,
  "timer_server_handler_ns": 117539
}
```

The `timer_disk_*_ns` timers give an indication of how much time
was spent on the different disk operations.

Available timers are:
- `timer_disk_read_ns`
- `timer_disk_write_ns`
- `timer_disk_commit_ns`

Also note the `counter_disk_*` counters in the metrics:

- `counter_disk_read_keys`: number of keys retrieved
- `counter_disk_written_keys`: number of keys written
- `counter_disk_deleted_keys`: number of keys deleted
- `counter_disk_read_bytes`: bytes retrieved

Suboptimal partition settings can be spotted when the number of
keys and bytes retrieved for a query is disproportionate to the
actual data returned: the query likely had to retrieve a giant
JSON object, most of which was then thrown away.

## Debug Logging

Pass `--log-level debug` to `opa run` to see all the underlying storage
engine's logs.

When debug logging is _enabled_, the service will output statistics
about the configured disk partitions: their key counts and estimated
sizes.

```
[DEBUG] partition /tenants/acme3/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme4/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme8/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme9/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme0/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme2/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
[DEBUG] partition /tenants/acme6/bindings (pattern /tenants/*/bindings): key count: 10000 (estimated size 598890 bytes)
```

Note that this process iterates over all database keys.
It only happens on startup, and only when debug logging is enabled.

## Fine-tuning Badger settings (superflags)

While partitioning should be the first thing to look into when tuning the memory
usage and performance of the on-disk storage engine, this configurable gives you the
means to change many internal aspects of how Badger uses memory and disk storage.

{{< danger >}}
To be used with care!

Any of the Badger settings used by OPA can be overridden using this feature.
There is no validation of the configurables set via this flag.

When the embedded Badger version changes, these configurables could change,
too.
{{< /danger >}}

The configurables correspond to Badger options that can be set on [the library's Options
struct](https://pkg.go.dev/github.com/dgraph-io/badger/v3#Options).

The following configurables can *not* be overridden:
- `dir`
- `valuedir`
- `detectconflicts`

Aside from conflict detection, Badger in OPA uses the default options [you can find here](https://github.com/dgraph-io/badger/blob/v3.2103.2/options.go#L128-L187).

Conflict detection is disabled because the locking scheme used within OPA does not
allow multiple concurrent writes.

### Example

```yaml
storage:
  disk:
    directory: /tmp/disk
    badger: nummemtables=1; numgoroutines=2; maxlevels=3
```
8 changes: 8 additions & 0 deletions docs/content/rest-api.md
@@ -1002,6 +1002,10 @@ If the path does not refer to an existing document, the server will attempt to c

The server will respect the `If-None-Match` header if it is set to `*`. In this case, the server will not overwrite an existing document located at the path.

#### Query Parameters

- **metrics** - Return performance metrics in addition to result. See [Performance Metrics](#performance-metrics) for more detail.

#### Status Codes

- **204** - no content (success)
@@ -1094,6 +1098,10 @@ Delete a document.

The server processes the DELETE method as if the client had sent a PATCH request containing a single remove operation.

#### Query Parameters

- **metrics** - Return performance metrics in addition to result. See [Performance Metrics](#performance-metrics) for more detail.

#### Status Codes

- **204** - no content (success)
6 changes: 3 additions & 3 deletions internal/config/config.go
@@ -83,7 +83,7 @@ func Load(configFile string, overrides []string, overrideFiles []string) ([]byte
	processedConf := subEnvVars(string(bytes))

	if err := yaml.Unmarshal([]byte(processedConf), &baseConf); err != nil {
-		return []byte{}, fmt.Errorf("failed to parse %s: %s", configFile, err)
+		return nil, fmt.Errorf("failed to parse %s: %s", configFile, err)
	}
}

@@ -93,7 +93,7 @@ func Load(configFile string, overrides []string, overrideFiles []string) ([]byte
	for _, override := range overrides {
		processedOverride := subEnvVars(override)
		if err := strvals.ParseInto(processedOverride, overrideConf); err != nil {
-			return []byte{}, fmt.Errorf("failed parsing --set data: %s", err)
+			return nil, fmt.Errorf("failed parsing --set data: %s", err)
		}
	}

@@ -105,7 +105,7 @@ func Load(configFile string, overrides []string, overrideFiles []string) ([]byte
			return value, err
		}
		if err := strvals.ParseIntoFile(override, overrideConf, reader); err != nil {
-			return []byte{}, fmt.Errorf("failed parsing --set-file data: %s", err)
+			return nil, fmt.Errorf("failed parsing --set-file data: %s", err)
		}
	}

4 changes: 2 additions & 2 deletions plugins/bundle/plugin.go
@@ -188,7 +188,7 @@ func (p *Plugin) Reconfigure(ctx context.Context, config interface{}) {

	// Deactivate the bundles that were removed
	params := storage.WriteParams
-	params.Context = storage.NewContext()
+	params.Context = storage.NewContext() // TODO(sr): metrics?
	err := storage.Txn(ctx, p.manager.Store, params, func(txn storage.Transaction) error {
		opts := &bundle.DeactivateOpts{
			Ctx: ctx,
@@ -513,7 +513,7 @@ func (p *Plugin) activate(ctx context.Context, name string, b *bundle.Bundle) er
	p.log(name).Debug("Bundle activation in progress. Opening storage transaction.")

	params := storage.WriteParams
-	params.Context = storage.NewContext()
+	params.Context = storage.NewContext().WithMetrics(p.status[name].Metrics)

	err := storage.Txn(ctx, p.manager.Store, params, func(txn storage.Transaction) error {
		p.log(name).Debug("Opened storage transaction (%v).", txn.ID())
