storage: factor StorageCollections out of StorageController #27057

Merged

Conversation

@aljoscha (Contributor) commented May 13, 2024

Note

The first commits are fixes that are required to pass CI, I have opened separate PRs for them. You should only have to look at the last commit.

Preparatory work for #24845, where we want to introduce more concurrency
to the Coordinator and Controllers.

The considerations/design are described in
doc/developer/design/20240117_decoupled_storage_controller.md

This is an intermediate step where we factor a StorageCollections out
of StorageController, and let the StorageController use its
interface instead of holding collections state/sinces itself.

One of the next steps is to change usage sites of StorageController to
use their own handle to a StorageCollections, bypassing the
StorageController for query-processing (PEEKS, SUBSCRIBE, etc.) code
paths. This will let us introduce more concurrency in the Coordinator
and do less work in the main Coordinator loop.

The important parts in this change are:

  • StorageCollections::new and StorageController::new: these closely
    mirror each other.
  • StorageCollections::create_collections and
    StorageController::create_collections, ditto!
  • A lot of the rest is "boilerplate", passing through calls to the
    internal StorageCollections, and using StorageCollections from the
    controller instead of using owned state.
  • The last interesting thing to look at is the new BackgroundTask: it
    takes over the work that persist_handles::PersistReadWorker was
    doing before, plus it continually listens for upper changes and
    forwards the since frontier of collections/their since handles. (A
    simplified sketch of the overall split follows below.)
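For orientation, here is a deliberately simplified sketch of the shape this split takes — illustrative names, plain `u64` frontiers, none of the real signatures (those live in src/storage-client/src/storage_collections.rs):

```rust
// Illustrative sketch only: `u64` stands in for `Antichain<T>`.
use std::collections::BTreeMap;

type GlobalId = u64;
type Frontier = u64;

/// The since/frontier-tracking half that is factored out of the controller.
trait CollectionsLike {
    fn create_collections(&mut self, ids: Vec<GlobalId>);
    /// Returns `(since, upper)` for the collection, if it exists.
    fn collection_frontiers(&self, id: GlobalId) -> Option<(Frontier, Frontier)>;
}

#[derive(Default)]
struct Collections {
    frontiers: BTreeMap<GlobalId, (Frontier, Frontier)>,
}

impl CollectionsLike for Collections {
    fn create_collections(&mut self, ids: Vec<GlobalId>) {
        for id in ids {
            self.frontiers.entry(id).or_insert((0, 0));
        }
    }
    fn collection_frontiers(&self, id: GlobalId) -> Option<(Frontier, Frontier)> {
        self.frontiers.get(&id).copied()
    }
}

/// The controller no longer owns since state; it holds a `Collections` and
/// passes calls through (the "boilerplate" mentioned in the list above).
struct Controller<C: CollectionsLike> {
    collections: C,
}

impl<C: CollectionsLike> Controller<C> {
    fn collection_frontiers(&self, id: GlobalId) -> Option<(Frontier, Frontier)> {
        self.collections.collection_frontiers(id)
    }
}
```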


@aljoscha force-pushed the storage-factor-out-storage-collections branch 5 times, most recently from 9b45260 to 529bf9d on May 14, 2024 at 13:39
@aljoscha (Contributor, Author)

@nrainer-materialize I kicked off a nightly run already because this is quite a deep change to how the StorageController works

@aljoscha marked this pull request as ready for review May 14, 2024 17:30
@aljoscha requested review from a team as code owners May 14, 2024 17:30
@aljoscha requested a review from jkosh44 May 14, 2024 17:30

shepherdlybot bot commented May 14, 2024

Risk Score: 82/100 · Bug Hotspots: 4 · Resilience Coverage: 50%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Feature Flag
  • (Required) Integration Test
  • (Required) Observability 🔍 Detected
  • (Required) QA Review 🔍 Detected
  • (Required) Run Nightly Tests
  • Unit Test
Risk Summary:

The pull request carries a high risk, with a score of 82, indicating a significant chance of introducing bugs. This assessment is based on predictors such as the average age of files, cognitive complexity within files, and changes to executable lines. Historically, pull requests with similar characteristics are 117% more likely to cause a bug compared to the repository baseline. Additionally, 4 files modified in this pull request have a recent history of frequent bug fixes, which contributes to the risk. The repository's observed bug trend is on an upward trajectory, although this is not directly tied to the risk score.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:

File Percentile
../sequencer/inner.rs 99
../src/lib.rs 100
../src/controller.rs 94
../controller/instance.rs 91

@aljoscha (Contributor, Author) commented May 15, 2024

@nrainer-materialize there is a regression in the feature benchmark, but I don't see how the change could only affect full outer joins. It might be a flake?

Other than that, nightly seems good? I did restart RQG dbt3-joins workload once because it timed out.

@nrainer-materialize (Contributor)

> @nrainer-materialize there is a regression in the feature benchmark, but I don't see how the change could only affect full outer joins. It might be a flake?

Let's trigger a retry and see...

@jkosh44 (Contributor) left a comment

Adapter parts LGTM, leaving the main review to storage folks.

Comment on lines 771 to 873
// use std::sync::Arc;
//
// use mz_build_info::DUMMY_BUILD_INFO;
// use mz_ore::metrics::MetricsRegistry;
// use mz_ore::now::SYSTEM_TIME;
// use mz_persist_client::cache::PersistClientCache;
// use mz_persist_client::cfg::PersistConfig;
// use mz_persist_client::rpc::PubSubClientConnection;
// use mz_persist_client::{Diagnostics, PersistClient, PersistLocation, ShardId};
// use mz_persist_types::codec_impls::UnitSchema;
// use mz_repr::{RelationDesc, Row};
//
// use super::*;
//
// #[mz_ore::test(tokio::test)]
// #[cfg_attr(miri, ignore)] // unsupported operation: integer-to-pointer casts and `ptr::from_exposed_addr`
// async fn snapshot_stats(&self) {
// let client = PersistClientCache::new(
// PersistConfig::new_default_configs(&DUMMY_BUILD_INFO, SYSTEM_TIME.clone()),
// &MetricsRegistry::new(),
// |_, _| PubSubClientConnection::noop(),
// )
// .open(PersistLocation {
// blob_uri: "mem://".to_owned(),
// consensus_uri: "mem://".to_owned(),
// })
// .await
// .unwrap();
// let shard_id = ShardId::new();
// let since_handle = client
// .open_critical_since(
// shard_id,
// PersistClient::CONTROLLER_CRITICAL_SINCE,
// Diagnostics::for_tests(),
// )
// .await
// .unwrap();
// let mut write_handle = client
// .open_writer::<SourceData, (), u64, i64>(
// shard_id,
// Arc::new(RelationDesc::empty()),
// Arc::new(UnitSchema),
// Diagnostics::for_tests(),
// )
// .await
// .unwrap();
//
// let worker = PersistReadWorker::<u64>::new();
// worker.register(GlobalId::User(1), since_handle);
//
// // No stats for unknown GlobalId.
// let stats = worker
// .snapshot_stats(
// GlobalId::User(2),
// SnapshotStatsAsOf::Direct(Antichain::from_elem(0)),
// )
// .await;
// assert!(stats.is_err());
//
// // Stats don't resolve for as_of past the upper.
// let stats_fut = worker.snapshot_stats(
// GlobalId::User(1),
// SnapshotStatsAsOf::Direct(Antichain::from_elem(1)),
// );
// assert!(stats_fut.now_or_never().is_none());
// // Call it again because now_or_never consumed our future and it's not clone-able.
// let stats_ts1_fut = worker.snapshot_stats(
// GlobalId::User(1),
// SnapshotStatsAsOf::Direct(Antichain::from_elem(1)),
// );
//
// // Write some data.
// let data = ((SourceData(Ok(Row::default())), ()), 0u64, 1i64);
// let () = write_handle
// .compare_and_append(&[data], Antichain::from_elem(0), Antichain::from_elem(1))
// .await
// .unwrap()
// .unwrap();
//
// // Verify that we can resolve stats for ts 0 while the ts 1 stats call is outstanding.
// let stats = worker
// .snapshot_stats(
// GlobalId::User(1),
// SnapshotStatsAsOf::Direct(Antichain::from_elem(0)),
// )
// .await
// .unwrap();
// assert_eq!(stats.num_updates, 1);
//
// // Write more data and unblock the ts 1 call
// let data = ((SourceData(Ok(Row::default())), ()), 1u64, 1i64);
// let () = write_handle
// .compare_and_append(&[data], Antichain::from_elem(1), Antichain::from_elem(2))
// .await
// .unwrap()
// .unwrap();
// let stats = stats_ts1_fut.await.unwrap();
// assert_eq!(stats.num_updates, 2);
// }
// }
Contributor

Is this coming in a later PR or later commit?

Contributor Author

dayum! forgot about this one 🙈

Contributor

omg I didn't scroll enough and almost re-asked this question in another comment

Comment on lines +180 to +159
/// Checks whether a collection exists under the given `GlobalId`. Returns
/// an error if the collection does not exist.
fn check_exists(&self, id: GlobalId) -> Result<(), StorageError<Self::Timestamp>>;
Contributor

I'm not sure if this is existing code movement, but it's not clear to me whether inactive collections (i.e. collections that have been dropped but still have outstanding ReadHolds) are included in this.

/// associated metadata needed to ingest the particular source.
///
/// This command installs collection state for the indicated sources, and
/// the are now valid to use in queries at times beyond the initial `since`
Contributor

Suggested change
/// the are now valid to use in queries at times beyond the initial `since`
/// they are now valid to use in queries at times beyond the initial `since`

Comment on lines +274 to +272
/// Drops the read capability for the sources and allows their resources to
/// be reclaimed.
///
/// TODO(jkosh44): This method does not validate the provided identifiers.
/// Currently when the controller starts/restarts it has no durable state.
/// That means that it has no way of remembering any past commands sent. In
/// the future we plan on persisting state for the controller so that it is
/// aware of past commands. Therefore this method is for dropping sources
/// that we know to have been previously created, but have been forgotten by
/// the controller due to a restart. Once command history becomes durable we
/// can remove this method and use the normal `drop_sources`.
fn drop_collections_unvalidated(
&mut self,
storage_metadata: &StorageMetadata,
identifiers: Vec<GlobalId>,
);
Contributor

Not necessary for this PR, but I'm just realizing that we can probably get rid of all these drop_.*_unvalidated methods and inline them into drop_.*.

@guswynn (Contributor) left a comment

The use of non-persist feedback to move uppers was something that always annoyed me, so that change is a huge + in my opinion!

I also liked the use of extra_state on CollectionState, the alternative of having lots of random maps in the state was quite annoying!


Unfortunately, I think my review is going to require 1 round of back and forth; I asked some questions that I need answered to get enough context to review the rest of the read hold and storage controller changes!


Comment on lines +588 to +609
let dependency_read_holds = self
.storage_collections
.acquire_read_holds(storage_dependencies)
.expect("can acquire read holds");
Contributor

What happens when we are restarting envd and recreating collections in this StorageCollections, but before we acquire read holds for deps? Couldn't the compute controller downgrade the since through its StorageCollections if it boots before us?

Contributor Author

Currently that can't happen, because create_collections is always called from the StorageController. In the future, the protocol has to be that the StorageController gets a chance to run before anything else happens. And it (the StorageController that is "local" to that cluster/responsible for that cluster) will install read holds.

Also, for the future: there will be a SinceHandle per cluster, and on the cluster that is responsible for maintaining a collection, the StorageController acquires since holds through that. But the SinceHandle of other clusters doesn't need to be held back beyond what its ComputeController/the adapter have as requirements.

if !dependency_read_holds.is_empty() {
let mut dependency_since = Antichain::from_elem(T::minimum());
for read_hold in dependency_read_holds.iter() {
dependency_since.join_assign(read_hold.since());
Contributor

I like using join instead of the bespoke logic above!!

also, we should calculate the dependency_since above the if statement, to avoid copying this code 2 times below!
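A toy version of that suggestion, with `u64` standing in for `Antichain<T>` so that the join degenerates to `max`; the real loop uses `join_assign` on each hold's since, as in the snippet above:

```rust
// Sketch only: `u64` frontiers, so "join" degenerates to `max`.
type Timestamp = u64;

/// Compute the joint since of all dependency read holds once, before any
/// branching, so the loop is not duplicated in both branches.
fn dependency_since(read_hold_sinces: &[Timestamp]) -> Timestamp {
    // Stand-in for Antichain::from_elem(T::minimum()).
    let mut since = Timestamp::MIN;
    for s in read_hold_sinces {
        // Stand-in for dependency_since.join_assign(read_hold.since()).
        since = since.max(*s);
    }
    since
}
```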

Contributor Author

👍

Comment on lines +1799 to +1754
let dropped_subsources = self
.dropped_ingestions
.remove(id)
.expect("missing dropped subsources");

// The cluster is not sending these, so we take
// matters into our own hands!
tracing::debug!(?dropped_subsources, "synthesizing DroppedIds messages for subsources and the remap shard");
self.internal_response_sender
.send(StorageResponse::DroppedIds(
dropped_subsources.into_iter().collect(),
))
.expect("we are still alive");
Contributor

I don't see the code this is replacing... is this fixing a separate known issue as part of this PR? Additionally, why don't we just find the set of subsources/remap shard at this moment, instead of using this intermediate state? My understanding is that self.collections still has all that data at this point?

Contributor Author

It's new, and it fixes part of the issue that we're never cleaning up subsource state! I fixed this because I was annoyed that the StorageController never drops the hold that it has with StorageCollections, meaning the latter never cleans up its state.

> Additionally, why don't we just find the set of subsources/remap shard at this moment, instead of using this intermediate state? My understanding is that self.collections still has all that data at this point?

Unfortunately, we don't have that state anymore. drop_sources_unvalidated removes the IngestionExport from the Ingestion, and I didn't want to touch that flow more than necessary. 🙈 In drop_sources_unvalidated, I now basically move the state over, so that we still have it here where we then synthesize messages. That whole flow around DroppedIds needs some love (before this PR and certainly also after!).
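A rough sketch of that stashing, reusing the names visible in the snippet above (`dropped_ingestions`, `StorageResponse::DroppedIds`) but with everything else simplified:

```rust
// Sketch only: simplified stand-ins, not the real controller types.
use std::collections::BTreeMap;

type GlobalId = u64;

enum StorageResponse {
    DroppedIds(Vec<GlobalId>),
}

#[derive(Default)]
struct ControllerSketch {
    // Ingestion id -> subsource/remap-shard ids, stashed at drop time while
    // drop_sources_unvalidated still knows them.
    dropped_ingestions: BTreeMap<GlobalId, Vec<GlobalId>>,
    // Stand-in for the internal response channel.
    internal_responses: Vec<StorageResponse>,
}

impl ControllerSketch {
    // Called on the drop path: move the subsource ids aside before the
    // ingestion's own state is torn down.
    fn stash_dropped_subsources(&mut self, ingestion: GlobalId, subsources: Vec<GlobalId>) {
        self.dropped_ingestions.insert(ingestion, subsources);
    }

    // Called when the cluster reports DroppedIds for the ingestion itself;
    // the cluster never reports the subsources, so they are synthesized here.
    fn handle_dropped_ingestion(&mut self, ingestion: GlobalId) {
        if let Some(subsources) = self.dropped_ingestions.remove(&ingestion) {
            self.internal_responses
                .push(StorageResponse::DroppedIds(subsources));
        }
    }
}
```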

Contributor

I agree, let's not touch flows more than necessary!

It's very clarifying to my reading of these changes to know this is new!

Contributor

One thing I realized is that now that we no longer use FrontierUppers feedback (instead just watching the persist shard), we might not need to wait for DroppedIds before deleting state; this may let us avoid this dance, I think a TODO to reconsider that is worth adding here!

Contributor Author

I consider DroppedIds to be a separate part of the protocol. Petros (not summoning him here 😅) has opinions about how the protocol should evolve, so I don't want to put anything down here right now.

Comment on lines 2870 to 2926
// TODO(guswynn): we need to be more careful about the update time we get here:
// <https://github.com/MaterializeInc/materialize/issues/25349>
Contributor

As an aside: can we guarantee at this point, that the old storage controller is fenced out, and that the upper from .collections_frontier() below is linearizably the greatest upper?

Contributor Author

I don't know! I think maybe we never could, and I think certainly information about uppers can always be outdated.

@guswynn (Contributor) commented May 16, 2024

I think ideally we need to ensure another writer to a shard is fenced... I'll have to come back to this; certainly as we end up with multiple controllers writing to the statistics shard, we have to be a bit careful

Comment on lines 124 to 130
fn collection_frontiers(
&self,
id: GlobalId,
) -> Result<
(Antichain<Self::Timestamp>, Antichain<Self::Timestamp>),
StorageError<Self::Timestamp>,
>;
Contributor

this should be a provided method built on top of collections_frontiers

Contributor Author

will do!
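A sketch of what such a provided method could look like — stand-in types, not the actual trait:

```rust
// Simplified stand-ins; not the real signatures.
type GlobalId = u64;
type Frontier = Vec<u64>; // stand-in for Antichain<Self::Timestamp>

#[derive(Debug)]
struct StorageError(String);

trait CollectionFrontiers {
    /// Batch lookup; the only method implementors must provide.
    fn collections_frontiers(
        &self,
        ids: Vec<GlobalId>,
    ) -> Result<Vec<(GlobalId, Frontier, Frontier)>, StorageError>;

    /// Provided single-id convenience, built on top of the batch method.
    fn collection_frontiers(
        &self,
        id: GlobalId,
    ) -> Result<(Frontier, Frontier), StorageError> {
        let mut frontiers = self.collections_frontiers(vec![id])?;
        let (_id, since, upper) = frontiers
            .pop()
            .ok_or_else(|| StorageError(format!("collection {id} does not exist")))?;
        Ok((since, upper))
    }
}
```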

Comment on lines 135 to 136
/// requested collections, ensures that we can get a consistent "snapshot"
/// of collection state. If we had separate methods instead, and/or would
Contributor

What semantics does this snapshot have? There is nothing relating the upper frontier of any collections, right?

Contributor Author

I put this comment in largely because collection state is now behind a lock and can be modified concurrently when you get frontiers (or other things) one-by-one. With this method we lock once, take the state we need, and release.

Should I just remove this comment about "snapshot"?

Contributor

I would replace this with "atomically", I think!
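To make the "atomically" point concrete, a toy illustration — illustrative names and state, not the real StorageCollections internals:

```rust
// Illustrative only: `u64` frontiers and a plain Mutex; the point is the
// single lock acquisition, not the concrete state layout.
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

type GlobalId = u64;

#[derive(Clone, Copy)]
struct Frontiers {
    since: u64,
    upper: u64,
}

struct Collections {
    state: Arc<Mutex<BTreeMap<GlobalId, Frontiers>>>,
}

impl Collections {
    // Atomic across all requested ids: one lock acquisition, so no concurrent
    // update can interleave between the individual lookups.
    fn collections_frontiers(&self, ids: &[GlobalId]) -> Vec<(GlobalId, Frontiers)> {
        let state = self.state.lock().expect("lock poisoned");
        ids.iter()
            .filter_map(|id| state.get(id).map(|f| (*id, *f)))
            .collect()
    }

    // One-by-one lookups each take the lock separately, so a sequence of
    // calls may observe different points in time.
    fn collection_frontiers(&self, id: GlobalId) -> Option<Frontiers> {
        self.state.lock().expect("lock poisoned").get(&id).copied()
    }
}
```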

///
/// This is a separate set from `finalizable_shards` because we know that
/// some environments have many, many finalizable shards that we are
/// struggling to finalize.
Contributor

I'm not sure this explains the fact that we have both very well...finalized_shards is in fact cleared periodically, right?

Contributor Author

I copied these as-is from StorageController, but yes, the explanation is not super clear. I'll give it a think.

Contributor

thank you!

@guswynn (Contributor) left a comment

Structurally I am happy with this, thank you for cleaning up some of the messier code/structure of the controller while you were at it! I think the change to pubsub-for-upper-tracking as opposed to StorageResponse's is phenomenal, that always bugged me.

I have some minor nits, questions, and comments, as well as some slightly-more-substantial suggestions on how to wield async-rust! (Also I am trying to avoid getting nerdsniped into cleaning up our source dropping/shard finalization logic)

Other than those, I have one larger point: I want to look at the read hold/read capability code again. I am going to accept and not block on this, because deep in my soul I believe CI will catch any bugs, but I want to make the point that I think this frontier management stuff has gotten to the boundary of what is understandable by a single person. I think there are 3 things we can do:

  • Firstly, at some point, modularize all the read capability code into its own module, instead of leaving it in storage_collections.
  • Clarify the connection between ReadHolds, ReadPolicys, and "read capabilities" (as far as I can tell a "read capability" is a frontier that encapsulates a ReadPolicy + all the outstanding read holds; see the toy sketch after this list). I think that we could come up with an abstraction that unifies these into some set of subtypes, instead of having them spread around the controller state.
  • Rename (and clarify) what an "implied capability" is. As far as I can tell, it's the since frontier of a collection as understood by the storage controller, ignoring anyone else's read holds. And, as far as I can tell, its only real use is to manage the ReadHold that the controller holds. This can also be abstracted into the ReadHold struct, I feel.
    • Maybe this is just me, but "implied" makes me think that it's the capability that meets all ReadHolds??
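The toy sketch referenced above, spelling out that reading of the terms with `u64` timestamps (a made-up `LagPolicy` stands in for ReadPolicy; the real code works on antichains):

```rust
// Toy model with `u64` timestamps; `min` plays the role of the meet of
// frontiers.
type Timestamp = u64;

struct LagPolicy {
    lag: Timestamp,
}

impl LagPolicy {
    // The "implied capability": what the policy alone would allow as the
    // since, ignoring everyone else's read holds.
    fn implied_capability(&self, upper: Timestamp) -> Timestamp {
        upper.saturating_sub(self.lag)
    }
}

// The effective since may not advance past the implied capability or any
// outstanding read hold.
fn effective_since(policy: &LagPolicy, upper: Timestamp, holds: &[Timestamp]) -> Timestamp {
    holds
        .iter()
        .copied()
        .chain(std::iter::once(policy.implied_capability(upper)))
        .min()
        .expect("at least the implied capability is present")
}
```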

I think this pr is good as is, and has absolutely made the right choice to leave things mostly as they were. But as we begin to split the storage controller into more pieces (and into separate clusters), I think we are going to hit huge issues with this complexity, which I think points to at least some future refactoring after this pr merges!

Comment on lines -748 to -772
/// The capability (hold on the since) that this export needs from its
/// dependencies (inputs). When the upper of the export changes, we
/// downgrade this, which in turn downgrades holds we have on our
/// dependencies' sinces.
pub read_capability: Antichain<T>,
Contributor

this drove me nuts; thank you for making it just a ReadHold

Comment on lines 77 to 78
/// - Hands out [ReadHold] that prevent a collection's since from advancing
/// while it needs to be read at a specific time.
Contributor

Suggested change
/// - Hands out [ReadHold] that prevent a collection's since from advancing
/// while it needs to be read at a specific time.
/// - Hands out [ReadHold]s that prevent a collection's since from advancing
/// while it needs to be read at a specific time.

Contributor Author

I'll change to [ReadHolds](ReadHold)

Comment on lines +87 to +97
init_ids: BTreeSet<GlobalId>,
drop_ids: BTreeSet<GlobalId>,
Contributor

documenting these would be great!

Contributor Author

👍

Comment on lines 107 to 113
/// Marks the end of any initialization commands.
///
/// The implementor may wait for this method to be called before
/// implementing prior commands, and so it is important for a user to invoke
/// this method as soon as it is comfortable. This method can be invoked
/// immediately, at the potential expense of performance.
fn initialization_complete(&mut self);
Contributor

what commands?

Contributor

I had the same question! On the controllers this refers to ComputeCommand/StorageCommand but StorageCollections is not concerned with these, right?

Contributor Author

In my mind, any calls from the adapter/Coordinator to StorageCollections are commands, because you can also imagine them having a channel as interface between them. So I kept the wording from StorageController.

I'll now remove this method from the trait altogether, because its existence and documentation confused both of you. 😅

/// [StorageCollections::drop_collections]. Collections that have been
/// dropped but still have outstanding [ReadHolds](ReadHold) are not
/// considered active for this method.
fn active_collection_metadatas(&self) -> Vec<(GlobalId, CollectionMetadata)>;
Contributor

nit: can we group this with collection_metadata?

Contributor Author

👌

Comment on lines +1428 to +1374
// Webhooks and tables are dropped differently from
// ingestions and other collections.
Contributor

Suggested change
// Webhooks and tables are dropped differently from
// ingestions and other collections.
// Webhooks and tables are dropped differently from
// ingestions and other collections. We can immediately compact
// them, because they don't interact with clusterd.

Contributor Author

👌


}
CollectionStateExtra::None => {
// No read holds for other types of collections!
tracing::info!("DroppedIds for collection {id}");
Contributor

nit: debug!

Contributor Author

👌

Comment on lines +2544 to +2614
#[instrument(level = "debug", fields(updates))]
fn update_write_frontiers(&mut self, updates: &[(GlobalId, Antichain<T>)]) {
Contributor

document that it's only for compute?

Ideally we could use persist feedback to drive this in the future, as well, right?

Contributor Author

It's actually not for compute, it's only used internally in the controller. I left the controller/cluster protocol unchanged, so we still get FrontierUppers and DroppedIds, and we drive around the read holds of ingestions based on that. It could be changed, but I didn't want to go that far.

Comment on lines 3464 to 3538
/// The policy that drives how we downgrade our read hold. That is how we
/// derive our since from our upper.
pub hold_policy: ReadPolicy<T>,
Contributor

could we replace this (and all the derive_since and read_capabilities stuff) if we had a way to add a read policy to the policies set externally? Not a blocker, but definitely a bit weird that we hand-manage a policy just to hold some ReadHolds correctly

Contributor Author

I actually like it like this: StorageCollections has a policy that is set by the adapter. The StorageController has a policy for its internal needs, which it uses to drive forward the read handle that it has at StorageCollections.

The ReadPolicy basically encodes: "have we been dropped". And we could probably make that more explicit. But I again didn't want to go down that road right now.

Contributor

consider me convinced! might be worth saying in the comment: This is a _storage-controller-internal_ policy used to derive its personal read hold on the collection.

@teskje (Contributor) left a comment

Compute parts lgtm. I only skimmed over the storage controller changes.

})
.collect();
self.storage_controller
.update_write_frontiers(&storage_updates);
Contributor

🥳

src/controller/src/lib.rs (resolved conversation, outdated)
src/controller/src/lib.rs (resolved conversation)

Comment on lines +320 to +299
/// Applies `updates` and sends any appropriate compaction command.
///
/// This is a legacy interface that should _not_ be used! It is only used by
/// the compute controller.
Contributor

If it's only used by the compute controller, the answer is "hopefully soon". I plan on porting the compute controller to the new ReadHolds interface once the whole storage controller refactor is merged.

src/storage-client/src/storage_collections.rs (resolved conversation, outdated)
#[derive(Debug, Clone)]
pub struct StorageCollectionsImpl<
T: TimelyTimestamp + Lattice + Codec64 + From<EpochMillis> + TimestampManipulation,
> {
Contributor

I noticed that in contrast to the StorageController, for StorageCollections the trait impl does not live in a separate crate. AFAIU the purpose of the StorageController trait is to free clients from having to depend on the implementation crate. Is that understanding wrong or does the split have a different purpose for StorageCollections?

Contributor Author

That was the motivation, yes! But all of StorageCollections is pretty much a "client thing", which is why its impl can live in the client crate. I did keep the customary separation out of a sense of consistency, plus it does hide away some of the implementation details.

Contributor

Fine for me! I think there is some friction introduced by having this trait in that "go to definition" leads you to the trait method definitions and there is no good way to get from there to the implementation (or at least I haven't found one apart from ctrl+f). But the point about hiding implementation details is also valid.

src/storage-client/src/controller.rs (resolved conversation)
@aljoscha (Contributor, Author)

@guswynn thanks for the very thorough review, I pushed a lot of individual fixup commits that address your comments!

@teskje also pushed commits for your comments. And also, thanks!

@aljoscha force-pushed the storage-factor-out-storage-collections branch from a7970b7 to 2dff0af on May 23, 2024 at 14:12
@guswynn (Contributor) commented May 23, 2024

fixup commits look good!

@aljoscha force-pushed the storage-factor-out-storage-collections branch from 2dff0af to 92e9b95 on May 24, 2024 at 09:26
Preparatory work for MaterializeInc#24845, where we want to introduce more concurrency
to the Coordinator and Controllers.

The considerations/design are described in
[doc/developer/design/20240117_decoupled_storage_controller.md](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/20240117_decoupled_storage_controller.md)

This is an intermediate step where we factor a `StorageCollections` out
of `StorageController`, and let the `StorageController` use its
interface instead of holding collections state/sinces itself. Note that
there are now a lot of methods in `StorageController` that pass through
to `StorageCollections`. A large part of this can be removed once we
change usage sites to use a `StorageCollections` directly.

One of the next steps is to change usage sites of `StorageController` to
use their own handle to a `StorageCollections`, bypassing the
`StorageController` for query-processing (PEEKS, SUBSCRIBE, etc.) code
paths. This will let us introduce more concurrency in the Coordinator
and do less work in the main Coordinator loop.

The important parts in this change are:

- `StorageCollections::new` and `StorageController::new`: these closely
  mirror each other.
- `StorageCollections::create_collections` and
  `StorageController::create_collections`, ditto!
- A lot of the rest is "boilerplate", passing through calls to the
  internal `StorageCollections`, and using `StorageCollections` from the
  controller instead of using owned state.
- The last interesting thing to look at is the new `BackgroundTask`: it
  takes over the work that `persist_handles::PersistReadWorker` was
  doing before plus it continually listens for upper changes and
  forwards the since frontier of collections/their since handles.

@aljoscha force-pushed the storage-factor-out-storage-collections branch from 92e9b95 to a986ee0 on May 24, 2024 at 15:36
@aljoscha merged commit 01ea188 into MaterializeInc:main May 24, 2024
77 of 78 checks passed
@aljoscha deleted the storage-factor-out-storage-collections branch May 24, 2024 16:42
aljoscha added a commit to aljoscha/materialize that referenced this pull request May 27, 2024
Fixes MaterializeInc#27304

The inline comment explains the mechanism/problem that this "fixes".

My recent change that factors a `StorageCollections` out of the
`StorageController`
(MaterializeInc#27057) made an
existing bug more problematic. See below!

Before the mentioned change, this could happen:

1. create cluster c
2. create source/sink on cluster c
3. do a DROP CLUSTER c cascade
4. cluster processes are killed before they get a chance to send back
   DroppedIds messages
5. controller does not clean out state about that collection (until the
   next restart)

With my refactor, this would happen, which manifests in the observed bug:

1. create cluster c
2. create source/sink on cluster c, **this acquires a read hold at
   StorageCollections and stores it in that collection's state**
3. do a DROP CLUSTER c cascade
4. cluster processes are killed before they get a chance to send back
   DroppedIds messages
5. controller does not clean out state about that collection (until the
   next restart)
6. **the read hold is never released**, meaning StorageCollections does
   not clean out some of its state and we still report frontiers for
   these "active collections"
aljoscha added a commit to aljoscha/materialize that referenced this pull request May 28, 2024
Fixes MaterializeInc#27304

My recent change that factors a `StorageCollections` out of the
`StorageController`
(MaterializeInc#27057) made an
existing bug more problematic. See below!

Before the mentioned change, this could happen:

1. create cluster c
2. create source/sink on cluster c
3. do a DROP CLUSTER c cascade
4. cluster processes are killed before they get a chance to send back
   DroppedIds messages
5. controller does not clean out state about that collection (until the
   next restart)

With my refactor, this would happen, which manifests in the observed bug:

1. create cluster c
2. create source/sink on cluster c, **this acquires a read hold at
   StorageCollections and stores it in that collection's state**
3. do a DROP CLUSTER c cascade
4. cluster processes are killed before they get a chance to send back
   DroppedIds messages
5. controller does not clean out state about that collection (until the
   next restart)
6. **the read hold is never released**, meaning StorageCollections does
   not clean out some of its state and we still report frontiers for
   these "active collections"

Due to reasons (tm) my previous changes did make it so that we only drop
read holds when getting a DroppedIds message: the previous code would
only attempt shard finalization after getting a DroppedIds, and
StorageCollections starts to attempt shard finalization when all read
holds have been dropped. This preserved the previous behavior of
StorageController.

With this here change, we allow eagerly dropping the read holds
(advancing their since to the empty frontier), which we previously did
not allow, on purpose. This makes it so that we correctly drop their
state, and no longer report their frontiers. But it also makes it so
that we attempt shard finalization slightly earlier. I think that is
okay, though.
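A toy sketch of the mechanism described above — hypothetical names, with `None` playing the role of the empty frontier that the real code advances a ReadHold's since to:

```rust
// Sketch only, not the real ReadHold API: `Some(ts)` is a hold at `ts`,
// `None` plays the role of the empty frontier (hold released).
type Timestamp = u64;

struct ReadHoldSketch {
    since: Option<Timestamp>,
}

impl ReadHoldSketch {
    // Eagerly release the hold when the collection is dropped, instead of
    // waiting for a DroppedIds message that may never arrive.
    fn release(&mut self) {
        self.since = None;
    }

    // A collection whose holds are all released can have its state cleaned
    // up and its shard considered for finalization.
    fn is_released(&self) -> bool {
        self.since.is_none()
    }
}
```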