Dropping rocks column families instead of running compaction #16589

Closed
carllin opened this issue Apr 16, 2021 · 16 comments

carllin commented Apr 16, 2021

Problem

Rocks compaction can stall writes when the dataset is sufficiently large

Proposed Solution

Drop column families and create new ones instead of running compaction

Pending work items:

Work Item 1: Upstream change to destroy column family handles

Follow-up to Ryo's upstream change here: rust-rocksdb/rust-rocksdb#506. The current drop_cf() implementation in the upstream crate doesn't free the memory. There's a difference between "dropping" the column family and "destroying" all the outstanding Rocksdb handles to the column family, explained here: https://github.com/facebook/rocksdb/blob/master/db/column_family.h#L292-L299.

More details below.

Currently, the Rocks crate we're using exposes a ColumnFamily object

pub struct ColumnFamily {
    pub(crate) inner: *mut ffi::rocksdb_column_family_handle_t,
}

that wraps a single raw pointer to a RocksDb C++ handle object (https://github.com/facebook/rocksdb/blob/master/db/c.cc#L973-L974, i.e. rocksdb_column_family_handle_t* handle = new rocksdb_column_family_handle_t;), which in turn aliases an object defined here:

class ColumnFamilyHandle {
 public:
  virtual ~ColumnFamilyHandle() {}
  // Returns the name of the column family associated with the current handle.
  virtual const std::string& GetName() const = 0;
  // Returns the ID of the column family associated with the current handle.
  virtual uint32_t GetID() const = 0;
  // Fills "*desc" with the up-to-date descriptor of the column family
  // associated with this handle. Since it fills "*desc" with the up-to-date
  // information, this call might internally lock and release DB mutex to
  // access the up-to-date CF options.  In addition, all the pointer-typed
  // options cannot be referenced any longer than the original options exist.
  //
  // Note that this function is not supported in RocksDBLite.
  virtual Status GetDescriptor(ColumnFamilyDescriptor* desc) = 0;
  // Returns the comparator of the column family associated with the
  // current handle.
  virtual const Comparator* GetComparator() const = 0;
};

where the destructor ~ColumnFamilyHandle() is what decrements the refcount to the column family and performs cleanup of old files: https://github.com/facebook/rocksdb/blob/master/db/column_family.cc#L68-L84. Thus until this runs, we won't free the memory.

In the rust-rocksdb crate, however, the destructor is only invoked through rocksdb_column_family_handle_destroy() (https://github.com/facebook/rocksdb/blob/master/db/c.cc#L997-L1000), which is not what drop_cf calls: https://github.com/rust-rocksdb/rust-rocksdb/blob/master/src/db.rs#L687

Because we want the ability to destroy a column family without closing the entire database, this seems to imply that in the upstream rust-rocksdb crate we would have to:

  1. Wrap the ColumnFamily in an Arc<ColumnFamily>
  2. Expose a destroy_cf() method
  3. Implement Drop for ColumnFamily such that it invokes a custom handler supplied by consumers of the crate (sketched below). See Work Item 2 for how we would use this.
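
For illustration, here is a rough sketch of what that could look like. This is not the real rust-rocksdb API; the names (RawCfHandle, on_last_drop, Db) are hypothetical and the FFI details are elided:

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the raw FFI handle that rust-rocksdb owns.
struct RawCfHandle {
    name: String,
}

// Sketch: a ColumnFamily whose Drop runs a handler supplied by the consumer
// of the crate, so the consumer decides when and where to destroy the handle.
struct ColumnFamily {
    raw: RawCfHandle,
    on_last_drop: Box<dyn Fn(&RawCfHandle) + Send + Sync>,
}

impl Drop for ColumnFamily {
    fn drop(&mut self) {
        // Runs only when the last Arc<ColumnFamily> clone goes away.
        (self.on_last_drop)(&self.raw);
    }
}

struct Db {
    cfs: Mutex<HashMap<String, Arc<ColumnFamily>>>,
}

impl Db {
    // drop_cf() only removes the map entry; outstanding Arc clones held by
    // readers keep the handle alive until they finish.
    fn drop_cf(&self, name: &str) {
        self.cfs.lock().unwrap().remove(name);
    }

    // Readers clone the Arc, pinning the handle for the duration of their use.
    fn cf_handle(&self, name: &str) -> Option<Arc<ColumnFamily>> {
        self.cfs.lock().unwrap().get(name).cloned()
    }
}
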
Work Item 2: Safely consume work item 1 in our own services

The compaction/cleanup thread will call drop_cf() during cleanup. At this point, if Work Item 1 is implemented, the underlying rust-rocksdb crate will drop its reference to the ColumnFamily object from its hashmap of column family handles. This means the only remaining references to the ColumnFamily object are held by our services that are still performing lookups that began before the call to drop_cf(). Once the last reference to a column family is dropped, the handler we passed to the rust-rocksdb crate (described above in Work Item 1) will be called.

Now what should this handler do? We need a place in our services where it's safe to destroy the dropped ColumnFamily object. The most straightforward option would be for the handler to call destroy() on the dropped column family directly. However, we don't want critical threads like ReplayStage or WindowService to eat the latency of destroying the column family if they are so unfortunate as to hold the last reference to it. For instance, let's say we run into a scenario like:

1) ReplayStage grabs the Arc<ColumnFamily> to do a lookup from blockstore
2) Our cleanup thread calls drop_cf()
3) ReplayStage finishes the lookup, drops the `Arc<ColumnFamily>`, and eats the destroy() latency

So alternatively, the handler we pass could use a channel to pipe a signal to our cleanup thread. The cleanup thread would then call the newly defined destroy_cf() method (described above in Work Item 1) on the newly dropped column family, as in the sketch below.
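
A minimal sketch of that wiring, assuming a hypothetical make_drop_handler helper and a destroy_cf callback standing in for the new method from Work Item 1:

use std::sync::mpsc::{Receiver, Sender};
use std::thread;

// The handler only sends a signal, so whichever thread (e.g. ReplayStage)
// happens to drop the last reference never pays the destroy() cost itself.
fn make_drop_handler(tx: Sender<String>) -> impl Fn(&str) + Send {
    move |cf_name: &str| {
        // Ignore send errors if the cleanup thread has already shut down.
        let _ = tx.send(cf_name.to_string());
    }
}

// Cleanup-thread side: drain the channel and do the expensive destroy here,
// calling whatever stands in for the destroy_cf() from Work Item 1.
fn spawn_cleanup_thread(
    rx: Receiver<String>,
    destroy_cf: impl Fn(&str) + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for cf_name in rx {
            destroy_cf(&cf_name);
        }
    })
}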

Work Item 3: Make LedgerCleanupService use drop_cf()

The tricky part here is figuring out how to bucket slots into different column families.

Unlike transaction statuses, which are currently written out sequentially (ReplayStage replays blocks in order, so the transaction statuses produced by replay are also in order), slots can arrive out of order. For instance, a node catching up to the tip of the network may be repairing/replaying blocks in the slot range 1-1000 while receiving blocks in the range 5000+.

We wouldn't want to group these disjoint ranges of shreds together in some unpredictable combination within a column family, because ultimately the cleanup service purges a consecutive range of "outdated" slots once the cluster root has moved sufficiently far ahead or the database has grown past a certain size. This means we need to cleanly support APIs like deleting the range of slots [N, M].

Another decision that needs to be made is whether to clean based on the number of slots (say, if the number of slots > some X) or based on whether the overall size of the outstanding column families has hit a certain threshold (this is what's done with the TransactionStatus column family today). The latter seems harder because slots can vary in the number of shreds they contain, so the size of each slot can vary dramatically.

One naive way to break something like the data shreds across multiple column families is to allocate some number N of column families and assign a slot range to each of them. For instance, the first column family would contain slots in the range [0, 999], the second the range [1000, 1999], and so on up to [(N-1) * 1000, N * 1000).
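
As a rough illustration of this bucketing (the constant and naming scheme below are hypothetical):

// Hypothetical fixed window of slots per column family.
const SLOTS_PER_CF: u64 = 1_000;

// Bucket index for a slot: the first column family covers slots [0, 999],
// the second [1000, 1999], and so on.
fn bucket_for_slot(slot: u64) -> u64 {
    slot / SLOTS_PER_CF
}

// Slot range [start, end) covered by a given bucket.
fn slot_range_for_bucket(bucket: u64) -> (u64, u64) {
    (bucket * SLOTS_PER_CF, (bucket + 1) * SLOTS_PER_CF)
}

// Hypothetical naming scheme for the per-bucket data shred column family.
fn cf_name_for_slot(slot: u64) -> String {
    format!("data_shred_{}", bucket_for_slot(slot))
}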

A few things about this above approach that need refinement:

  1. What if we are replaying slot 5, but the tip of the network is at (N + 1) * 1000, which is outside the range of the last column family? Maybe here we can write those very far-future slots into a separate file-backed store that doesn't put load on RocksDb and trigger compaction. We can then stream them into Rocks later, when we start replaying that range of slots.

  2. When do we drop and recreate each column family? A simple scheme would be to drop the column family at index (M + N - 1) % N, where M is the index of the column family containing the current root/confirmed block (see the sketch after this list). Another option is to start dropping earlier column families if their combined size gets too large.

  3. However, with the approach in 2 it's hard to drop column families based on size, as we currently do. If some slots in an earlier column family are prohibitively large, there's no good way to "spill over" into a later column family. Hopefully, choosing the right interval size (e.g. something like 1000 slots) gives us enough flexibility that the size of a single column family won't get too large, and we can drop ranges as described in 2 without losing too much recent history (i.e. if you're replaying slot 2000, you probably don't want to drop slot 1999, because others in the cluster may still need access to it for repair, RPC queries, etc.).
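
A tiny sketch of the rotation rule from point 2 (purely illustrative; the choice of N is an assumption):

// Rotation rule from point 2: with N rotating column families and M = index
// of the column family holding the current root/confirmed block, the simple
// scheme drops and recreates the column family at (M + N - 1) % N.
fn cf_index_to_drop(root_cf_index: u64, num_cfs: u64) -> u64 {
    (root_cf_index + num_cfs - 1) % num_cfs
}

// e.g. with N = 8: root in bucket 0 -> drop bucket 7; root in bucket 3 -> drop bucket 2.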

Tag @CriesofCarrots, @ryoqun, @sakridge

context: #14586

ryoqun commented Apr 16, 2021

(oh sad. we need another upstream pr?)

ryoqun commented Apr 16, 2021

@carllin hmm, I think the fix could actually be simpler:

It seems this is kind of a memory-leak bug in rust-rocksdb: drop_cf()-ed column families' handles aren't matched with rocksdb_column_family_handle_destroy() in DB::drop_cf_all(), so calling it inside drop_cf() should actually be expected?

Then, our code would just pass the to-be-dropped column family names to a background CF-cleaning thread and do the actual drop-cf process there?

 impl DBWithThreadMode<MultiThreaded> {
     /// Creates column family with given name and options
     pub fn create_cf<N: AsRef<str>>(&self, name: N, opts: &Options) -> Result<(), Error> {
         let inner = self.create_inner_cf_handle(name.as_ref(), opts)?;
         self.cfs
             .cfs
             .write()
             .unwrap()
             .insert(name.as_ref().to_string(), ColumnFamily { inner });
         Ok(())
     }
 
     /// Drops the column family with the given name by internally locking the inner column
     /// family map. This avoids needing `&mut self` reference
     pub fn drop_cf(&self, name: &str) -> Result<(), Error> {
         let inner = self.inner;
         if let Some(cf) = self.cfs.cfs.write().unwrap().remove(name) {
             unsafe {
                 ffi_try!(ffi::rocksdb_drop_column_family(inner, cf.inner));
+                ffi::rocksdb_column_family_handle_destroy(cf.inner);
             }
             Ok(())
         } else {
             Err(Error::new(format!("Invalid column family: {}", name)))
         }
     }
 
     /// Returns the underlying column family handle
     pub fn cf_handle(&self, name: &str) -> Option<BoundColumnFamily> {
         self.cfs
             .cfs
             .read()
             .unwrap()
             .get(name)
             .map(|cf| BoundColumnFamily {
                 inner: cf.inner,
                 multi_threaded_cfs: PhantomData,
             })
     }
 }

carllin commented Apr 16, 2021

@ryoqun I don't think that call to ffi::rocksdb_column_family_handle_destroy(cf.inner); is necessarily safe if another thread is holding a reference to the column family with that name. The thread calling drop_cf() will destroy the underlying RocksDb handle and trigger cleanup, but another one of our service threads holding a reference to the rust-rocksdb ColumnFamily object may still try to perform another operation on that object, which can lead to undefined behavior.

carllin commented Apr 17, 2021

I've updated some thoughts on Work Item 3, which is the meat of the change; please take a look when you guys get the chance!

ryoqun commented Apr 19, 2021

okay, I'm creating the 2nd upstream pr for this...

sakridge commented Apr 19, 2021

Why not go to this solution?: #16234

@CriesofCarrots

Why not go to this solution?: #16234

I was under the impression that @ryoqun wanted to implement the drop_cf() solution essentially as a stopgap to resolve compaction stalls while #16234 is in design/development. However, it seems like Work Item 3 above is almost as complex as the accounts db solution. I tend to think we should go all-in on the latter.

Work Items 1 and 2 will still be useful for improving TransactionStatus column handling, and thus RocksDb perf for API nodes.

carllin commented Apr 19, 2021

Yeah agreed, #16234 would be ideal, though I think it's just a trade-off on relative timelines for a stopgap. Here's my estimate:

I think the RocksDb drop_cf work could take around two weeks.

I think completely transitioning the data shreds to AccountsDb will take much longer.

For one, AccountsDb is a much more complex piece of code, so it requires time to gain enough understanding to modify it and avoid the pitfalls/subtleties.

There's also a large number of work items:

  1. Designing an index for AccountsDb that doesn't live in memory
  2. Rewiring all the insert/erasure logic to use AccountsDb
  3. Implementing all the API's that Rocks currently implements for data shreds - iterators, simulating atomic writes with other column families, figuring out what consistency guarantees will break.
  4. Read/write perf optimizations. Unlike with accounts in AccountsDb, we cannot afford to keep all the shreds in memory, so to achieve similar performance and keep shreds for recent slots around in memory, we'll have to manage a flushing policy, which is new.
  5. Consistency with clean. When you start dropping shreds in AccountsDb, you can no longer rely on batch writes to update other RocksDb column families atomically. You can also potentially break other APIs, like iterators that are reading shreds, because we don't have Rocks' shadowing mechanism.
  6. Knowing AccountsDb, some other unforeseen complexity 👻

I think this work will all have to be done eventually, but it's orders of magnitude more work than the column family dropping work. I could see this taking a couple weeks of work and testing for one person, depending on the complexity of the interactions between the components.

I know @ryoqun was looking for a stopgap in the meantime, hence this proposal to scope it out

ryoqun commented Apr 20, 2021

@sakridge thanks for raising the obvious question; I should have given more context. As @CriesofCarrots and @carllin said, I just wanted the stalls to be fixed asap because partners are hitting this problem on a daily basis.

So, this drop_cf idea came up as a stopgap alongside the other custom compaction_filter idea. But it seems the drop_cf idea isn't as easy to implement as originally hoped.

I think transitioning to AccountsDb will take more time (@steviez how's the work for #16234 going?).

As a last attempt, let me try the compaction_filter stopgap within a day.

steviez commented Apr 20, 2021

how's the work for #16234 going?

No actual progress on this yet - have been working on other smaller items as stepping stones.

At this point, I don't think I can accurately gauge how long the work will take for either approach, or how "bad" the issue is for the partners (obviously any issue is non-ideal, but is this a small/medium/large disruption?). As such, I was looking for opinions and guidance on which path forward seems better, factoring in all the risks/unknowns and the fact that people are hitting this in the wild right now.

ryoqun commented Apr 20, 2021

@steviez thanks for answering. This will be a challenging and fun engineering task for you, so using stepping stones is a realistic strategy. I think this problem is bad enough to justify assigning you to this issue even while there are many other open issues we'd like to tackle. :)

The significance of the problem is that a validator stops processing entirely for ~1 hour. So we'd like to fix the stall one way or another; the sooner, the better. Currently, no one on the team knows of an easy-to-implement and robust solution. There is some dilemma, unfortunately, so you may see some confusion. :) In other words, it's completely up to you to decide how to proceed in tackling this problem.

btw, have you looked at our previous attempt at this problem? #9366. Although the PR is pretty old and was reverted, it might be useful for gauging what needs to change.

steviez commented Apr 20, 2021

Currently, no one on the team knows of an easy-to-implement and robust solution

@ryoqun Thanks and understood - it sounds like this may be a situation where it's better to just start prototyping and fail quickly instead of spinning my wheels over-analyzing which approach may be better.

Have you looked at our previous attempt at this problem?

I hadn't seen this - thanks for sending over.

@sakridge

Yeah agreed, #16234 would be ideal, though I think it's just a trade-off on relative timelines for a stopgap. Here's my estimate:

I think the RocksDb drop_cf work could take around two weeks.

That's if there are no more problems that we run into and have to debug with Rocks. It sounds like we ran into a very basic issue in that dropping columns does not free the memory, which kind of tells me this path is not well tested. I don't really see this solution as being that straightforward.

I think #16602 might be a decent mitigation while we work on this. If we can get that merged, the amount of data in RocksDb should be reduced, and thus there will be less data to compact.

ryoqun commented Jul 27, 2021

@carllin I think this issue can be closed now? It seems the write stalls are mostly fixed by now; or are there still some concerns?

Work Item 1: Upstream change to destroy column family handles

Also, I think I've finished this part anyway. The relevant fix has finally landed upstream and is available in our codebase starting from #18904.

So, if we really wanted to resume this design, we could start, I guess. (Not saying we should, to be sure. xD)

carllin commented Jul 27, 2021

@ryoqun, agreed, closed!

@carllin carllin closed this as completed Jul 27, 2021
@steviez steviez removed their assignment Jul 27, 2021
github-actions bot commented Apr 2, 2022

This issue has been automatically locked since there has not been any activity in past 7 days after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 2, 2022