
UnsafeRange panicking during shutdown #17223

Open

ahrtr opened this issue Jan 9, 2024 · 5 comments · Fixed by #17235

ahrtr (Member) commented Jan 9, 2024

Bug report criteria

What happened?

Test case TestMaintenanceSnapshotCancel failed with a panic.

Refer to https://github.com/etcd-io/etcd/actions/runs/7463174417/job/20307221683?pr=17220

Based on the log, the cause appears to be that the backend had already been closed (the member was being stopped) before the snapshot operation called b.batchTx.Commit():

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa6c27c]

goroutine 19029 [running]:
go.etcd.io/bbolt.(*Cursor).seek(0xc00044a578, {0x1aa05d4, 0x4, 0x4})
	/home/runner/go/pkg/mod/go.etcd.io/bbolt@v1.3.8/cursor.go:159 +0x7c
go.etcd.io/bbolt.(*Bucket).Bucket(0xc0009da638, {0x1aa05d4, 0x4, 0x4})
	/home/runner/go/pkg/mod/go.etcd.io/bbolt@v1.3.8/bucket.go:105 +0x10c
go.etcd.io/bbolt.(*Tx).Bucket(...)
	/home/runner/go/pkg/mod/go.etcd.io/bbolt@v1.3.8/tx.go:102
go.etcd.io/etcd/server/v3/storage/backend.(*batchTx).UnsafeRange(0xc0017a7da0, {0x12e17b0, 0x1b0dc40}, {0x1aa1670, 0x10, 0x10}, {0x0, 0x0, 0x0}, 0xc4798?)
	/home/runner/actions-runner/_work/etcd/etcd/server/storage/backend/batch_tx.go:174 +0xa0
go.etcd.io/etcd/server/v3/storage/schema.UnsafeReadConsistentIndex({0xffff39290758, 0xc0017a7da0})
	/home/runner/actions-runner/_work/etcd/etcd/server/storage/schema/cindex.go:41 +0xa0
go.etcd.io/etcd/server/v3/storage/schema.unsafeUpdateConsistentIndex.func1()
	/home/runner/actions-runner/_work/etcd/etcd/server/storage/schema/cindex.go:80 +0x54
go.etcd.io/etcd/client/pkg/v3/verify.Verify(0xc00044aa28)
	/home/runner/actions-runner/_work/etcd/etcd/client/pkg/verify/verify.go:71 +0x44
go.etcd.io/etcd/server/v3/storage/schema.unsafeUpdateConsistentIndex({0x12e5920, 0xc0017a7da0}, 0x7, 0x2, 0x0)
	/home/runner/actions-runner/_work/etcd/etcd/server/storage/schema/cindex.go:79 +0x1d4
go.etcd.io/etcd/server/v3/storage/schema.UnsafeUpdateConsistentIndex(...)
	/home/runner/actions-runner/_work/etcd/etcd/server/storage/schema/cindex.go:67
go.etcd.io/etcd/server/v3/etcdserver/cindex.(*consistentIndex).UnsafeSave(0xc001727800, {0x12e5920, 0xc0017a7da0})
	/home/runner/actions-runner/_work/etcd/etcd/server/etcdserver/cindex/cindex.go:121 +0x68
go.etcd.io/etcd/server/v3/storage.(*BackendHooks).OnPreCommitUnsafe(0xc001a0d9e0, {0x12e5920?, 0xc0017a7da0})
	/home/runner/actions-runner/_work/etcd/etcd/server/storage/hooks.go:45 +0x64
go.etcd.io/etcd/server/v3/storage/backend.(*batchTxBuffered).unsafeCommit(0xc0017a7da0, 0x0)
	/home/runner/actions-runner/_work/etcd/etcd/server/storage/backend/batch_tx.go:342 +0xfc
go.etcd.io/etcd/server/v3/storage/backend.(*batchTxBuffered).commit(0xc0017a7da0, 0xdc?)
	/home/runner/actions-runner/_work/etcd/etcd/server/storage/backend/batch_tx.go:335 +0x70
go.etcd.io/etcd/server/v3/storage/backend.(*batchTxBuffered).Commit(0xc0017a7da0)
	/home/runner/actions-runner/_work/etcd/etcd/server/storage/backend/batch_tx.go:322 +0x3c
go.etcd.io/etcd/server/v3/storage/backend.(*backend).Snapshot(0xc000329140)
	/home/runner/actions-runner/_work/etcd/etcd/server/storage/backend/backend.go:331 +0x54
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*maintenanceServer).Snapshot(0xc0029ace40, 0x12e1120?, {0x12e6550, 0xc0006302b0})
	/home/runner/actions-runner/_work/etcd/etcd/server/etcdserver/api/v3rpc/maintenance.go:111 +0x110
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*authMaintenanceServer).Snapshot(0xc0019d7b60, 0x103b4c0?, {0x12e6550, 0xc0006302b0})
	/home/runner/actions-runner/_work/etcd/etcd/server/etcdserver/api/v3rpc/maintenance.go:296 +0xb8
go.etcd.io/etcd/api/v3/etcdserverpb._Maintenance_Snapshot_Handler({0x1004a40?, 0xc0019d7b60}, {0x12e3fc8, 0xc0028e4c00})
	/home/runner/actions-runner/_work/etcd/etcd/api/etcdserverpb/rpc.pb.go:7620 +0xe8
github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).StreamServerInterceptor.func4({0x1004a40, 0xc0019d7b60}, {0x12e3d80?, 0xc0005681e0}, 0xc0028e4bd0, 0x111e8a8)
	/home/runner/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:121 +0x128
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.ChainStreamServer.func7.1.1({0x1004a40, 0xc0019d7b60}, {0x12e3d80, 0xc0005681e0})
	/home/runner/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49 +0x70
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newStreamInterceptor.func1({0x1004a40, 0xc0019d7b60}, {0x12e3d80, 0xc0005681e0}, 0xc0028e4bd0, 0xc001992240)
	/home/runner/actions-runner/_work/etcd/etcd/server/etcdserver/api/v3rpc/interceptor.go:258 +0x560
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.ChainStreamServer.func7.1.1({0x1004a40, 0xc0019d7b60}, {0x12e3d80, 0xc0005681e0})
	/home/runner/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49 +0x70
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.ChainStreamServer.func7({0x1004a40, 0xc0019d7b60}, {0x12e3d80, 0xc0005681e0}, 0xc0028e4bd0, 0x111e8a8)
	/home/runner/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58 +0x100
google.golang.org/grpc.(*Server).processStreamingRPC(0xc00162c400, {0x12e1120, 0xc00185e870}, {0x12e8340, 0xc000753380}, 0xc002846b40, 0xc0028ac2d0, 0x1b0c680, 0x0)
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1673 +0x16b8
google.golang.org/grpc.(*Server).handleStream(0xc00162c400, {0x12e8340, 0xc000753380}, 0xc002846b40)
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1787 +0x12f0
google.golang.org/grpc.(*Server).serveStreams.func2.1()
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1016 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 19027
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1027 +0x1d8

What did you expect to happen?

No panic when processing any client request.

How can we reproduce it (as minimally and precisely as possible)?

Write an integration test that stops a member before calling the snapshot API.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
# paste output here

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

fuweid (Contributor) commented Jan 14, 2024

UPDATED:

When the server closes the backend, the backend stops the background commit goroutine and resets the read transaction:

func (rt *readTx) reset() {
	rt.buf.reset()
	rt.buckets = make(map[BucketID]*bolt.Bucket)
	rt.tx = nil
	rt.txWg = new(sync.WaitGroup)
}

However, baseReadTx neither checks that tx is non-nil in UnsafeRange nor updates txWg:

if t.backend.readTx.tx != nil {
	// wait all store read transactions using the current boltdb tx to finish,
	// then close the boltdb tx
	go func(tx *bolt.Tx, wg *sync.WaitGroup) {
		wg.Wait()
		if err := tx.Rollback(); err != nil {
			t.backend.lg.Fatal("failed to rollback tx", zap.Error(err))
		}
	}(t.backend.readTx.tx, t.backend.readTx.txWg)
	t.backend.readTx.reset()
}

So tx can be set to nil while UnsafeRange is still in progress.

@serathius changed the title from "Snapshot operation panicking due to backend closed" to "UnsafeRange panicking during shutdown" on Apr 5, 2024
serathius (Member) commented

Another case with etcd-io/bbolt#715

serathius (Member) commented
@fuweid would you be able to propose a fix? The issue showed up in robustness tests, which I would prefer to keep flake free.

fuweid (Contributor) commented Apr 5, 2024

Hi @serathius, sure. I will file a pull request later.

fuweid (Contributor) commented Apr 15, 2024

Hi @ahrtr @serathius

Sorry for taking so long on this issue. This issue has been fixed at the gRPC layer by grpc/grpc-go@61eab37 (released in v1.61.0 to fix a regression): all requests are tracked by the handlerWG wait group. When we call GracefulStop, it blocks until all in-flight requests are closed, including streaming RPCs like Watch/Snapshot/LeaseRenew.

We call GracefulStop when we receive a SIGTERM signal, except in cmux mode.

etcd/server/embed/etcd.go

Lines 474 to 493 in a7f5d4b

	ch := make(chan struct{})
	go func() {
		defer close(ch)
		// close listeners to stop accepting new connections,
		// will block on any existing transports
		ss.grpc.GracefulStop()
	}()
	// wait until all pending RPCs are finished
	select {
	case <-ch:
	case <-ctx.Done():
		// took too long, manually close open transports
		// e.g. watch streams
		ss.grpc.Stop()
		// concurrent GracefulStop should be interrupted
		<-ch
	}
}

We don't need to set up a timeout for draining, because GracefulStop always blocks until all the RPCs have finished.
Since GracefulStop isn't applied in cmux mode, I filed pull request #17790 to support graceful shutdown in cmux mode. Even if we run etcd without #17790, only ongoing Snapshot/Watch RPCs might panic, because we stop the applier channel and the scheduler before stopping the backend, and none of the unary RPCs will use the closed backend.

sched.Stop()
// wait for goroutines before closing raft so wal stays open
s.wg.Wait()
s.SyncTicker.Stop()
// must stop raft after scheduler-- etcdserver can leak rafthttp pipelines
// by adding a peer after raft stops the transport
s.r.stop()
s.Cleanup()

And #17757 is also an enhancement for the failpoint tests. PTAL. Thanks.


Side note: I was using an old version (1.60.1), so my previous approach was to introduce a txRef object to maintain a reference count of all open transactions. The caller must call txPut to release the reference explicitly, and if the backend has been closed, ReadTx/ConcurrentReadTx/BatchTx should return a "closed" error. However, in our codebase, both the mvcc and auth layers assume the backend is always valid; ReadTx/ConcurrentReadTx/BatchTx are essentially direct pointers, and UnsafeRange isn't designed to return an error. I tried to update all the interfaces to force them to return errors. It passed all the e2e and unit tests, but there were too many changes.

If the server layer can track active RPCs, that is better. So I revisited the gRPC code and found that WaitForHandlers can help us.

type txRef struct {
	sync.RWMutex
	wg sync.WaitGroup
}

type Backend interface {
	ReadTx() (ReadTx, TxRefReleaseFunc, error)
	ConcurrentReadTx() (ReadTx, TxRefReleaseFunc, error)
	BatchTx() (BatchTx, TxRefReleaseFunc, error)
}

tx, txPut, err := ReadTx() // ConcurrentReadTx() / BatchTx()
if err != nil {
	return err
}
defer txPut()
..
