This repository has been archived by the owner on Sep 8, 2018. It is now read-only.

Panic on restarting oklog docker container multiple times. #102

Open

adnaan opened this issue Feb 2, 2018 · 3 comments
adnaan commented Feb 2, 2018

I have a docker-compose file set up like this:

...
  oklog:
    build: ./oklog
    container_name: ${OKLOG_CONTAINER_NAME}
    command: ingeststore -store.segment-replication-factor 1
    ports:
      - 7651:7651
      - 7650:7650
      - 7653:7653
      - 7659:7659
    volumes:
      - ${OSX_LOG_DIR}:/data

  logspout: 
    image: gliderlabs/logspout
    container_name: ${LOGSPOUT_CONTAINER_NAME}
    command: raw+tcp://${OKLOG_CONTAINER_NAME}:7651
    depends_on:
      - oklog
    volumes: 
      - /var/run/docker.sock:/var/run/docker.sock
    links:
      - oklog

After multiple docker-compose up and docker-compose down commands (there doesn't seem to be any consistency in the number of ups and downs required), the oklog container panics with:

$ docker logs -f oklog
ts=2018-02-02T19:16:00.101587594Z level=info cluster_bind=0.0.0.0:7659
ts=2018-02-02T19:16:00.110157207Z level=info user_bind_host=0.0.0.0 user_advertise_host= calculated_advertise_ip=172.18.0.2
ts=2018-02-02T19:16:00.112425772Z level=info fast=tcp://0.0.0.0:7651
ts=2018-02-02T19:16:00.112505032Z level=info durable=tcp://0.0.0.0:7652
ts=2018-02-02T19:16:00.112891239Z level=info bulk=tcp://0.0.0.0:7653
ts=2018-02-02T19:16:00.113107131Z level=info API=tcp://0.0.0.0:7650
ts=2018-02-02T19:16:00.161997474Z level=info ingest_path=data/ingest
ts=2018-02-02T19:16:07.109384344Z level=info store_path=data/store
panic: close data/store/bcd73f75-18ce-415e-ad9b-bf4025fd7755.active: file already closed

goroutine 47 [running]:
github.com/oklog/oklog/pkg/store.mergeRecordsToLog.func1(0xc420455b38)
	/Users/adnaan/code/gopath/src/github.com/oklog/oklog/pkg/store/read.go:150 +0x65
github.com/oklog/oklog/pkg/store.mergeRecordsToLog(0xd6a200, 0xc42007a410, 0x8000000, 0xc42011d180, 0x34, 0x34, 0x100001a0, 0xd5e780, 0xc420166040)
	/Users/adnaan/code/gopath/src/github.com/oklog/oklog/pkg/store/read.go:239 +0xa9e
github.com/oklog/oklog/pkg/store.(*Compacter).compact(0xc420068b40, 0xa4e1e9, 0xb, 0xc42003edf0, 0x0, 0x0, 0x0)
	/Users/adnaan/code/gopath/src/github.com/oklog/oklog/pkg/store/compact.go:126 +0x2f7
github.com/oklog/oklog/pkg/store.(*Compacter).Run.func1()
	/Users/adnaan/code/gopath/src/github.com/oklog/oklog/pkg/store/compact.go:52 +0x71
github.com/oklog/oklog/pkg/store.(*Compacter).Run(0xc420068b40)
	/Users/adnaan/code/gopath/src/github.com/oklog/oklog/pkg/store/compact.go:62 +0x2f5
main.runIngestStore.func14(0x0, 0x0)
	/Users/adnaan/code/gopath/src/github.com/oklog/oklog/cmd/oklog/ingeststore.go:441 +0x2a
github.com/oklog/oklog/vendor/github.com/oklog/run.(*Group).Run.func1(0xc420068ba0, 0xc420040ec0, 0xc420040ef0)
	/Users/adnaan/code/gopath/src/github.com/oklog/oklog/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/oklog/oklog/vendor/github.com/oklog/run.(*Group).Run
	/Users/adnaan/code/gopath/src/github.com/oklog/oklog/vendor/github.com/oklog/run/group.go:37 +0xa8

I hit the issue with the 0.3.0 release and rebuilt oklog from master to check whether it had already been fixed.

Thank you.

adnaan (Author) commented Feb 4, 2018

I want to help with this, but I haven't been able to reproduce it again. Any intuitions about what the problem could be?

peterbourgon (Member) commented
It looks like the operative panic is here:

	// Per-segment state.
	writeSegment, err := dst.Create()
	if err != nil {
		return n, err
	}
	defer func() {
		// Don't leak active segments.
		if writeSegment != nil {
			if deleteErr := writeSegment.Delete(); deleteErr != nil {
				panic(deleteErr) // <-------------
			}
		}
	}()

In the production code path, the abstract WriteSegment interface is implemented by the concrete fileWriteSegment type; here's the code that creates them:

func (fl *fileLog) Create() (WriteSegment, error) {
	filename := filepath.Join(fl.root, fmt.Sprintf("%s%s", uuid.New(), extActive))
	f, err := fl.filesys.Create(filename)
	if err != nil {
		return nil, err
	}
	return &fileWriteSegment{fl.filesys, f}, nil
}

And for completeness, here's the full definition and method set of that type:

type fileWriteSegment struct {
	fs fs.Filesystem
	f  fs.File
}

func (w fileWriteSegment) Write(p []byte) (int, error) {
	return w.f.Write(p)
}

// Close the segment and make it available for query.
func (w fileWriteSegment) Close(low, high ulid.ULID) error {
	if err := w.f.Close(); err != nil {
		return err
	}
	oldname := w.f.Name()
	oldpath := filepath.Dir(oldname)
	newname := filepath.Join(oldpath, fmt.Sprintf("%s-%s%s", low.String(), high.String(), extFlushed))
	if w.fs.Exists(newname) {
		return errors.Errorf("file %s already exists", newname)
	}
	return w.fs.Rename(oldname, newname)
}

// Delete the segment.
func (w fileWriteSegment) Delete() error {
	if err := w.f.Close(); err != nil {
		return err
	}
	return w.fs.Remove(w.f.Name())
}
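
For what it's worth, the error string in the panic ("close ...: file already closed") is exactly what the standard library produces when an *os.File is closed a second time, which suggests Delete's f.Close() is running against a file handle that something else already closed. A minimal standalone illustration, using a plain os.File rather than oklog's fs abstraction:

package main

import (
	"fmt"
	"os"
)

func main() {
	f, err := os.Create("example.active")
	if err != nil {
		panic(err)
	}
	// The first Close succeeds.
	if err := f.Close(); err != nil {
		panic(err)
	}
	// A second Close fails with the same shape of error as in the report:
	// "close example.active: file already closed"
	if err := f.Close(); err != nil {
		fmt.Println(err)
	}
	os.Remove("example.active")
}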

The task is to figure out how the underlying fs.File gets closed in between the Create and the Delete. There may be a bug in the function itself, but I'd put the probability of that relatively low. I suspect instead there is some interference on the filesystem between the compactor code and the store query code. If that's true, then the correct way to deal with this error is for the query code to fail more gracefully.
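
As a sketch of what "fail more gracefully" could look like (purely an illustration, not the project's actual fix; it assumes the errors and os packages are imported, uses the modern errors.Is helper, and assumes the concrete filesystem surfaces os.ErrClosed), the deferred cleanup could tolerate an already-closed segment instead of panicking:

	defer func() {
		// Don't leak active segments, but don't take down the whole
		// process if the underlying file has already been closed by
		// someone else.
		if writeSegment != nil {
			if deleteErr := writeSegment.Delete(); deleteErr != nil {
				if errors.Is(deleteErr, os.ErrClosed) {
					// Hypothetical handling: the segment was already
					// closed (and presumably cleaned up) elsewhere, so
					// ignore it and move on instead of panicking.
					return
				}
				panic(deleteErr)
			}
		}
	}()

Whether that is the right behaviour, or whether the interference between the compactor and the query path should be fixed at the source, is exactly the open question above.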

Hopefully this is something to start with!

adnaan (Author) commented Feb 22, 2018

This seems to be a heisenbug. But now I am repeatedly seeing another panic with the same setup:

ts=2018-02-22T14:07:07.839162625Z level=info cluster_bind=0.0.0.0:7659
ts=2018-02-22T14:07:07.850407666Z level=info fast=tcp://0.0.0.0:7651
ts=2018-02-22T14:07:07.850480233Z level=info durable=tcp://0.0.0.0:7652
ts=2018-02-22T14:07:07.850518041Z level=info bulk=tcp://0.0.0.0:7653
ts=2018-02-22T14:07:07.850546563Z level=info API=tcp://0.0.0.0:7650
ts=2018-02-22T14:07:07.851144625Z level=info ingest_path=data/ingest
ts=2018-02-22T14:07:07.851653699Z level=info store_path=data/store
panic: ulid: bad data size when unmarshaling

goroutine 44 [running]:
github.com/oklog/ulid.MustParse(0xc4202413cb, 0x8, 0x0, 0x0)
	/Users/xla/dev/src/github.com/oklog/ulid/ulid.go:98 +0x88
github.com/oklog/oklog/pkg/store.(*fileLog).Overlapping(0xc420109d40, 0x40, 0x8bc93b, 0xc4203d6bc8, 0xc4203d6bc0, 0xc42012bb90)
	/Users/xla/dev/src/github.com/oklog/oklog/pkg/store/file_log.go:156 +0x70b
github.com/oklog/oklog/pkg/store.(Log).Overlapping-fm(0xe00000040, 0xa93648, 0xc4203745a0, 0xa5af24, 0xb)
	/Users/xla/dev/src/github.com/oklog/oklog/pkg/store/compact.go:48 +0x2f
github.com/oklog/oklog/pkg/store.(*Compacter).compact(0xc4203745a0, 0xa5af24, 0xb, 0xc4203d6df8, 0x0, 0x0, 0x0)
	/Users/xla/dev/src/github.com/oklog/oklog/pkg/store/compact.go:86 +0xf0
github.com/oklog/oklog/pkg/store.(*Compacter).Run.func1()
	/Users/xla/dev/src/github.com/oklog/oklog/pkg/store/compact.go:48 +0x7a
github.com/oklog/oklog/pkg/store.(*Compacter).Run(0xc4203745a0)
	/Users/xla/dev/src/github.com/oklog/oklog/pkg/store/compact.go:58 +0x2b1
main.runIngestStore.func14(0x0, 0x0)
	/Users/xla/dev/src/github.com/oklog/oklog/cmd/oklog/ingeststore.go:429 +0x2a
github.com/oklog/oklog/pkg/group.(*Group).Run.func1(0xc420374660, 0xc4202e91d0, 0xc4202e91e0)
	/Users/xla/dev/src/github.com/oklog/oklog/pkg/group/group.go:38 +0x27
created by github.com/oklog/oklog/pkg/group.(*Group).Run
	/Users/xla/dev/src/github.com/oklog/oklog/pkg/group/group.go:39 +0xa8

Will create a separate issue for this after investigation.
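
For context on this second panic: in github.com/oklog/ulid, MustParse panics with "ulid: bad data size when unmarshaling" whenever its input is not a valid 26-character ULID encoding, while Parse returns the error instead. The stack above appears to show Overlapping handing it an 8-character string, presumably taken from an unexpected segment filename. A minimal, self-contained illustration (the short filename fragment below is made up):

package main

import (
	"fmt"

	"github.com/oklog/ulid"
)

func main() {
	// Hypothetical malformed filename component; 8 characters is not a
	// valid 26-character ULID encoding.
	s := "01BX5ZZK"

	// Parse reports the problem as an error...
	if _, err := ulid.Parse(s); err != nil {
		fmt.Println(err) // ulid: bad data size when unmarshaling
	}

	// ...while MustParse panics with the same message, which is what the
	// trace above shows happening inside fileLog.Overlapping.
	_ = ulid.MustParse(s)
}

Validating segment filenames (or parsing them with Parse and handling the error) would turn this into a recoverable condition rather than a crash, though that is only a direction to investigate, not a confirmed fix.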
