Panic on restarting oklog docker container multiple times. #102
Want to help on this, but I haven't been able to reproduce it again. Any intuitions about what could be the problem?
It looks like the operative panic is here:

```go
// Per-segment state.
writeSegment, err := dst.Create()
if err != nil {
	return n, err
}
defer func() {
	// Don't leak active segments.
	if writeSegment != nil {
		if deleteErr := writeSegment.Delete(); deleteErr != nil {
			panic(deleteErr) // <-------------
		}
	}
}()
```
In the production code path, the abstract WriteSegment interface is implemented by the concrete fileWriteSegment type. Here's the code that creates them:

```go
func (fl *fileLog) Create() (WriteSegment, error) {
	filename := filepath.Join(fl.root, fmt.Sprintf("%s%s", uuid.New(), extActive))
	f, err := fl.filesys.Create(filename)
	if err != nil {
		return nil, err
	}
	return &fileWriteSegment{fl.filesys, f}, nil
}
```
And for completeness, here's the complete definition and method set on that type:

```go
type fileWriteSegment struct {
	fs fs.Filesystem
	f  fs.File
}

func (w fileWriteSegment) Write(p []byte) (int, error) {
	return w.f.Write(p)
}

// Close the segment and make it available for query.
func (w fileWriteSegment) Close(low, high ulid.ULID) error {
	if err := w.f.Close(); err != nil {
		return err
	}
	oldname := w.f.Name()
	oldpath := filepath.Dir(oldname)
	newname := filepath.Join(oldpath, fmt.Sprintf("%s-%s%s", low.String(), high.String(), extFlushed))
	if w.fs.Exists(newname) {
		return errors.Errorf("file %s already exists", newname)
	}
	return w.fs.Rename(oldname, newname)
}
```
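As an aside on the segment lifecycle encoded in these names: Create writes a randomly named .active file, and Close renames it to a file spanning the ULID range of its records. A small sketch of the resulting name, assuming extFlushed is ".flushed" and the v1 github.com/oklog/ulid package (both assumptions on my part):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/oklog/ulid"
)

func main() {
	entropy := rand.New(rand.NewSource(time.Now().UnixNano()))

	// Stand-ins for the lowest and highest record ULIDs in a segment.
	low := ulid.MustNew(ulid.Timestamp(time.Now().Add(-time.Minute)), entropy)
	high := ulid.MustNew(ulid.Timestamp(time.Now()), entropy)

	// Mirrors the rename in Close: <low>-<high>.flushed
	fmt.Printf("%s-%s.flushed\n", low.String(), high.String())
}
```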
And the Delete method:

```go
// Delete the segment.
func (w fileWriteSegment) Delete() error {
	if err := w.f.Close(); err != nil {
		return err
	}
	return w.fs.Remove(w.f.Name())
}
```

The task is to figure out how the underlying fs.File gets closed in between the Create and the Delete. There may be a bug in the function itself, but I'd put that probability relatively low. I suspect instead there is some interference on the filesystem between the compactor code and the store query code. If that's true, then the correct way to deal with this error is for the query code to fail more gracefully. Hopefully this is something to start with!
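If the suspicion above is right, the failure mode is a double close: something closes the segment's file, then Delete calls w.f.Close() again, gets an error, and the deferred cleanup turns that error into the panic. A minimal, self-contained sketch of the double-close behavior, assuming the production fs.File is backed by a plain *os.File:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

func main() {
	f, err := os.CreateTemp("", "segment-*.active")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	// First Close succeeds, as in fileWriteSegment.Close.
	fmt.Println("first close:", f.Close())

	// Second Close fails with an error wrapping os.ErrClosed; this is
	// the error fileWriteSegment.Delete would return, and the one the
	// deferred cleanup in the compactor panics on.
	err = f.Close()
	fmt.Println("second close:", err, "ErrClosed:", errors.Is(err, os.ErrClosed))
}
```

If that's the mechanism, one hypothetical hardening (a sketch, not a patch from this thread) is for Delete to treat an already-closed file as non-fatal, since the point is to remove it anyway:

```go
// Delete the segment. Hypothetical tolerant variant: an error from
// Close that wraps os.ErrClosed is ignored, since the file is being
// removed regardless. Assumes the errors and os packages are imported.
func (w fileWriteSegment) Delete() error {
	if err := w.f.Close(); err != nil && !errors.Is(err, os.ErrClosed) {
		return err
	}
	return w.fs.Remove(w.f.Name())
}
```

Whether the fix belongs in Delete or in the calling query/compaction path depends on which side is wrongly touching the segment, which is exactly the open question.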
This seems to be a heisenbug. But now I am repeatedly seeing another panic with the same setup:

```
ts=2018-02-22T14:07:07.839162625Z level=info cluster_bind=0.0.0.0:7659
goroutine 44 [running]:
```

Will create a separate issue for this after investigation.
I have a docker-compose file set up like this:

After multiple `docker-compose up` and `docker-compose down` commands (there doesn't seem to be any consistency in the number of ups and downs), the oklog container panics. I found the issue on the 0.3.0 release and rebuilt oklog:master to check whether it's already fixed.
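In case it helps reproduction, the cycle amounts to something like this (an illustrative sketch only; the settle time is a guess, and the failure doesn't appear after any fixed number of restarts):

```sh
# Cycle the compose stack until the oklog container panics;
# the number of up/down iterations needed seems to vary.
while true; do
  docker-compose up -d
  sleep 30   # arbitrary settle time
  docker-compose down
done
```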
Thank you.