Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nil panic on remove #127

Closed
wants to merge 1 commit into from

Conversation

zachbadgett
Copy link

Fix for panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6e4a5de]
goroutine 444 [running]:
github.com/hashicorp/golang-lru/v2/simplelru.(*lruList[...]).remove(...)
	github.com/hashicorp/golang-lru/v2@v2.0.2/simplelru/list.go:89
github.com/hashicorp/golang-lru/v2/simplelru.(*LRU[...]).removeElement(0xc002244180?, 0xc0e49d408c?)
	github.com/hashicorp/golang-lru/v2@v2.0.2/simplelru/lru.go:159 +0x1e
github.com/hashicorp/golang-lru/v2/simplelru.(*LRU[...]).Remove(0xc19b580, {0xc0e49d408c, 0x547e17?})
	github.com/hashicorp/golang-lru/v2@v2.0.2/simplelru/lru.go:98 +0x57
github.com/hashicorp/golang-lru/v2.(*Cache[...]).Remove(0xc1950c0, {0xc0e49d408c, 0xc})
	github.com/hashicorp/golang-lru/v2@v2.0.2/lru.go:172 +0x5f

Not really sure how this panic happens and I can only reproduce it in a high concurrent environment. From what I can tell the panic happens when e.prev is nil and the only time e.prev is set to nil is when it's being removed from the list. removeElement is suppose to delete the key from items map in LRU, but somehow we still seem to get the entry back. If it is an entry we've already removed, the list would be set to nil, adding a condition to check if entry has the correct list would avoid this panic.

Go version: 1.20.3/1.20.2

@hashicorp-cla
Copy link

hashicorp-cla commented Apr 17, 2023

CLA assistant check
All committers have signed the CLA.

@jefferai
Copy link
Member

Are you using simplelru directly or are you using via the main lru.Cache object?

@zachbadgett
Copy link
Author

zachbadgett commented Apr 18, 2023

I'm using lru.Cache, it's in the panic output. I also have seen this behavior with the 2q implementation.

This PR does fix the panic, I moved to a local copy that has the fix applied and haven't seen it since. However, I would like to be able to explain how it's happening...I just haven't been able to find the reason.

@jefferai
Copy link
Member

I agree about the cause of the nil, but what's running around my head is that the operations calling the list should be behind the mutex in lru.Cache. It seems like some operation affecting the list is being called outside of the mutex. I don't have time to dig in at the moment but I'd like to try to figure out the root cause.

@zachbadgett
Copy link
Author

Yeah, it's puzzling. The panic seemed to be fairly random, my application could run between 1-8 hours before seeing the issue. The LRU was being shared between ~2000 goroutines doing about 40k/second in operations. In a smaller environment, it never saw the panic.

@jefferai
Copy link
Member

What version of Go is your application built with?

@jefferai
Copy link
Member

Took another look at the code. I can't figure out why e.list would be nil, although there are a few other checks for e.list being nil elsewhere in the code -- but that might purely be safety.

I also wonder if there is some actual issue with the mutex since if the mutex is working properly I can't find any path where it wouldn't be protected. That's why I'm interested in your Go version, to see if there have been any known bugs fixed between then and now.

In your code, you don't happen to be passing around an lru.Cache object as a non-pointer, do you? The mutex contained within is not a pointer, so if the lru.Cache object is not a pointer then that could certainly cause issues.

Finally, I'm not sure that the PR is correct because the panic is on e.prev.next = e.next. Checking for e.list being nil feels more like a symptom of a different issue -- for instance, e.prev might be nil, or e.prev.next. In my mind the most likely thing happening is that e itself is nil but I have yet to chase down if and how that might happen.

@zachbadgett
Copy link
Author

zachbadgett commented Apr 20, 2023

What version of Go is your application built with?

I included the Go version in the description, saw it in both 1.20.3 and 1.20.2

In your code, you don't happen to be passing around an lru.Cache object as a non-pointer, do you? The mutex contained within is not a pointer, so if the lru.Cache object is not a pointer then that could certainly cause issues.

lru.Cache is passed as a pointer.

Finally, I'm not sure that the PR is correct because the panic is on e.prev.next = e.next. Checking for e.list being nil feels more like a symptom of a different issue -- for instance, e.prev might be nil, or e.prev.next. In my mind the most likely thing happening is that e itself is nil but I have yet to chase down if and how that might happen.

e is not nil, this change is currently running in my environments without nil dereference issues.
e.prev.next being nil wouldn't cause the panic since next is what is being assigned and since e is not nil then the panic is coming e.prev not having a reference to next.

Checking for e.list being nil feels more like a symptom of a different issue

I do feel the same way, perhaps there is a flaw in the list itself. #71 is a similar issue but it uses container/list

@jefferai
Copy link
Member

It's not clear to me that e isn't nil; if the issue is that sometimes the value being passed to remove is nil when it shouldn't be, then e could be nil. Also note that the panic can't be that e.prev doesn't have a reference to next, because assigning a nil pointer is fine. If it's not e, then e.prev has to be nil.

The list here is a modified container/list that was modified to accept generics. #71 is a good callout because it means either container/list has an issue or the issue does not lie in the list code itself, which lends some credence to the idea that maybe it's being passed a nil e to the remove function.

@zachbadgett
Copy link
Author

zachbadgett commented May 1, 2023

Was able to reproduce this by changing the underlying key pointer. Changing the underlying data of the key happens outside the mutex and can be changed after the entry is pulled for remove but before delete(c.items, e.key) causing the item to remain in the map with e.next, e.prev, and e.list as nil.

Closing PR since it's most likely an implementation issue

@zachbadgett zachbadgett closed this May 1, 2023
@zachbadgett zachbadgett deleted the panic-fix-remove branch May 8, 2023 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants