Panic: NULL pointer dereference in ZFS dva_get_dsize_sync #191

Open
smokris opened this issue Mar 5, 2019 · 6 comments

smokris commented Mar 5, 2019

One of my systems recently experienced some minor disk corruption, and panicked:

> ::status
debugging crash dump vmcore.0 (64-bit) from […]
operating system: 5.11 joyent_20181011T004530Z (i86pc)
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe000e652780 addr=2a8 occurred in module "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> $C
fffffe000e652890 dva_get_dsize_sync+0x5d(fffffe0910374000, fffffe0a64cd35c0)
fffffe000e6528e0 bp_get_dsize_sync+0xab(fffffe0910374000, fffffe0a64cd35c0)
fffffe000e652950 dbuf_write_ready+0x71(fffffe0a64cd3450, fffffe0916b548f0, fffffe09353e8ec8)
fffffe000e6529c0 arc_write_ready+0x120(fffffe0a64cd3450)
fffffe000e652a20 zio_ready+0x5b(fffffe0a64cd3450)
fffffe000e652a50 zio_execute+0x7f(fffffe0a64cd3450)
fffffe000e652b10 taskq_thread+0x2d0(fffffe090fde64c8)
fffffe000e652b20 thread_start+8()
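
My read of the faulting function (an assumption based on the illumos-gate source for dva_get_dsize_sync() in spa_misc.c around this time, not on the exact joyent bits) is that it dereferences the top-level vdev returned by vdev_lookup_top() without a NULL check, so a DVA carrying an out-of-range vdev index would fault at a small offset off NULL, which is consistent with addr=2a8. Rough sketch of the relevant logic:

uint64_t
dva_get_dsize_sync(spa_t *spa, const dva_t *dva)
{
	uint64_t asize = DVA_GET_ASIZE(dva);
	uint64_t dsize = asize;

	ASSERT(spa_config_held(spa, SCL_ALL, RW_READER) != 0);

	if (asize != 0 && spa->spa_deflate) {
		/*
		 * If DVA_GET_VDEV(dva) is out of range (e.g. because the
		 * block pointer is damaged), vdev_lookup_top() returns
		 * NULL and the multiplication below dereferences it.
		 */
		vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva));
		dsize = (asize >> SPA_MINBLOCKSHIFT) * vd->vdev_deflate_ratio;
	}

	return (dsize);
}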

I'm not able to reproduce the panic (I've since rolled back a few transactions in order to get the pool running again), but I do have the vmdump.

The stack trace is similar to openzfs/zfs#1891, which references https://www.illumos.org/issues/5349. I confirmed that the fix to 5349 is present in the illumos-joyent fork (f63ab3d)… so it's surprising that this panic happened due to a NULL dereference rather than one of the more specific panic messages in zfs_blkptr_verify.
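
For reference, my understanding of the guard that 5349 added (sketched from the upstream zfs_blkptr_verify(); names are from illumos-gate and this is an approximation, not the exact joyent code) is that each DVA's vdev index is checked against the root vdev's child count before the block pointer is used, with zfs_panic_recover() producing one of those specific messages:

	vdev_t *rvd = spa->spa_root_vdev;

	for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
		const dva_t *dva = &bp->blk_dva[i];
		uint64_t vdevid = DVA_GET_VDEV(dva);

		if (vdevid >= rvd->vdev_children) {
			/* Recoverable panic with a descriptive message. */
			zfs_panic_recover("blkptr at %p DVA %u has invalid "
			    "VDEV %llu", bp, i, (longlong_t)vdevid);
		}
	}

So if the block pointer that reached dva_get_dsize_sync() on this write path never went through those checks, that might explain seeing the raw page fault instead of one of the descriptive messages.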

rmustacc commented

Has anyone reached out to get the dump that occurred so we can try to better investigate what's going on? In terms of minor disk corruption, do you have any idea what the source of that was?

smokris commented Mar 12, 2019

> Has anyone reached out to get the dump that occurred so we can try to better investigate what's going on?

No, not yet.

> In terms of minor disk corruption, do you have any idea what the source of that was?

Prior to the panic, zpool scrub had reported a few checksum errors. This pool hasn't had panics or other unclean shutdowns prior to that, so I assume it's just typical bit rot.

smokris commented Mar 15, 2019

Ack, it just happened again. Is there any particular info that would be helpful for me to collect, before I try rolling back transactions again?

smokris commented Mar 25, 2019

After the March 15 panic, I'm unable to import the pool, even if I roll back all available TXGs:

zpool import -o readonly=on           zones   # panic: "blkptr at … has invalid CHECKSUM 0"
zpool import -o readonly=on -T 515235 zones   # panic: "blkptr at … has invalid CHECKSUM 0"
zpool import -o readonly=on -T 515234 zones   # "one or more devices is currently unavailable"
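
(For anyone retracing this: as far as I understand it, and this is an assumption since flag details may differ between illumos and ZoL, the TXG candidates for -T can be read out of the on-disk uberblock ring with zdb against one of the pool's devices, along the lines of the command below; the device path is a placeholder.)

zdb -lu /dev/dsk/c1t0d0s0    # dump vdev labels and their uberblocks; note each "txg ="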

I also tried booting into Ubuntu Server 18.04 to see whether ZoL could import the pool (it couldn't; same panic), and whether https://gist.github.com/jshoward/5685757 could destroy the bad blocks so the pool would import (it destroyed them, but the import still panicked).

Lacking other recovery options, I destroyed and rebuilt the pool. That took the core dump with it, so I no longer have information beyond what I've already posted on this issue.

smokris commented Mar 29, 2019

This afternoon it panicked again — same panic message and stack trace as original post — so I have another core dump available (which I've copied to another system, in case the pool gets corrupted again).

KodyKantor commented

Hi @smokris. Thanks for filing this issue, and sorry for the delay in tracking this down. It'd be great if we could get access to one of the crash dumps from this system. I'll follow up with you over email with instructions on how to get that dump to us.
