
Crash when opening same NC4 file from different threads BUT under global mutex #2496

Open
rouault opened this issue Sep 3, 2022 · 3 comments

Comments

Contributor

rouault commented Sep 3, 2022

This is an attempt at providing a minimal reproducer for OSGeo/gdal#6253.

The attached docker.zip contains a simple Dockerfile building libhdf5 and libnetcdf, and a simple C++ program.
The C++ program repeatedly creates two threads, each of which opens the same NC4 file; one of them also calls nc_inq_varname(). Note that all calls to the netCDF API are protected by a common mutex, so there is no concurrent access to the netCDF API.

I've tried different versions of hdf5 and netcdf, and compiling hdf5 with or without --enable-unsupported --enable-threadsafe, but the crash always occurs.

How to reproduce:

unzip docker.zip
cd docker
docker build -t netcdf_issue .
docker run --rm -it netcdf_issue:latest  /usr/bin/test /alldatatypes.nc
docker run --rm -it netcdf_issue:latest  valgrind /usr/bin/test /alldatatypes.nc || echo "this failed!"

results in

T1 begin
T2 begin
T1 open 65536
T2 open 131072
T1 close 65536
T1 end
T2 begin nc_inq_varname(131072, 15)
this failed!

and under valgrind:

$ docker run --rm -it netcdf_issue:latest  valgrind /usr/bin/test /alldatatypes.nc
[...]
==1== Thread 3:
==1== Invalid read of size 1
==1==    at 0x4F921F8: H5F_addr_decode (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x519A1F6: H5VL__native_blob_specific (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x51922AC: H5VL_blob_specific (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x517F48F: H5T__vlen_disk_isnull (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x510387B: H5T__conv_vlen (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x50EDD6B: H5T_convert (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x4F57159: H5D_get_create_plist (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x519B3D4: H5VL__native_dataset_get (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x5186A77: H5VL_dataset_get (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x4F2772A: H5Dget_create_plist (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x48EDCB5: nc4_get_var_meta (in /usr/lib/libnetcdf.so.19.1.0)
==1==    by 0x48EAC09: nc4_hdf5_find_grp_var_att (in /usr/lib/libnetcdf.so.19.1.0)
==1==  Address 0x40 is not stack'd, malloc'd or (recently) free'd
==1== 
==1== 
==1== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==1==  Access not within mapped region at address 0x40
==1==    at 0x4F921F8: H5F_addr_decode (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x519A1F6: H5VL__native_blob_specific (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x51922AC: H5VL_blob_specific (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x517F48F: H5T__vlen_disk_isnull (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x510387B: H5T__conv_vlen (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x50EDD6B: H5T_convert (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x4F57159: H5D_get_create_plist (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x519B3D4: H5VL__native_dataset_get (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x5186A77: H5VL_dataset_get (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x4F2772A: H5Dget_create_plist (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x48EDCB5: nc4_get_var_meta (in /usr/lib/libnetcdf.so.19.1.0)
==1==    by 0x48EAC09: nc4_hdf5_find_grp_var_att (in /usr/lib/libnetcdf.so.19.1.0)
==1==  If you believe this happened as a result of a stack

Dockerfile:

FROM ubuntu:20.04
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --fix-missing --no-install-recommends g++ make autoconf automake zlib1g-dev wget tar ca-certificates
RUN wget https://github.com/HDFGroup/hdf5/archive/refs/tags/hdf5-1_13_2.tar.gz && \
    tar xvzf hdf5-1_13_2.tar.gz && \
    cd hdf5-hdf5-1_13_2 && \
    #  --enable-unsupported --enable-threadsafe
    ./configure --prefix=/usr --disable-static --disable-tests && \
    make -j$(nproc) install
RUN wget https://github.com/Unidata/netcdf-c/archive/refs/tags/v4.9.0.tar.gz && \
    tar xvzf v4.9.0.tar.gz && \
    cd netcdf-c-4.9.0 && \
    ./configure --prefix=/usr && \
    make -j$(nproc) install
COPY test.cpp /
COPY alldatatypes.nc /
RUN g++ /test.cpp -lnetcdf -lpthread -o /usr/bin/test
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --fix-missing --no-install-recommends valgrind

test.cpp:

#include <thread>
#include <mutex>
#include <chrono>
#include <cstdio>

#include "netcdf.h"

// Global mutex serializing all calls into the netCDF API.
std::mutex oMutex;

int main(int argc, char* argv[])
{
    for (int i = 0; i < 1000; i++) {
        // Thread 1: open the file, then close it shortly afterwards.
        std::thread t1([argv] {
            int cdfid;
            printf("T1 begin\n");
            {
                std::lock_guard<std::mutex> oLock(oMutex);
                nc_open(argv[1], NC_NOWRITE, &cdfid);
                printf("T1 open %d\n", cdfid);
            }
            std::this_thread::sleep_for(std::chrono::microseconds(1));
            {
                std::lock_guard<std::mutex> oLock(oMutex);
                nc_close(cdfid);
                printf("T1 close %d\n", cdfid);
            }
            printf("T1 end\n");
        });
        // Thread 2: open the same file, query the name of "string_var", then close.
        std::thread t2([argv] {
            int cdfid2;
            printf("T2 begin\n");
            {
                std::lock_guard<std::mutex> oLock(oMutex);
                nc_open(argv[1], NC_NOWRITE, &cdfid2);
                printf("T2 open %d\n", cdfid2);
            }
            std::this_thread::sleep_for(std::chrono::microseconds(1));
            {
                int nVarId = 0;
                std::lock_guard<std::mutex> oLock(oMutex);
                nc_inq_varid(cdfid2, "string_var", &nVarId);
                char szName[NC_MAX_NAME + 1] = {};
                printf("T2 begin nc_inq_varname(%d, %d)\n", cdfid2, nVarId);
                nc_inq_varname(cdfid2, nVarId, szName);
                printf("T2 end nc_inq_varname(%d, %d)\n", cdfid2, nVarId);
            }
            {
                std::lock_guard<std::mutex> oLock(oMutex);
                nc_close(cdfid2);
                printf("T2 close %d\n", cdfid2);
            }
            printf("T2 end\n");

        });
        t1.join();
        t2.join();
    }
    printf("success!\n");
    return 0;
}

lnicola commented Sep 3, 2022

Possibly related, we're seeing a similar crash when opening different files: georust/gdal#299. One difference is that there it seems to overflow the stack while trying to print a stack trace (but that's just a guess; I'm not familiar with the two libraries).

@DennisHeimbigner
Collaborator

Do you know if you are using pthreads for your mutex?

Contributor Author

rouault commented Sep 3, 2022

Do you know if you are using pthreads for your mutex?

Using std::thread requires linking the code with -lpthread. Apparently the std::mutex implementation on Linux / gcc also uses pthread_mutex_lock() / pthread_mutex_unlock() underneath.

cf

$ cat test2.cpp 

#include <mutex>

std::mutex x;
int y = 0;
int foo()
{
    std::lock_guard<std::mutex> locker(x);
    return ++y;
}

$ g++ -O2 test2.cpp -c
$ LC_ALL=C objdump -sdwxC test2.o

0000000000000000 <foo()>:
   0:	f3 0f 1e fa          	endbr64 
   4:	41 54                	push   %r12
   6:	53                   	push   %rbx
   7:	48 83 ec 08          	sub    $0x8,%rsp
   b:	48 8b 1d 00 00 00 00 	mov    0x0(%rip),%rbx        # 12 <foo()+0x12>	e: R_X86_64_REX_GOTPCRELX	__pthread_key_create-0x4
  12:	48 85 db             	test   %rbx,%rbx
  15:	74 10                	je     27 <foo()+0x27>
  17:	48 8d 3d 00 00 00 00 	lea    0x0(%rip),%rdi        # 1e <foo()+0x1e>	1a: R_X86_64_PC32	x-0x4
  1e:	e8 00 00 00 00       	callq  23 <foo()+0x23>	1f: R_X86_64_PLT32	pthread_mutex_lock-0x4
  23:	85 c0                	test   %eax,%eax
  25:	75 2d                	jne    54 <foo()+0x54>
  27:	8b 05 00 00 00 00    	mov    0x0(%rip),%eax        # 2d <foo()+0x2d>	29: R_X86_64_PC32	y-0x4
  2d:	44 8d 60 01          	lea    0x1(%rax),%r12d
  31:	44 89 25 00 00 00 00 	mov    %r12d,0x0(%rip)        # 38 <foo()+0x38>	34: R_X86_64_PC32	y-0x4
  38:	48 85 db             	test   %rbx,%rbx
  3b:	74 0c                	je     49 <foo()+0x49>
  3d:	48 8d 3d 00 00 00 00 	lea    0x0(%rip),%rdi        # 44 <foo()+0x44>	40: R_X86_64_PC32	x-0x4
  44:	e8 00 00 00 00       	callq  49 <foo()+0x49>	45: R_X86_64_PLT32	pthread_mutex_unlock-0x4
  49:	48 83 c4 08          	add    $0x8,%rsp
  4d:	44 89 e0             	mov    %r12d,%eax
  50:	5b                   	pop    %rbx
  51:	41 5c                	pop    %r12
  53:	c3                   	retq   
  54:	89 c7                	mov    %eax,%edi
  56:	e8 00 00 00 00       	callq  5b <x+0x3b>	57: R_X86_64_PLT32	std::__throw_system_error(int)-0x4

Note that the original reproducer with GDAL uses a global lock that is a pthread_mutex_t.
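
For illustration, here is a minimal sketch (not taken from GDAL or the reproducer above; the openAndInquire() helper is made up) of the same access pattern written directly against the pthreads API, with every netCDF call serialized behind one global pthread_mutex_t:

#include <pthread.h>

#include "netcdf.h"

// Global pthread mutex, analogous to the global lock used on the GDAL side.
static pthread_mutex_t gMutex = PTHREAD_MUTEX_INITIALIZER;

// Hypothetical helper: open a file, query one variable name, and close it,
// serializing every netCDF call behind the global pthread mutex.
static int openAndInquire(const char* pszFilename)
{
    int cdfid = -1;

    pthread_mutex_lock(&gMutex);
    int status = nc_open(pszFilename, NC_NOWRITE, &cdfid);
    pthread_mutex_unlock(&gMutex);
    if (status != NC_NOERR)
        return status;

    pthread_mutex_lock(&gMutex);
    int nVarId = 0;
    char szName[NC_MAX_NAME + 1] = {};
    nc_inq_varid(cdfid, "string_var", &nVarId);
    nc_inq_varname(cdfid, nVarId, szName);
    pthread_mutex_unlock(&gMutex);

    pthread_mutex_lock(&gMutex);
    status = nc_close(cdfid);
    pthread_mutex_unlock(&gMutex);
    return status;
}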
I seem to have found a workaround on the GDAL side in OSGeo/gdal#6311 by re-using the same netCDF handle when sequences like nc_open("same_file.nc", ...); nc_open("same_file.nc", ...); occur.
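
A minimal sketch of that kind of handle reuse (not the actual OSGeo/gdal#6311 change; the NCHandleCache class below is made up for illustration and, for simplicity, keys only on the file path and ignores the open mode): a reference-counted map from path to netCDF id, so a second open of an already-open file returns the existing handle, and only the last close actually calls nc_close().

#include <map>
#include <mutex>
#include <string>

#include "netcdf.h"

// Hypothetical cache: reuse an already-open netCDF handle for the same path
// instead of calling nc_open() a second time on the same file.
class NCHandleCache
{
    struct Entry { int cdfid; int refCount; };
    std::mutex m_mutex;
    std::map<std::string, Entry> m_entries;

public:
    // Returns NC_NOERR or a netCDF error code; *pCdfid receives the handle.
    int open(const std::string& path, int mode, int* pCdfid)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_entries.find(path);
        if (it != m_entries.end()) {
            it->second.refCount++;
            *pCdfid = it->second.cdfid;
            return NC_NOERR;
        }
        int status = nc_open(path.c_str(), mode, pCdfid);
        if (status == NC_NOERR)
            m_entries[path] = Entry{*pCdfid, 1};
        return status;
    }

    // Only the last user of a given file actually calls nc_close().
    int close(const std::string& path)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_entries.find(path);
        if (it == m_entries.end())
            return NC_EBADID;
        if (--it->second.refCount == 0) {
            int status = nc_close(it->second.cdfid);
            m_entries.erase(it);
            return status;
        }
        return NC_NOERR;
    }
};

The idea, following the workaround described above, is that the same file is never opened a second time while it is already open, which is the pattern that triggers the crash.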
