random Segfaults on distance_transform_edt with Intel 12 Alder lake (E-Core enabled) #22744

Closed
ZHCSOFT opened this issue Mar 10, 2022 · 5 comments
@ZHCSOFT

ZHCSOFT commented Mar 10, 2022

Hi everyone,

I am currently training an image segmentation network with PyTorch, using a Hausdorff distance loss. To compute the loss, I use distance_transform_edt from scipy.ndimage, together with morphology.py provided by scikit-learn.

My training script works well on other platforms, including a PC (Intel i5-9400F, RTX 2060, Windows 10), Server 1 (AMD Ryzen 7 2700X, RTX A4000, Fedora 33), and Server 2 (AMD Ryzen 7 3700X, RTX A4000, Fedora 34).

However, when I train the model on my Linux PC (Intel i7-12700K, RTX 3080 Ti, Manjaro, Linux kernel 5.16), the computer crashes repeatedly. Usually the training process simply terminates without a Python exception and reports a segmentation fault involving threading.py, queues.py, and morphology.py (details below). Sometimes it even causes a Linux kernel panic, and I have to force a reboot to regain control.

The crashes occur randomly. I have tried installing Ubuntu 20.04 LTS with Linux kernel 5.15 or 5.16, the PyTorch nightly build, scikit-learn-intelex, and the latest NumPy with MKL, but the problem persists.

Monitoring with the sensors and nvidia-smi commands shows no sign of over-temperature or GPU memory overflow.

I have noticed that Intel's 12th-generation Alder Lake CPUs introduce architectural changes to improve performance, which makes the CPU seem suspicious.
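One way to probe the hybrid-core suspicion is to restrict the training process to a fixed subset of logical CPUs and check whether the crashes persist. The following is only a minimal sketch, assuming Linux (the affinity API is not available on every platform). The helper name `pin_to_cpus` is hypothetical, and the assumption that logical CPUs 0-15 are the P-core threads does not come from this report; the actual numbering should be checked with `lscpu --extended`.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the current process to the given logical CPUs (Linux only).

    Returns the resulting affinity set, or None where the affinity API
    is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)
    target = allowed & set(cpus)
    if not target:
        # None of the requested CPUs is usable; leave the affinity unchanged.
        return allowed
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Assumption: logical CPUs 0-15 are the P-core threads on this machine.
    print(pin_to_cpus(range(16)))
```

If the crashes continue with the process pinned this way, the E-cores are less likely to be the cause.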

Any idea what I can do?

Thanks in advance.

Steps/Code to Reproduce

import os
import scipy.signal
import torch
from torch import nn
from scipy.ndimage import distance_transform_edt
import torch.nn.functional as F
from matplotlib import pyplot as plt
from torch import Tensor
import numpy as np
from torch.autograd import Variable

...

class HDDTBinaryLoss(nn.Module):
    def __init__(self):
        """
        Ref: https://arxiv.org/pdf/1904.10030v1.pdf
        """
        super(HDDTBinaryLoss, self).__init__()

    def compute_edts_forhdloss(self, segmentation):
        res = np.zeros(segmentation.shape)
        for i in range(segmentation.shape[0]):
            posmask = segmentation[i]
            negmask = ~posmask
            res[i] = distance_transform_edt(posmask) + distance_transform_edt(negmask)
        return res

    def forward(self, net_output, target):
        """
        net_output: (batch, 2, x, y, z)
        target: ground truth, shape:(batch_size, 1, x, y, z)
        """
        predict_result = net_output.float()
        ground_truth = target.float()
        predict_dist = self.compute_edts_forhdloss(predict_result.detach().cpu().numpy() > 0.5)
        ground_truth_dist = self.compute_edts_forhdloss(ground_truth.detach().cpu().numpy() > 0.5)
        pred_error = (ground_truth - predict_result) ** 2
        dist = predict_dist**2 + ground_truth_dist**2
        dist = torch.from_numpy(dist)
        if dist.device != pred_error.device:
            dist = dist.to(pred_error.device).float()
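        multipled = torch.einsum("bxyz,bxyz->bxyz", pred_error, dist)
        # hd_loss was undefined in the posted snippet; a mean reduction is assumed here.
        hd_loss = multipled.mean()

        return hd_loss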
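To separate the distance transform from the rest of the training loop, the same two calls can be repeated on a fixed input and compared against a reference result. This is a minimal sketch rather than the reporter's code: the names `edt_sum` and `stress_edt`, the array shape, and the iteration count are all assumptions. On healthy hardware the function should report zero mismatches, since the computation is deterministic.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def edt_sum(mask):
    """Distance transform of the mask plus that of its complement."""
    return distance_transform_edt(mask) + distance_transform_edt(~mask)


def stress_edt(iterations=50, shape=(4, 32, 32, 32), seed=0):
    """Recompute the transform repeatedly and count results that differ."""
    rng = np.random.default_rng(seed)
    masks = rng.random(shape) > 0.5
    reference = np.stack([edt_sum(m) for m in masks])
    mismatches = 0
    for _ in range(iterations):
        result = np.stack([edt_sum(m) for m in masks])
        if not np.array_equal(result, reference):
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatching iterations:", stress_edt())
```

A crash or a nonzero count here, with PyTorch and the data loader out of the picture, would point away from the training script itself.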

Expected Results

The loss is computed correctly and training continues without crashing.

Actual Results

Start Validation
Epoch 32/2000: 100%|█████████████████████████████████| 99/99 [00:12<00:00, 7.67it/s, f_h=0.129, hd_h=0.0341, total_loss=0.163]
Finish Validation
Epoch:32/2000
Total Loss: 0.4968 || Val Loss: 0.0538
Epoch 33/2000: 25%|██████ | 201/800 [00:35<01:46, 5.64it/s, Total=0.13, f_h=0.106, hd_h=0.0239, s/step=0.53]

Fatal Python error: Segmentation fault

Thread 0x00007fbf7ba00640 (most recent call first):
File ".conda/envs/Torch/lib/python3.8/threading.py", line 302 in wait
File ".conda/envs/Torch/lib/python3.8/multiprocessing/queues.py", line 227 in _feed
File ".conda/envs/Torch/lib/python3.8/threading.py", line 870 in run
File ".conda/envs/Torch/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File ".conda/envs/Torch/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007fbf7a0b5640 (most recent call first):
File ".conda/envs/Torch/lib/python3.8/threading.py", line 302 in wait
File ".conda/envs/Torch/lib/python3.8/multiprocessing/queues.py", line 227 in _feed
File ".conda/envs/Torch/lib/python3.8/threading.py", line 870 in run
File ".conda/envs/Torch/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File ".conda/envs/Torch/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007fbf79530640 (most recent call first):
File ".conda/envs/Torch/lib/python3.8/threading.py", line 302 in wait
File ".conda/envs/Torch/lib/python3.8/multiprocessing/queues.py", line 227 in _feed
File ".conda/envs/Torch/lib/python3.8/threading.py", line 870 in run
File ".conda/envs/Torch/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File ".conda/envs/Torch/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007fbf7ac3a640 (most recent call first):
File ".conda/envs/Torch/lib/python3.8/threading.py", line 302 in wait
File ".conda/envs/Torch/lib/python3.8/multiprocessing/queues.py", line 227 in _feed
File ".conda/envs/Torch/lib/python3.8/threading.py", line 870 in run
File ".conda/envs/Torch/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File ".conda/envs/Torch/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007fbf75fff640 (most recent call first):
<no Python frame>

Thread 0x00007fbf7c585640 (most recent call first):
File ".conda/envs/Torch/lib/python3.8/threading.py", line 306 in wait
File ".conda/envs/Torch/lib/python3.8/threading.py", line 558 in wait
File ".conda/envs/Torch/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
File ".conda/envs/Torch/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File ".conda/envs/Torch/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007fc078adf380 (most recent call first):
File ".conda/envs/Torch/lib/python3.8/site-packages/scipy/ndimage/morphology.py", line 2299 in distance_transform_edt
File "training_utils.py", line 190 in compute_edts_forhdloss
File "training_utils.py", line 203 in forward
File ".conda/envs/Torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102 in _call_impl
File "training_script.py", line 68 in fit_epochs
File "training_script.py", line 296 in <module>
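The per-thread dump above has the format printed by Python's built-in faulthandler module when the interpreter receives a fatal signal. A minimal sketch for enabling it explicitly and optionally redirecting the dump to a file follows; the helper name `enable_crash_dumps` and the log path are hypothetical.

```python
import faulthandler
import sys


def enable_crash_dumps(log_path=None):
    """Enable faulthandler; dump to log_path if given, else to stderr.

    Returns the open log file (keep a reference so it stays open)
    or None when dumping to stderr.
    """
    log_file = open(log_path, "w") if log_path else None
    faulthandler.enable(file=log_file or sys.stderr, all_threads=True)
    return log_file
```

The same handler can be switched on without code changes by running `python -X faulthandler train_script.py` or by setting the `PYTHONFAULTHANDLER` environment variable.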

dmesg:

[Wed Mar  9 12:37:32 2022] audit: type=1701 audit(1646800652.729:189): auid=1000 uid=1000 gid=1000 ses=4 subj==unconfined pid=1925 comm="python" exe="/home/zhcsoft/anaconda3/envs/Torch/bin/python3.9" sig=11 res=1
[Wed Mar  9 12:37:32 2022] audit: type=1334 audit(1646800652.732:190): prog-id=28 op=LOAD
[Wed Mar  9 12:37:32 2022] audit: type=1334 audit(1646800652.732:191): prog-id=29 op=LOAD
[Wed Mar  9 12:37:32 2022] audit: type=1334 audit(1646800652.732:192): prog-id=30 op=LOAD
[Wed Mar  9 12:37:32 2022] audit: type=1130 audit(1646800652.732:193): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-coredump@2-14749-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[Wed Mar  9 12:37:36 2022] audit: type=1131 audit(1646800656.932:194): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-coredump@2-14749-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[Wed Mar  9 12:37:36 2022] audit: type=1334 audit(1646800657.046:195): prog-id=0 op=UNLOAD
[Wed Mar  9 12:37:36 2022] audit: type=1334 audit(1646800657.046:196): prog-id=0 op=UNLOAD
[Wed Mar  9 12:37:36 2022] audit: type=1334 audit(1646800657.046:197): prog-id=0 op=UNLOAD
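In the audit records above, the first line carries `sig=11` for the `python` process, and the following records show systemd-coredump handling the crash. A small sketch for pulling the process and signal name out of such a line, assuming the field layout shown in this log (the regular expression and the helper name `describe_crash` are illustrative, not part of any audit tooling):

```python
import re
import signal

# Matches the pid, comm and sig fields as they appear in the log above.
AUDIT_CRASH = re.compile(
    r'pid=(?P<pid>\d+) comm="(?P<comm>[^"]+)".*?sig=(?P<sig>\d+)'
)


def describe_crash(line):
    """Return pid, command and signal name from an audit crash record."""
    match = AUDIT_CRASH.search(line)
    if match is None:
        return None
    number = int(match.group("sig"))
    try:
        name = signal.Signals(number).name
    except ValueError:
        name = f"signal {number}"
    return {
        "pid": int(match.group("pid")),
        "comm": match.group("comm"),
        "signal": name,
    }


if __name__ == "__main__":
    sample = (
        'audit: type=1701 audit(1646800652.729:189): auid=1000 uid=1000 '
        'gid=1000 ses=4 subj==unconfined pid=1925 comm="python" '
        'exe="/home/zhcsoft/anaconda3/envs/Torch/bin/python3.9" sig=11 res=1'
    )
    print(describe_crash(sample))
```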

gdb bt:

gdb python

GNU gdb (GDB) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".

Type "apropos word" to search for commands related to "word"...
Reading symbols from python...

(gdb) run train_script.py
Starting program: ~/.conda/envs/Torch/bin/python train_script.py

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7fff285ff640 (LWP 3949)]
[New Thread 0x7fff27dfe640 (LWP 3950)]
[New Thread 0x7fff255fd640 (LWP 3951)]
[New Thread 0x7fff22dfc640 (LWP 3952)]
[New Thread 0x7fff1e5fb640 (LWP 3953)]
[New Thread 0x7fff1bdfa640 (LWP 3954)]
[New Thread 0x7fff1b5f9640 (LWP 3955)]
[New Thread 0x7fff16df8640 (LWP 3956)]
[New Thread 0x7fff145f7640 (LWP 3957)]
[New Thread 0x7fff13df6640 (LWP 3958)]
[New Thread 0x7fff0f5f5640 (LWP 3959)]
[New Thread 0x7fff0cdf4640 (LWP 3960)]
[New Thread 0x7fff0a5f3640 (LWP 3961)]
[New Thread 0x7fff07df2640 (LWP 3962)]
[New Thread 0x7fff055f1640 (LWP 3963)]
[New Thread 0x7fff02df0640 (LWP 3964)]
[New Thread 0x7fff005ef640 (LWP 3965)]
[New Thread 0x7ffefddee640 (LWP 3966)]
[New Thread 0x7ffefb5ed640 (LWP 3967)]
[New Thread 0x7ffef00a4640 (LWP 3975)]
[New Thread 0x7ffeef8a3640 (LWP 3976)]
[New Thread 0x7ffeed0a2640 (LWP 3977)]
[New Thread 0x7ffee88a1640 (LWP 3978)]
[New Thread 0x7ffee80a0640 (LWP 3979)]
[New Thread 0x7ffee589f640 (LWP 3980)]
[New Thread 0x7ffee309e640 (LWP 3981)]
[New Thread 0x7ffee089d640 (LWP 3982)]
[New Thread 0x7ffedc09c640 (LWP 3983)]
[New Thread 0x7ffedb89b640 (LWP 3984)]
[New Thread 0x7ffed709a640 (LWP 3985)]
[New Thread 0x7ffed4899640 (LWP 3986)]
[New Thread 0x7ffed2098640 (LWP 3987)]
[New Thread 0x7ffecf897640 (LWP 3988)]
[New Thread 0x7ffecd096640 (LWP 3989)]
[New Thread 0x7ffeca895640 (LWP 3990)]
[New Thread 0x7ffec8094640 (LWP 3991)]
[New Thread 0x7ffec5893640 (LWP 3992)]
[New Thread 0x7ffec3092640 (LWP 3993)]
[New Thread 0x7ffeaed34640 (LWP 3994)]
[New Thread 0x7ffeae533640 (LWP 3995)]
[New Thread 0x7ffeadd32640 (LWP 3996)]
[New Thread 0x7ffead531640 (LWP 3997)]
[New Thread 0x7ffeacd30640 (LWP 3998)]
[New Thread 0x7ffeac52f640 (LWP 3999)]
[New Thread 0x7ffeabd2e640 (LWP 4000)]
[New Thread 0x7ffeab52d640 (LWP 4001)]
[New Thread 0x7ffeaad2c640 (LWP 4002)]
[New Thread 0x7ffeaa52b640 (LWP 4003)]
[New Thread 0x7ffea9d2a640 (LWP 4004)]
[New Thread 0x7ffeb3bd5640 (LWP 4007)]
[New Thread 0x7ffeb33d4640 (LWP 4008)]
[New Thread 0x7ffeb2bd3640 (LWP 4009)]
[New Thread 0x7ffebdcda640 (LWP 4024)]
Epoch 1/2000:   0%|                                                                                      | 0/800 [00:00<?, ?it/s<class 'dict'>]
[Thread 0x7ffec3092640 (LWP 3993) exited]
[Thread 0x7ffec5893640 (LWP 3992) exited]
[Thread 0x7ffec8094640 (LWP 3991) exited]
[Thread 0x7ffeca895640 (LWP 3990) exited]
[Thread 0x7ffecd096640 (LWP 3989) exited]
[Thread 0x7ffecf897640 (LWP 3988) exited]
[Thread 0x7ffed2098640 (LWP 3987) exited]
[Thread 0x7ffed4899640 (LWP 3986) exited]
[Thread 0x7ffed709a640 (LWP 3985) exited]
[Thread 0x7ffedb89b640 (LWP 3984) exited]
[Thread 0x7ffedc09c640 (LWP 3983) exited]
[Thread 0x7ffee089d640 (LWP 3982) exited]
[Thread 0x7ffee309e640 (LWP 3981) exited]
[Thread 0x7ffee589f640 (LWP 3980) exited]
[Thread 0x7ffee80a0640 (LWP 3979) exited]
[Thread 0x7ffee88a1640 (LWP 3978) exited]
[Thread 0x7ffeed0a2640 (LWP 3977) exited]
[Thread 0x7ffeef8a3640 (LWP 3976) exited]
[Thread 0x7ffef00a4640 (LWP 3975) exited]
[Thread 0x7ffefb5ed640 (LWP 3967) exited]
[Thread 0x7ffefddee640 (LWP 3966) exited]
[Thread 0x7fff005ef640 (LWP 3965) exited]
[Thread 0x7fff02df0640 (LWP 3964) exited]
[Thread 0x7fff055f1640 (LWP 3963) exited]
[Thread 0x7fff07df2640 (LWP 3962) exited]
[Thread 0x7fff0a5f3640 (LWP 3961) exited]
[Thread 0x7fff0cdf4640 (LWP 3960) exited]
[Thread 0x7fff0f5f5640 (LWP 3959) exited]
[Thread 0x7fff13df6640 (LWP 3958) exited]
[Thread 0x7fff145f7640 (LWP 3957) exited]
[Thread 0x7fff16df8640 (LWP 3956) exited]
[Thread 0x7fff1b5f9640 (LWP 3955) exited]
[Thread 0x7fff1bdfa640 (LWP 3954) exited]
[Thread 0x7fff1e5fb640 (LWP 3953) exited]
[Thread 0x7fff22dfc640 (LWP 3952) exited]
[Thread 0x7fff255fd640 (LWP 3951) exited]
[Thread 0x7fff27dfe640 (LWP 3950) exited]
[Thread 0x7fff285ff640 (LWP 3949) exited]
[Detaching after fork from child process 4025]
[Detaching after fork from child process 4045]
[Detaching after fork from child process 4065]
[Detaching after fork from child process 4085]
[New Thread 0x7ffefb5ed640 (LWP 4086)]
[New Thread 0x7ffefddee640 (LWP 4105)]
[New Thread 0x7fff005ef640 (LWP 4107)]
[New Thread 0x7fff02df0640 (LWP 4108)]
[New Thread 0x7fff255fd640 (LWP 4117)]
Epoch 1/2000: 801it [02:08,  6.25it/s, Total=0.686, f_h=0.166, hd_h=0.52, s/step=0.328]                                                        
Start Validation
Epoch 1/2000:   0%|                                                                                       | 0/99 [00:00<?, ?it/s<class 'dict'>]
[Thread 0x7ffefb5ed640 (LWP 4086) exited]
[Thread 0x7ffefddee640 (LWP 4105) exited]
[Thread 0x7fff02df0640 (LWP 4108) exited]
[Thread 0x7fff005ef640 (LWP 4107) exited]
[Detaching after fork from child process 4426]
[Detaching after fork from child process 4446]
[New Thread 0x7fff005ef640 (LWP 4447)]
[New Thread 0x7fff02df0640 (LWP 4448)]
Epoch 1/2000: 100%|█████████████████████████████████████████████████████| 99/99 [00:11<00:00,  8.82it/s, f_h=0.197, hd_h=13.1, total_loss=13.3]
[Thread 0x7fff005ef640 (LWP 4447) exited]
[Thread 0x7fff02df0640 (LWP 4448) exited]
[New Thread 0x7fff02df0640 (LWP 4496)]
[New Thread 0x7fff005ef640 (LWP 4497)]
[New Thread 0x7ffefddee640 (LWP 4498)]
[New Thread 0x7ffefb5ed640 (LWP 4499)]
[New Thread 0x7fff1e5fb640 (LWP 4500)]
[New Thread 0x7fff1bdfa640 (LWP 4501)]
[New Thread 0x7fff1b5f9640 (LWP 4502)]
[New Thread 0x7fff16df8640 (LWP 4503)]
[New Thread 0x7fff145f7640 (LWP 4504)]
[New Thread 0x7fff13df6640 (LWP 4505)]
[New Thread 0x7fff0f5f5640 (LWP 4506)]
[New Thread 0x7fff0cdf4640 (LWP 4507)]
[New Thread 0x7fff0a5f3640 (LWP 4508)]
[New Thread 0x7fff07df2640 (LWP 4509)]
[New Thread 0x7fff055f1640 (LWP 4510)]
[New Thread 0x7ffef00a4640 (LWP 4511)]
[New Thread 0x7ffeef8a3640 (LWP 4512)]
[New Thread 0x7ffeed0a2640 (LWP 4513)]
[New Thread 0x7ffee88a1640 (LWP 4514)]
Finish Validation
Epoch:1/2000
Total Loss: 0.2286 || Val Loss: 4.3819 
Saving state, iter: 1
Epoch 2/2000:   0%|                                                                                      | 0/800 [00:00<?, ?it/s<class 'dict'>]
[Thread 0x7ffee88a1640 (LWP 4514) exited]
[Thread 0x7ffeed0a2640 (LWP 4513) exited]
[Thread 0x7ffeef8a3640 (LWP 4512) exited]
[Thread 0x7ffef00a4640 (LWP 4511) exited]
[Thread 0x7fff055f1640 (LWP 4510) exited]
[Thread 0x7fff07df2640 (LWP 4509) exited]
[Thread 0x7fff0a5f3640 (LWP 4508) exited]
[Thread 0x7fff0cdf4640 (LWP 4507) exited]
[Thread 0x7fff0f5f5640 (LWP 4506) exited]
[Thread 0x7fff13df6640 (LWP 4505) exited]
[Thread 0x7fff145f7640 (LWP 4504) exited]
[Thread 0x7fff16df8640 (LWP 4503) exited]
[Thread 0x7fff1b5f9640 (LWP 4502) exited]
[Thread 0x7fff1bdfa640 (LWP 4501) exited]
[Thread 0x7fff1e5fb640 (LWP 4500) exited]
[Thread 0x7ffefb5ed640 (LWP 4499) exited]
[Thread 0x7ffefddee640 (LWP 4498) exited]
[Thread 0x7fff005ef640 (LWP 4497) exited]
[Thread 0x7fff02df0640 (LWP 4496) exited]
[Detaching after fork from child process 4527]
[Detaching after fork from child process 4547]
[Detaching after fork from child process 4567]
[Detaching after fork from child process 4587]
[New Thread 0x7ffee88a1640 (LWP 4588)]
[New Thread 0x7ffeed0a2640 (LWP 4589)]
[New Thread 0x7ffeef8a3640 (LWP 4604)]
[New Thread 0x7ffef00a4640 (LWP 4610)]
Epoch 2/2000:  95%|███████████████████████████████████████▊  | 759/800 [02:02<00:06,  6.21it/s, Total=14.6, f_h=0.157, hd_h=14.4, s/step=0.481]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
convertitem (arg=0x7ffeb3c0fcf0, p_format=0x7fffffffc528, p_va=0x7fffffffc790, flags=0, levels=0x7fffffffc5c0, msgbuf=<optimized out>, bufsize=256, freelist=0x7fffffffc530) at /tmp/build/80754af9/python-split_1631797238431/work/Python/getargs.c:601
601     /tmp/build/80754af9/python-split_1631797238431/work/Python/getargs.c: No such file or directory.

(gdb) bt

#0  convertitem (arg=0x7ffeb3c0fcf0, p_format=0x7fffffffc528, p_va=0x7fffffffc790, flags=0, levels=0x7fffffffc5c0, msgbuf=<optimized out>, 
    bufsize=256, freelist=0x7fffffffc530) at /tmp/build/80754af9/python-split_1631797238431/work/Python/getargs.c:601
#1  0x0000555555681bbf in vgetargs1_impl (compat_args=<optimized out>, stack=0x7fff2766c498, nargs=3, format=<optimized out>, 
    p_va=0x7fffffffc790, flags=0) at /tmp/build/80754af9/python-split_1631797238431/work/Python/getargs.c:391
#2  0x0000555555713346 in vgetargs1 (flags=0, p_va=0x7fffffffc790, format=<optimized out>, args=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/getargs.c:434
#3  PyArg_ParseTuple (args=<optimized out>, format=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/getargs.c:129
#4  0x00007ffebead89e6 in ?? ()
   from ~/.conda/envs/Torch/lib/python3.9/site-packages/scipy/ndimage/_nd_image.cpython-39-x86_64-linux-gnu.so
#5  0x00005555556c8738 in cfunction_call (func=0x7ffebeb586d0, args=<optimized out>, kwargs=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/methodobject.c:552
#6  0x00005555556989ef in _PyObject_MakeTpCall (tstate=0x5555559183f0, callable=0x7ffebeb586d0, args=<optimized out>, nargs=<optimized out>, 
    keywords=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:191
#7  0x0000555555722d89 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x55555a9cf078, callable=<optimized out>, 
    tstate=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:116
#8  PyObject_Vectorcall () at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:127
#9  call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5555559183f0)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:5075
#10 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x55555a9ceea0, throwflag=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:3487
#11 0x00005555556d68e2 in _PyEval_EvalFrame () at /tmp/build/80754af9/python-split_1631797238431/work/Include/internal/pycore_ceval.h:40
#12 _PyEval_EvalCode (tstate=<optimized out>, _co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kwnames=0x0, kwargs=0x7ffeaed65930, kwcount=0, kwstep=1, defs=0x7ffebeb71aa8, defcount=5, kwdefs=0x0, 
    closure=0x0, name=<optimized out>, qualname=0x7ffef3b5dc60) at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4327
#13 0x00005555556d7527 in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, 
    kwnames=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:396
#14 0x000055555564eebb in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ffeaed65928, callable=0x7ffebea3f3a0, 
    tstate=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:118
#15 PyObject_Vectorcall () at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:127
#16 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5555559183f0)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:5075
#17 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7ffeaed65780, throwflag=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:3518
#18 0x0000555555702074 in _PyEval_EvalFrame () at /tmp/build/80754af9/python-split_1631797238431/work/Include/internal/pycore_ceval.h:40
#19 function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>, tstate=0x5555559183f0)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:330
#20 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7ffeaed47c08, func=0x7ffebddaf280)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:367
#21 _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7ffeaed47c08, callable=0x7ffebddaf280, 
    tstate=0x5555559183f0) at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:118
#22 method_vectorcall (method=<optimized out>, args=0x7ffeaed47c10, nargsf=<optimized out>, kwnames=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/classobject.c:53
#23 0x000055555564dfdc in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ffeaed47c10, callable=0x7fff22ddbc40, 
    tstate=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:118
#24 PyObject_Vectorcall () at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:127
#25 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5555559183f0)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:5075
#26 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7ffeaed47a40, throwflag=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:3487
#27 0x00005555556d7753 in _PyEval_EvalFrame () at /tmp/build/80754af9/python-split_1631797238431/work/Include/internal/pycore_ceval.h:40
#28 function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>, tstate=0x5555559183f0)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:330
#29 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:367
#30 0x000055555570295b in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=3, args=0x7fffffffd0a0, callable=0x7ffebddaf310, 
    tstate=0x5555559183f0) at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:118
#31 method_vectorcall (method=<optimized out>, args=0x7fff276900d8, nargsf=<optimized out>, kwnames=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/classobject.c:83
#32 0x000055555568c8f8 in PyVectorcall_Call (kwargs=<optimized out>, tuple=0x7fff276900c0, callable=0x7fff27699240)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:231
#33 _PyObject_Call (tstate=<optimized out>, callable=0x7fff27699240, args=0x7fff276900c0, kwargs=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:266
#34 0x0000555555720740 in PyObject_Call (kwargs=0x7fff2768d580, args=0x7fff276900c0, callable=0x7fff27699240)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:293
#35 do_call_core (kwdict=0x7fff2768d580, callargs=0x7fff276900c0, func=0x7fff27699240, tstate=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:5123
#36 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x5556487959e0, throwflag=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:3580
#37 0x00005555556d68e2 in _PyEval_EvalFrame () at /tmp/build/80754af9/python-split_1631797238431/work/Include/internal/pycore_ceval.h:40
#38 _PyEval_EvalCode (tstate=<optimized out>, _co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kwnames=0x0, kwargs=0x7fffffffd488, kwcount=0, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, 
    name=<optimized out>, qualname=0x7ffff3f6d080) at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4327
#39 0x00005555556d7527 in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, 
    kwnames=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:396
#40 0x00005555556c87ca in _PyObject_FastCallDictTstate (tstate=0x5555559183f0, callable=0x7ffff3f03310, args=<optimized out>, 
    nargsf=<optimized out>, kwargs=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:118
#41 0x00005555556d2275 in _PyObject_Call_Prepend (kwargs=0x0, args=0x7fff2766cc80, obj=<optimized out>, callable=0x7ffff3f03310, 
    tstate=0x5555559183f0) at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:489
#42 slot_tp_call (self=<optimized out>, args=0x7fff2766cc80, kwds=0x0)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/typeobject.c:6718
#43 0x00005555556989ef in _PyObject_MakeTpCall (tstate=0x5555559183f0, callable=0x7ffeb509efa0, args=<optimized out>, nargs=<optimized out>, 
    keywords=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:191
#44 0x000055555571e8b4 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x555640987b78, callable=<optimized out>, 
    tstate=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:116
#45 PyObject_Vectorcall () at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:127
#46 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5555559183f0)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:5075
#47 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x5556409878c0, throwflag=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:3518
#48 0x00005555556d68e2 in _PyEval_EvalFrame () at /tmp/build/80754af9/python-split_1631797238431/work/Include/internal/pycore_ceval.h:40
#49 _PyEval_EvalCode (tstate=<optimized out>, _co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kwnames=0x7ffff7452a18, kwargs=0x555555992378, kwcount=2, kwstep=1, defs=0x7ffff75289e8, defcount=1, 
    kwdefs=0x0, closure=0x0, name=<optimized out>, qualname=0x7ffff743a4f0)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4327
#50 0x00005555556d7527 in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, 
    kwnames=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:396
#51 0x000055555564e90b in _PyObject_VectorcallTstate (kwnames=0x7ffff7452a00, nargsf=<optimized out>, args=<optimized out>, 
    callable=0x7ffff7435ee0, tstate=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:118
#52 PyObject_Vectorcall () at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:127
#53 call_function (kwnames=0x7ffff7452a00, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:5075
#54 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x5555559921c0, throwflag=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:3535
#55 0x00005555556d68e2 in _PyEval_EvalFrame () at /tmp/build/80754af9/python-split_1631797238431/work/Include/internal/pycore_ceval.h:40
#56 _PyEval_EvalCode (tstate=<optimized out>, _co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kwnames=0x0, kwargs=0x0, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, 
    name=<optimized out>, qualname=0x0) at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4327
#57 0x0000555555788bac in _PyEval_EvalCodeWithName (qualname=0x0, name=0x0, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2, 
    kwcount=0, kwargs=<optimized out>, kwnames=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=<optimized out>, 
    globals=<optimized out>, _co=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4359
#58 PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, 
    kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4375
#59 0x00005555556d79eb in PyEval_EvalCode (co=co@entry=0x7ffff754f500, globals=globals@entry=0x7ffff75d5e40, 
    locals=locals@entry=0x7ffff75d5e40) at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:826
#60 0x0000555555788c5b in run_eval_code_obj (tstate=0x5555559183f0, co=0x7ffff754f500, globals=0x7ffff75d5e40, locals=0x7ffff75d5e40)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:1219
#61 0x00005555557bc705 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ffff75d5e40, locals=0x7ffff75d5e40, 
    flags=<optimized out>, arena=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:1240
#62 0x000055555566160d in pyrun_file (fp=0x555555911470, filename=0x7ffff7435a50, start=<optimized out>, globals=0x7ffff75d5e40, 
    locals=0x7ffff75d5e40, closeit=1, flags=0x7fffffffdb68) at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:1138
#63 0x00005555557c149f in pyrun_simple_file (flags=0x7fffffffdb68, closeit=1, filename=0x7ffff7435a50, fp=0x555555911470)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:449
#64 PyRun_SimpleFileExFlags (fp=0x555555911470, filename=<optimized out>, closeit=1, flags=0x7fffffffdb68)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:482
#65 0x00005555557c1c7f in pymain_run_file (cf=0x7fffffffdb68, config=0x555555916b50)
    at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:379
#66 pymain_run_python (exitcode=0x7fffffffdb60) at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:604
#67 Py_RunMain () at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:683
#68 0x00005555557c1d79 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:1129
#69 0x00007ffff7cc3310 in __libc_start_call_main () from /usr/lib/libc.so.6
#70 0x00007ffff7cc33c1 in __libc_start_main_impl () from /usr/lib/libc.so.6
#71 0x0000555555746bc3 in _start ()

Versions

(Same problem occurs on python 3.9)

System:
   python: 3.9.7 (default, Sep 16 2021, 13:09:58)  [GCC 7.5.0]
   executable: /home/zhcsoft/anaconda3/envs/Torch/bin/python
   machine: Linux-5.16.11-2-MANJARO-x86_64-with-glibc2.35

Python dependencies:
          pip: 22.0.3
   setuptools: 58.0.4
      sklearn: 1.0.2
        numpy: 1.22.3
        scipy: 1.8.0
       Cython: None
       pandas: 1.4.1
   matplotlib: 3.5.1
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
@ZHCSOFT ZHCSOFT added the Bug and Needs Triage labels Mar 10, 2022
@ZHCSOFT ZHCSOFT changed the title from "random segfaults on distance_transform_edt when using Intel 12 Alder lake (E-Core enabled)" to "random Segfaults on distance_transform_edt with Intel 12 Alder lake (E-Core enabled)" Mar 10, 2022
@ogrisel
Member

ogrisel commented Mar 10, 2022

Could you please run:

python -m threadpoolctl -i numpy -i scipy.linalg

and report the results?

@ogrisel
Member

ogrisel commented Mar 10, 2022

It seems that this problem is unrelated to scikit-learn because it happens in a scipy call and I cannot see any sklearn related call in the tracebacks:

File ".conda/envs/Torch/lib/python3.8/site-packages/scipy/ndimage/morphology.py", line 2299 in distance_transform_edt

but it could be related to OpenBLAS or numpy that do have CPU-specific optimizations. Hence the output of threadpoolctl might help clarify which OpenBLAS version you are using.

@ZHCSOFT
Author

ZHCSOFT commented Mar 10, 2022

Here is the requested output. Maybe I should upgrade OpenBLAS to 0.3.20?

python -m threadpoolctl -i sklearn
[
  {
    "user_api": "openmp",
    "internal_api": "openmp",
    "prefix": "libgomp",
    "filepath": "~/anaconda3/envs/Torch/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0",
    "version": null,
    "num_threads": 20
  },
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "~/anaconda3/envs/Torch/lib/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-2f7c42d4.3.18.so",
    "version": "0.3.18",
    "threading_layer": "pthreads",
    "architecture": "Haswell",
    "num_threads": 20
  },
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "~/anaconda3/envs/Torch/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so",
    "version": "0.3.17",
    "threading_layer": "pthreads",
    "architecture": "Haswell",
    "num_threads": 20
  }
]
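The report above lists two separate OpenBLAS builds, one bundled with NumPy and one with SciPy. A short sketch for condensing such JSON into one line per BLAS library follows; the helper name `summarize_blas` is hypothetical, and the sample data is abbreviated from the output above.

```python
import json


def summarize_blas(report_json):
    """List (bundling package dir, backend, version, architecture) per BLAS entry."""
    summary = []
    for entry in json.loads(report_json):
        if entry.get("user_api") != "blas":
            continue
        # The directory right after site-packages/ names the bundling package.
        package_dir = entry["filepath"].split("site-packages/")[-1].split("/")[0]
        summary.append(
            (package_dir, entry["internal_api"],
             entry.get("version"), entry.get("architecture"))
        )
    return summary


if __name__ == "__main__":
    sample = json.dumps([
        {"user_api": "openmp", "internal_api": "openmp",
         "filepath": "~/anaconda3/envs/Torch/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0"},
        {"user_api": "blas", "internal_api": "openblas", "version": "0.3.18",
         "architecture": "Haswell",
         "filepath": "~/anaconda3/envs/Torch/lib/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-2f7c42d4.3.18.so"},
        {"user_api": "blas", "internal_api": "openblas", "version": "0.3.17",
         "architecture": "Haswell",
         "filepath": "~/anaconda3/envs/Torch/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so"},
    ])
    for row in summarize_blas(sample):
        print(row)
```

Both entries report the "Haswell" architecture, which presumably reflects the kernel set OpenBLAS selected at run time rather than the actual CPU model.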

@thomasjpfan thomasjpfan removed the Needs Triage label Mar 10, 2022
@jeremiedbb
Member

Here is the requested output. Maybe I should upgrade OpenBLAS to 0.3.20?

Having numpy and scipy using different versions of openblas has not been an issue so far so I don't think it has anything to do with the issue you're facing.

The backtrace and the snippet don't involve scikit-learn at all. This issue should probably be posted on the scipy issue tracker. I'm closing it here.

@ZHCSOFT
Author

ZHCSOFT commented Apr 3, 2022

I have confirmed that a DRAM issue caused this problem, not OpenBLAS or MKL. As a side note, training with MKL seems to run slower than with OpenBLAS on the 12th-generation Intel CPU with Linux 5.17 (3.5 vs. 4.0 iterations per second).

For some compatibility-related reason, once I enabled the X.M.P. built-in memory overclocking function in the BIOS, random kernel panics and CRC errors occurred, even with no external program loaded or when merely unzipping files, especially under Windows Server. After I turned off X.M.P. overclocking and replaced the DRAM, everything works well.
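Faulty or unstable memory of this kind can sometimes be noticed from user space by copying a block of data repeatedly and comparing checksums. The sketch below is only a crude illustration under that assumption; the function name `memory_pattern_check` and the sizes are hypothetical, and a clean result does not prove the RAM is healthy. A dedicated tester such as memtest86+ is the more reliable tool.

```python
import hashlib
import os


def memory_pattern_check(size_mb=64, passes=3):
    """Copy a random block several times and count checksum mismatches."""
    block = os.urandom(size_mb * 1024 * 1024)
    expected = hashlib.sha256(block).hexdigest()
    failures = 0
    for _ in range(passes):
        copy = bytearray(block)  # forces a fresh allocation and a full copy
        if hashlib.sha256(copy).hexdigest() != expected:
            failures += 1
    return failures


if __name__ == "__main__":
    print("checksum mismatches:", memory_pattern_check())
```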
