Segmentation fault in python google cloud libraries #13327

Closed
edgartanaka opened this issue Nov 9, 2017 · 31 comments

@edgartanaka

edgartanaka commented Nov 9, 2017

Please answer these questions before submitting your issue.

What version of gRPC and what language are you using?

I'm using Python. We are using several Python gcloud libraries:
google-api-core==0.1.1
google-auth==1.1.1
google-cloud==0.29.0
google-cloud-bigquery==0.27.0
google-cloud-bigtable==0.28.1
google-cloud-core==0.27.1
google-cloud-datastore==1.4.0
google-cloud-dns==0.28.0
google-cloud-error-reporting==0.28.0
google-cloud-firestore==0.28.0
google-cloud-language==0.31.0
google-cloud-logging==1.4.0
google-cloud-monitoring==0.28.0
google-cloud-pubsub==0.29.0
google-cloud-resource-manager==0.28.0
google-cloud-runtimeconfig==0.28.0
google-cloud-spanner==0.29.0
google-cloud-speech==0.30.0
google-cloud-storage==1.6.0
google-cloud-trace==0.16.0
google-cloud-translate==1.3.0
google-cloud-videointelligence==0.28.0
google-cloud-vision==0.28.0
google-gax==0.15.15
google-resumable-media==0.3.1
googleapis-common-protos==1.5.3
grpc-google-iam-v1==0.11.4
grpcio==1.7.0

What operating system (Linux, Windows, …) and version?

(venv) tanakaed@triage-bot:~/server$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial

What runtime / compiler are you using (e.g. python version or version of gcc)

Python info:
(venv) tanakaed@triage-bot:~/server$ python --version
Python 3.6.2 :: Anaconda, Inc.
(venv) tanakaed@triage-bot:~/server$ conda info
Current conda install:

           platform : linux-64
      conda version : 4.3.27
   conda is private : False
  conda-env version : 4.3.27
conda-build version : 3.0.23
     python version : 3.6.2.final.0
   requests version : 2.18.4
   root environment : /home/tanakaed/anaconda3  (writable)
default environment : /home/tanakaed/anaconda3/envs/venv
   envs directories : /home/tanakaed/anaconda3/envs
                      /home/tanakaed/.conda/envs
      package cache : /home/tanakaed/anaconda3/pkgs
                      /home/tanakaed/.conda/pkgs
       channel URLs : https://repo.continuum.io/pkgs/main/linux-64
                      https://repo.continuum.io/pkgs/main/noarch
                      https://repo.continuum.io/pkgs/free/linux-64
                      https://repo.continuum.io/pkgs/free/noarch
                      https://repo.continuum.io/pkgs/r/linux-64
                      https://repo.continuum.io/pkgs/r/noarch
                      https://repo.continuum.io/pkgs/pro/linux-64
                      https://repo.continuum.io/pkgs/pro/noarch
        config file : None
         netrc file : None
       offline mode : False
         user-agent : conda/4.3.27 requests/2.18.4 CPython/3.6.2 Linux/4.10.0-38-generic debian/stretch/sid glibc/2.23    
            UID:GID : 1001:1002

tanakaed@triage-bot:$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

What did you do?

We have Python code that both publishes and pulls messages from Pub/Sub. We also have Python code that interfaces with Google Datastore and Google Logging. I don't know which of these is triggering this segmentation fault. My code runs fine for a while, but after ~60 minutes of running and processing some cases, a segfault is raised.
I ran my python script inside gdb and this is what I got:

Thread 9 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff054a700 (LWP 29398)]
gpr_ref_non_zero (r=0x0) at src/core/lib/support/sync.c:93
93 src/core/lib/support/sync.c: No such file or directory.
(gdb) backtrace
#0 gpr_ref_non_zero (r=0x0) at src/core/lib/support/sync.c:93
#1 0x00007ffff12c8365 in grpc_stream_ref (refcount=<optimized out>) at src/core/lib/transport/transport.c:50
#2 0x00007ffff12f3490 in send_security_metadata (batch=0x7fff8c0820f0, elem=0x7fff8c0821a0, exec_ctx=0x7ffff0549ec0)
at src/core/lib/security/transport/client_auth_filter.c:216
#3 on_host_checked (exec_ctx=exec_ctx@entry=0x7ffff0549ec0, arg=arg@entry=0x7fff8c0820f0, error=<optimized out>)
at src/core/lib/security/transport/client_auth_filter.c:231
#4 0x00007ffff12f396f in auth_start_transport_stream_op_batch (exec_ctx=0x7ffff0549ec0, elem=0x7fff8c0821a0, batch=0x7fff8c0820f0)
at src/core/lib/security/transport/client_auth_filter.c:316
#5 0x00007ffff1300f68 in waiting_for_pick_batches_resume (elem=<optimized out>, elem=<optimized out>, exec_ctx=0x7ffff0549ec0)
at src/core/ext/filters/client_channel/client_channel.c:953
#6 create_subchannel_call_locked (error=0x0, elem=<optimized out>, exec_ctx=0x7ffff0549ec0)
at src/core/ext/filters/client_channel/client_channel.c:1016
#7 pick_done_locked (exec_ctx=0x7ffff0549ec0, elem=<optimized out>, error=0x0) at src/core/ext/filters/client_channel/client_channel.c:1042
#8 0x00007ffff12932f3 in grpc_combiner_continue_exec_ctx (exec_ctx=exec_ctx@entry=0x7ffff0549ec0) at src/core/lib/iomgr/combiner.c:259
#9 0x00007ffff129bdf8 in grpc_exec_ctx_flush (exec_ctx=exec_ctx@entry=0x7ffff0549ec0) at src/core/lib/iomgr/exec_ctx.c:93
#10 0x00007ffff129c3c1 in run_closures (exec_ctx=0x7ffff0549ec0, list=...) at src/core/lib/iomgr/executor.c:81
#11 executor_thread (arg=arg@entry=0x5555565d3e00) at src/core/lib/iomgr/executor.c:181
#12 0x00007ffff1285c37 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.c:53
#13 0x00007ffff7bc16ba in start_thread (arg=0x7ffff054a700) at pthread_create.c:333
#14 0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)

What did you expect to see?

No segmentation fault
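A minimal sketch, using only the Python 3 standard library, for capturing the Python-side picture of a crash like the one above: the faulthandler module makes the interpreter dump the Python-level tracebacks of all threads when a fatal signal such as SIGSEGV arrives, which complements the native frames that gdb shows.

import faulthandler
import sys

# On SIGSEGV (and other fatal signals), print the Python traceback of
# every thread to stderr before the process dies; enable this at startup.
faulthandler.enable(file=sys.stderr, all_threads=True)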

@nathanielmanistaatgoogle
Member

Are you able to isolate the problem any further? Are you able to reproduce the issue in a narrower space of time (than one hour) and narrower circumstances (fewer libraries involved)?

@waichee

waichee commented Nov 19, 2017

Hi, I am trying out the ML Engine API using google-api-python-client==1.6.4 and am also seeing a segmentation fault after my program, which performs predictions continuously, has been running for a few minutes.

I'm seeing this on OS X and also on Ubuntu 14.04.5.

I'm using the sample code from the GCP docs:
https://cloud.google.com/ml-engine/docs/online-predict

The libraries I'm using:

autopep8==1.3.3
awscli==1.11.179
backports.weakref==1.0rc1
beautifulsoup4==4.5.3
bleach==1.5.0
boto==2.48.0
boto3==1.4.7
botocore==1.7.37
bz2file==0.98
certifi==2016.2.28
chardet==3.0.4
click==6.7
colorama==0.3.7
configparser==3.5.0
content-cues==0.1
docutils==0.13.1
enum34==1.1.6
flake8==3.3.0
Flask==0.12.2
funcsigs==1.0.2
futures==3.1.1
google-api-python-client==1.6.4
html5lib==0.9999999
httplib2==0.10.3
httpretty==0.8.10
idna==2.6
itsdangerous==0.24
Jinja2==2.10
jmespath==0.9.3
lxml==3.7.3
Markdown==2.6.9
MarkupSafe==1.0
mccabe==0.6.1
mock==2.0.0
moto==0.4.25
numpy==1.13.0
oauth2client==4.1.2
pandas==0.20.3
pbr==3.1.1
protobuf==3.5.0
py==1.5.2
pyasn1==0.2.3
pyasn1-modules==0.1.5
pycodestyle==2.3.1
pyflakes==1.5.0
pytest==3.2.3
python-dateutil==2.6.0
python-gnupg==0.3.8
pytz==2017.2
PyYAML==3.12
requests==2.18.4
retrying==1.3.3
rsa==3.4.2
s3fs==0.1.2
s3transfer==0.1.10
scikit-learn==0.18.2
scipy==0.19.1
six==1.10.0
smart-open==1.5.3
tensorflow==1.2.1
tensorflow-serving-api==0.6.0
uritemplate==3.0.0
urllib3==1.22
Werkzeug==0.12.2
xmltodict==0.11.0
python --version
Python 2.7.13 :: Continuum Analytics, Inc.

@scotloach

I am seeing the same segmentation fault.
Python 3.6.2

#0  gpr_ref (r=0x0) at src/core/lib/support/sync.cc:88
#1  gpr_ref_non_zero (r=0x0) at src/core/lib/support/sync.cc:93
#2  0x00007fbc20765228 in send_security_metadata (batch=0x56483e6436a0, elem=0x56483e643750, exec_ctx=0x7fbc1e77fea0) at src/core/lib/security/transport/client_auth_filter.cc:225
#3  on_host_checked (exec_ctx=exec_ctx@entry=0x7fbc1e77fea0, arg=arg@entry=0x56483e6436a0, error=<optimized out>) at src/core/lib/security/transport/client_auth_filter.cc:241
#4  0x00007fbc20765767 in auth_start_transport_stream_op_batch (exec_ctx=0x7fbc1e77fea0, elem=0x56483e643750, batch=0x56483e6436a0) at src/core/lib/security/transport/client_auth_filter.cc:328
#5  0x00007fbc20773c39 in waiting_for_pick_batches_resume (elem=<optimized out>, elem=<optimized out>, exec_ctx=0x7fbc1e77fea0) at src/core/ext/filters/client_channel/client_channel.cc:962
#6  create_subchannel_call_locked (error=0x0, elem=<optimized out>, exec_ctx=0x7fbc1e77fea0) at src/core/ext/filters/client_channel/client_channel.cc:1025
#7  pick_done_locked (exec_ctx=0x7fbc1e77fea0, elem=<optimized out>, error=0x0) at src/core/ext/filters/client_channel/client_channel.cc:1051
#8  0x00007fbc20703982 in grpc_combiner_continue_exec_ctx (exec_ctx=0x7fbc1e77fea0) at src/core/lib/iomgr/combiner.cc:260
#9  0x00007fbc2070d332 in grpc_exec_ctx_flush (exec_ctx=0x7fbc1e77fea0) at src/core/lib/iomgr/exec_ctx.cc:93
#10 0x00007fbc2071a115 in run_some_timers (exec_ctx=0x7fbc1e77fea0) at src/core/lib/iomgr/timer_manager.cc:129
#11 timer_main_loop (exec_ctx=0x7fbc1e77fea0) at src/core/lib/iomgr/timer_manager.cc:231
#12 timer_thread (completed_thread_ptr=completed_thread_ptr@entry=0x56483e34ea90) at src/core/lib/iomgr/timer_manager.cc:279
#13 0x00007fbc206f64f7 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.cc:53
#14 0x00007fbc2769e36d in start_thread () from /lib64/libpthread.so.0
#15 0x00007fbc26cb9bbf in clone () from /lib64/libc.so.6

@nathanielmanistaatgoogle
Member

@scotloach: have you any insight into how to simply reproduce the segmentation fault?

@scotloach

I haven't found a way to "simply" reproduce it. I can reproduce it reliably in my pretty complex environment.

@seanlaff

I think this might be a multithreading issue.
I'm on Python 3.6.3 and am seeing this issue when making gRPC requests across multiple threads. If I run my program with a single thread, the error does not happen. If I execute my code with a decent number of threads (8), I can get the error almost instantly.

My stack trace is similar:

#0  gpr_ref_non_zero (r=0x0) at src/core/lib/support/sync.cc:93
#1  0x00007ffff198dd95 in grpc_stream_ref (refcount=<optimized out>) at src/core/lib/transport/transport.cc:48
#2  0x00007ffff19b9fd0 in send_security_metadata (batch=0x555555cdc8a0, elem=0x555555cdc950, exec_ctx=0x7ffff7e328b0)
    at src/core/lib/security/transport/client_auth_filter.cc:225
#3  on_host_checked (exec_ctx=exec_ctx@entry=0x7ffff7e328b0, arg=arg@entry=0x555555cdc8a0, error=<optimized out>)
    at src/core/lib/security/transport/client_auth_filter.cc:241
#4  0x00007ffff19ba428 in auth_start_transport_stream_op_batch (exec_ctx=0x7ffff7e328b0, elem=0x555555cdc950, batch=0x555555cdc8a0)
    at src/core/lib/security/transport/client_auth_filter.cc:328
#5  0x00007ffff19c8be1 in waiting_for_pick_batches_resume (elem=<optimized out>, elem=<optimized out>, exec_ctx=0x7ffff7e328b0)
    at src/core/ext/filters/client_channel/client_channel.cc:962
#6  create_subchannel_call_locked (error=0x0, elem=<optimized out>, exec_ctx=0x7ffff7e328b0)
    at src/core/ext/filters/client_channel/client_channel.cc:1025
#7  pick_done_locked (exec_ctx=0x7ffff7e328b0, elem=<optimized out>, error=<optimized out>)
    at src/core/ext/filters/client_channel/client_channel.cc:1051
#8  0x00007ffff194f0fc in grpc_combiner_continue_exec_ctx (exec_ctx=exec_ctx@entry=0x7ffff7e328b0) at src/core/lib/iomgr/combiner.cc:260
#9  0x00007ffff1961ec2 in grpc_exec_ctx_flush (exec_ctx=exec_ctx@entry=0x7ffff7e328b0) at src/core/lib/iomgr/exec_ctx.cc:93
#10 0x00007ffff1962aa5 in run_closures (exec_ctx=0x7ffff7e328b0, list=...) at src/core/lib/iomgr/executor.cc:80
#11 executor_thread (arg=arg@entry=0x555555ca20a0) at src/core/lib/iomgr/executor.cc:180
#12 0x00007ffff19420e7 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.cc:53
#13 0x00007ffff7dc3b57 in ?? () from /lib/ld-musl-x86_64.so.1
#14 0x0000000000000000 in ?? ()

I'm making thousands of requests in a short window.
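The shape of that workload, as a minimal sketch (StandInClient here is a hypothetical placeholder; in the failing setup it would be any gRPC-backed client shared across the threads):

import time
from concurrent.futures import ThreadPoolExecutor

class StandInClient(object):
    """Hypothetical stand-in for a shared gRPC-backed client."""
    def do_rpc(self):
        time.sleep(0.001)  # stands in for a unary RPC

client = StandInClient()  # one client object shared by all threads

def worker(n_requests):
    for _ in range(n_requests):
        client.do_rpc()

# ~8 threads, each firing a burst of requests through the shared client:
with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(8):
        pool.submit(worker, 1000)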

@t00n

t00n commented Dec 29, 2017

I have the same problem and a similar stack trace. I have been able to reproduce the bug with a short snippet (see below).
I am using Arch Linux, Python 3.6.4 and the following libraries:

gapic-google-cloud-datastore-v1==0.15.3
google-api-core==0.1.3
google-auth==1.2.1
google-cloud-core==0.28.0
google-cloud-datastore==1.4.0
google-cloud-pubsub==0.30.1
google-gax==0.15.16
googleapis-common-protos==1.5.3
grpc-google-iam-v1==0.11.4
proto-google-cloud-datastore-v1==0.90.4

I have run the following code 39 times and it segfaulted every time, with an average running time of 47 seconds. The subscription I am pulling from contains a few hundred messages.
I think the bug is linked to the number of requests to the Datastore, because when I remove datastore_client.query(kind='TestMessage'), it segfaults far less often.
I have tried to reproduce the bug without using Pub/Sub, by spawning 50 threads and querying the Datastore, but it never segfaulted.

import os
import random
from datetime import datetime
from google.cloud import pubsub, datastore

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'some-file.json'
subscriber_client = pubsub.SubscriberClient()
datastore_client = datastore.Client()
KEYS = [i for i in range(100)]

def consume(message):
    # For every pulled message, write one entity to the Datastore...
    data = {}
    data['key1'] = random.choice(KEYS)
    data['time'] = datetime.now().isoformat()
    key = datastore_client.key('Test', str(data['key1']), 'TestMessage', str(data['time']))
    entity = datastore.Entity(key)
    entity.update(data)
    datastore_client.put(entity)

    # ...and then read back every TestMessage entity (removing this query
    # makes the segfault far less frequent, as noted above).
    query = datastore_client.query(kind='TestMessage')
    list(query.fetch())

subscription = subscriber_client.subscription_path("some-project", "some-subscription")
subscriber_client.subscribe(subscription, consume)

# Keep the main thread alive while the subscriber's background threads run.
while True:
    pass

@krelst

krelst commented Jan 15, 2018

We are seeing the exact same segmentation issue when using the google-cloud-pubsub library. However, we only see it when running our script on Kubernetes; locally it runs stably for > 15 minutes (we haven't tried running it longer). On k8s it fails after about 45 s. We use Python 2.7.13, run it in a debian:jessie Docker container, with google-cloud-pubsub==0.30.0 and grpcio==1.8.3.

@nghiaho12

nghiaho12 commented Jan 17, 2018

I'm also seeing this exact problem with 10 worker threads writing to a Google Spanner table, which Google recommends as the way to write bulk data.

After some testing, I think my specific problem was that I was using the same Spanner instance in each of the threads. Having each thread create its own instance seems to have solved it.
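A minimal sketch of that workaround, assuming the google-cloud-spanner client API and hypothetical instance/database IDs: each worker thread lazily builds its own handle via threading.local instead of sharing one object across threads.

import threading
from google.cloud import spanner

_local = threading.local()

def get_database():
    # Build the client/instance/database handle once per thread rather
    # than sharing a single instance across all worker threads.
    if not hasattr(_local, "database"):
        client = spanner.Client()
        instance = client.instance("my-instance")            # hypothetical ID
        _local.database = instance.database("my-database")   # hypothetical ID
    return _local.database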

@nathanielmanistaatgoogle
Member

@t00n: is that as minimal as you've been able to make your reproduction? In particular, is authentication required? Are mutative operations required, or can the problem be observed with only read-only operations? How much server-side setup is required to reproduce the problem?

@nathanielmanistaatgoogle
Member

@t00n: are you able to run with GRPC_TRACE=all and GRPC_VERBOSITY=debug? Does anything interesting leap out of the output generated by running with those?
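For reference, the C core reads these variables when it initializes, so they have to be in the environment before grpc is first loaded; a minimal sketch of doing that from Python rather than from the shell:

import os

# Must run before grpc (or any library that imports it) is loaded,
# because the core reads these settings at initialization time.
os.environ["GRPC_VERBOSITY"] = "debug"
os.environ["GRPC_TRACE"] = "all"

import grpc  # noqa: E402  (deliberately imported after the env setup)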

@mehrdada
Member

mehrdada commented Jan 17, 2018

Can anyone try reproducing this with 1.8.4? A bugfix went into 1.8.4 that might have resolved this.

(Two related segfault issues, filed not against Python but against core, that might be related (or not), linked here to keep them accessible: #14001, #14040.)

@subhayu89

Is there any update? In issue #14040 I am also using multiple threads; the call gets blocked and there is an immediate core dump on aarch64 Linux, while the same code runs fine on the x86 architecture.

@t00n

t00n commented Jan 24, 2018

@nathanielmanistaatgoogle I managed to reduce the code to this:

from google.cloud import pubsub, datastore

subscriber_client = pubsub.SubscriberClient()
datastore_client = datastore.Client(project='my-project')


def consume(message):
    # One Datastore write per pulled Pub/Sub message.
    key = datastore_client.key('Test', 1)
    entity = datastore.Entity(key)
    datastore_client.put(entity)


subscription = subscriber_client.subscription_path("my-project", "testsegfault")
subscriber_client.subscribe(subscription, consume)

# Keep the main thread alive while the subscriber's background threads run.
while True:
    pass

It needs authentication to access Pub/Sub and to push entities into the Datastore. If I remove either of those two, it does not segfault anymore.

Write operations in the Datastore are required.

You need a Pub/Sub subscription containing ~100k unacked messages (I tried with 1,000 and 10,000 and it did not segfault; see the seeding sketch below). You need an (empty) Datastore.
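Seeding such a backlog, as a sketch assuming the google-cloud-pubsub publisher API of the same era and placeholder project/topic names:

from google.cloud import pubsub

publisher = pubsub.PublisherClient()
topic = publisher.topic_path("some-project", "some-topic")  # placeholder names

# Publish ~100k small messages so the subscription accumulates a large
# unacked backlog before the repro script starts pulling.
publish_futures = [publisher.publish(topic, b"x") for _ in range(100000)]
for f in publish_futures:
    f.result()  # block until each publish is confirmed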

I ran the script with GRPC_TRACE=all GRPC_VERBOSITY=debug but I don't know what to look for in the output. Any directions you can give me?

EDIT: I am using grpcio 1.8.4 and am still seeing the bug.

@tcwalther

I've just run into this issue myself. It does seem to be a multithreading issue; it happens to me when enqueuing too many futures in Pub/Sub.

Running the script in GDB shows that lots of new threads start and quit. Here are the last few lines before the crash:

[New Thread 0x7ffe3b86b700 (LWP 4006)]
[Thread 0x7ffe3b86b700 (LWP 4006) exited]
[New Thread 0x7ffdb856e700 (LWP 4007)]
[Thread 0x7ffdb856e700 (LWP 4007) exited]
[New Thread 0x7ffe3b86b700 (LWP 4008)]
[Thread 0x7ffe3b86b700 (LWP 4008) exited]
[New Thread 0x7ffdb856e700 (LWP 4009)]
[Thread 0x7ffdb856e700 (LWP 4009) exited]
[New Thread 0x7ffe3b86b700 (LWP 4010)]
[Thread 0x7ffe3b86b700 (LWP 4010) exited]
[New Thread 0x7ffdb856e700 (LWP 4011)]
[Thread 0x7ffdb856e700 (LWP 4011) exited]
[New Thread 0x7ffe3b86b700 (LWP 4012)]
[Thread 0x7ffe3b86b700 (LWP 4012) exited]
[New Thread 0x7ffdb856e700 (LWP 4013)]
[Thread 0x7ffdb856e700 (LWP 4013) exited]
[New Thread 0x7ffe3b86b700 (LWP 4014)]
[Thread 0x7ffe3b86b700 (LWP 4014) exited]
[New Thread 0x7ffdb856e700 (LWP 4015)]
[Thread 0x7ffdb856e700 (LWP 4015) exited]
[New Thread 0x7ffe3b86b700 (LWP 4016)]
[Thread 0x7ffe3b86b700 (LWP 4016) exited]
[New Thread 0x7ffdb856e700 (LWP 4017)]
[Thread 0x7ffdb856e700 (LWP 4017) exited]
[New Thread 0x7ffe3b86b700 (LWP 4018)]
[Thread 0x7ffe3b86b700 (LWP 4018) exited]
[New Thread 0x7ffdb856e700 (LWP 4019)]
[Thread 0x7ffdb856e700 (LWP 4019) exited]
[New Thread 0x7ffe3b86b700 (LWP 4020)]
[Thread 0x7ffe3b86b700 (LWP 4020) exited]
[New Thread 0x7ffdb856e700 (LWP 4021)]
[Thread 0x7ffdb856e700 (LWP 4021) exited]
[New Thread 0x7ffe3b86b700 (LWP 4022)]
[Thread 0x7ffe3b86b700 (LWP 4022) exited]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffcd2d6700 (LWP 1738)]
gpr_ref_non_zero (r=0x0) at src/core/lib/support/sync.cc:93
93      src/core/lib/support/sync.cc: No such file or directory.
(gdb) bt
#0  gpr_ref_non_zero (r=0x0) at src/core/lib/support/sync.cc:93
#1  0x00007fffec112459 in grpc_connected_subchannel_ref (c=0x0)
    at src/core/ext/filters/client_channel/subchannel.cc:169
#2  0x00007fffec125562 in pf_pick_locked (exec_ctx=<optimized out>, pol=0x7ffd05281380, pick_args=<optimized out>,
    target=0x7ffdb997e5b0, context=<optimized out>, user_data=<optimized out>, on_complete=0x7ffdb997e560)
    at src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc:182
#3  0x00007fffec10a796 in pick_callback_start_locked (exec_ctx=exec_ctx@entry=0x7fffcd2d4670,
    elem=elem@entry=0x7ffdb997e430) at src/core/ext/filters/client_channel/client_channel.cc:1147
#4  0x00007fffec10b472 in start_pick_locked (exec_ctx=0x7fffcd2d4670, arg=0x7ffdb997e430, ignored=<optimized out>)
    at src/core/ext/filters/client_channel/client_channel.cc:1306
#5  0x00007fffec09afc2 in grpc_combiner_continue_exec_ctx (exec_ctx=0x7fffcd2d4670)
    at src/core/lib/iomgr/combiner.cc:260
#6  0x00007fffec0a4982 in grpc_exec_ctx_flush (exec_ctx=0x7fffcd2d4670) at src/core/lib/iomgr/exec_ctx.cc:93
#7  0x00007fffec0a5489 in run_closures (exec_ctx=0x7fffcd2d4670, list=...) at src/core/lib/iomgr/executor.cc:80
#8  executor_thread (arg=arg@entry=0x7fffd8987800) at src/core/lib/iomgr/executor.cc:180
#9  0x00007fffec08db37 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.cc:53
#10 0x00007ffff7752184 in start_thread (arg=0x7fffcd2d6700) at pthread_create.c:312
#11 0x00007ffff747f03d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

@mehrdada
Member

mehrdada commented Jan 30, 2018

I just talked to the core team about this. They believe there have been vast changes to the refcount logic recently, and they strongly suspect that this bug is fixed in the 1.9.0 branch. I just pushed 1.9.0rc3 to PyPI, so please let us know if you are still seeing it in that RC.

@subhayu89

Is the same done for C++ also?

@mehrdada
Member

mehrdada commented Jan 30, 2018

@subhayu89 Yes, the changes are in the C core, and that layer is shared across languages (except Java and Go).

@krelst

krelst commented Jan 30, 2018

Hi @mehrdada, I tried the grpcio==1.9.0rc3 package but am still seeing the same segmentation fault. Our script runs inside a debian:jessie Docker container on Google Kubernetes Engine, with the following Python environment:

arrow==0.12.0
asn1crypto==0.24.0
backports.functools-lru-cache==1.2.1
cachetools==2.0.1
certifi==2017.11.5
cffi==1.11.2
chardet==3.0.4
contextlib2==0.5.5
cryptography==2.1.4
dill==0.2.7.1
enum34==1.1.6
funcsigs==1.0.2
future==0.16.0
futures==3.2.0
gapic-google-cloud-logging-v2==0.91.3
google-api-core==0.1.3
google-api-python-client==1.6.4
google-auth==1.2.1
google-auth-httplib2==0.0.3
google-cloud-bigquery==0.28.0
google-cloud-bigtable==0.28.0
google-cloud-core==0.28.0
google-cloud-logging==1.4.0
google-cloud-pubsub==0.30.0
google-cloud-storage==1.6.0
google-gax==0.15.16
google-resumable-media==0.3.1
googleapis-common-protos==1.5.3
grpc-google-iam-v1==0.11.4
grpcio==1.9.0rc3
httplib2==0.10.3
idna==2.6
ipaddress==1.0.19
mock==2.0.0
namedlist==1.7
nose2==0.6.5
oauth2client==3.0.0
oi-db-config==1.8
oi-vault==1.2.4
pbr==3.1.1
ply==3.8
proto-google-cloud-logging-v2==0.91.3
protobuf==3.5.0.post1
psutil==5.4.3
pyasn1==0.4.2
pyasn1-modules==0.2.1
pycountry==17.9.23
pycparser==2.18
PyMySQL==0.6.7
python-dateutil==2.6.1
pytz==2017.3
raven==6.2.1
requests==2.18.4
retrying==1.3.3
rsa==3.4.2
six==1.11.0
uritemplate==3.0.0
urllib3==1.22

In our script we both pull from and push to Pub/Sub, and write to Bigtable and BigQuery. So we have multiple different gcloud clients running at the same time, which might be relevant in light of @t00n's example script.

@subhayu89

I0130 16:00:54.914150633 21085 ev_epoll1_linux.cc:114] grpc epoll fd: 6
D0130 16:00:54.914242217 21085 ev_posix.cc:128] Using polling engine: epoll1
D0130 16:00:54.914360617 21085 dns_resolver.cc:294] Using native dns resolver
Segmentation fault (core dumped)

Still facing the same issue with v1.9.0rc3 as well.

@mehrdada
Member

Thanks, I have managed to create a new pubsub and datastore setup and repro this issue.

@mehrdada
Member

mehrdada commented Jan 30, 2018

import time
from concurrent import futures
from google.cloud import datastore

datastore_client = datastore.Client(project='grpc-testing')

def consume(message):
  # One Datastore write per task, mirroring t00n's per-message callback.
  key = datastore_client.key('mmx-repro-13327-datastore-kind', 1)
  entity = datastore.Entity(key)
  datastore_client.put(entity)

# A plain thread pool stands in for the Pub/Sub subscriber's worker threads.
with futures.ThreadPoolExecutor(max_workers=20) as executor:
  for i in range(1,100000):
    executor.submit(consume, i)

while True:
  time.sleep(60)

So far I have managed to eliminate the dependency on Pub/Sub by replacing it with a thread pool. Does anyone happen to know of an older version of gRPC on which they can't reproduce this?

@subhayu89

Yes, it also crashes for C++ with v1.6.0, v1.7.1, and v1.8.4.

@mehrdada
Member

Do you have a C++ repro?! That would be really helpful!

@subhayu89

subhayu89 commented Jan 31, 2018

version
v1.9.0rc3

grpc logs
I0130 16:00:54.914150633 21085 ev_epoll1_linux.cc:114] grpc epoll fd: 6
D0130 16:00:54.914242217 21085 ev_posix.cc:128] Using polling engine: epoll1
D0130 16:00:54.914360617 21085 dns_resolver.cc:294] Using native dns resolver

gdb
#0 0x0000007fa19273b8 in gpr_unref () from /usr/local/lib/libgrpc.so.5
No symbol table info available.
#1 0x0000007fa18ce80c in grpc_cq_internal_unref(grpc_completion_queue*) ()
from /usr/local/lib/libgrpc.so.5
No symbol table info available.
#2 0x0000007fa18b3508 in grpc_core::ExecCtx::Flush() ()
from /usr/local/lib/libgrpc.so.5
No symbol table info available.
#3 0x0000007fa18cf79c in grpc_completion_queue_shutdown ()
from /usr/local/lib/libgrpc.so.5
No symbol table info available.
#4 0x0000007fa18cf84c in grpc_completion_queue_destroy ()
from /usr/local/lib/libgrpc.so.5
No symbol table info available.
#5 0x00000000007f1f80 in grpc::CompletionQueue::~CompletionQueue() ()
No symbol table info available.
#6 0x000000000083049c in grpc::internal::BlockingUnaryCallImpl<cc::rpc::StatusInfoNotification, cc::rpc::StatusInfoResponse>::BlockingUnaryCallImpl(grpc::ChannelInterface*, grpc::internal::RpcMethod const&, grpc::ClientContext*, cc::rpc::StatusInfoNotification const&, cc::rpc::StatusInfoResponse*) ()
No symbol table info available.
#7 0x000000000082f5b0 in grpc::Status grpc::internal::BlockingUnaryCall<cc::rpc::StatusInfoNotification, cc::rpc::StatusInfoResponse>(grpc::ChannelInterface*,
grpc::internal::RpcMethod const&, grpc::ClientContext*, cc::rpc::StatusInfoNotification const&, cc::rpc::StatusInfoResponse*) ()
No symbol table info available.
#8 0x000000000082e9b0 in cc::rpc::CCStatusInfo::Stub::SendStatusInfo(grpc::ClientContext*, cc::rpc::StatusInfoNotification const&, cc::rpc::StatusInfoResponse*) ()

@subhayu89

For Java and Go, is it not reproducible?

@mehrdada
Member

@subhayu89 I am confused; I think you are talking about the other issue you filed? I don't think it's the same issue at all. Let's keep the discussion in this thread centered on the original post's issue.
Java and Go are implemented in stacks written in their own languages; they are completely separate implementations, unlike Python.

@subhayu89

OK, let's keep the discussions separate, but I am getting the refcount issue with C++ as well, @mehrdada, and since you referenced that issue from this one, I replied. Anyway, let's keep this issue to Python. Thanks for your reply.

@mehrdada
Member

@subhayu89 Sorry for the confusion. To be clear, if there is a C++ bug with the same symptom (segfault in client_auth_filter.cc), please do bring it up here.

@kpayson64
Contributor

Here are some valgrind outputs. They show invalid reads in send_security_metadata and plugin_get_request_metadata landing inside a call arena that was already freed via grpc_call_unref, i.e. a use-after-free:

==148285== 
==148285== Thread 7 gpr_executor:
==148285== Invalid read of size 8
==148285==    at 0xC1B1502: plugin_get_request_metadata(grpc_call_credentials*, grpc_polling_entity*, grpc_auth_metadata_context, grpc_credentials_mdelem_array*, grpc_closure*, grpc_error**) (plugin_credentials.cc:201)
==148285==    by 0xC1B305B: send_security_metadata (client_auth_filter.cc:197)
==148285==    by 0xC1B305B: on_host_checked(void*, grpc_error*) (client_auth_filter.cc:221)
==148285==    by 0xC1B3709: auth_start_transport_stream_op_batch(grpc_call_element*, grpc_transport_stream_op_batch*) (client_auth_filter.cc:306)
==148285==    by 0xC1C122C: waiting_for_pick_batches_resume (client_channel.cc:965)
==148285==    by 0xC1C122C: create_subchannel_call_locked (client_channel.cc:1028)
==148285==    by 0xC1C122C: pick_done_locked(grpc_call_element*, grpc_error*) (client_channel.cc:1053)
==148285==    by 0xC154593: grpc_combiner_continue_exec_ctx() (combiner.cc:261)
==148285==    by 0xC161A8E: grpc_core::ExecCtx::Flush() (exec_ctx.cc:128)
==148285==    by 0xC161B5D: run_closures(grpc_closure_list) [clone .isra.5] (executor.cc:80)
==148285==    by 0xC161CFA: executor_thread(void*) (executor.cc:177)
==148285==    by 0xC148D19: thread_body(void*) (thd_posix.cc:67)
==148285==    by 0x4E3F183: start_thread (pthread_create.c:312)
==148285==    by 0x515303C: clone (clone.S:111)
==148285==  Address 0x1b176dc0 is 4,288 bytes inside a block of size 21,047 free'd
==148285==    at 0x4C2BDEC: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==148285==    by 0xC1455FF: gpr_arena_destroy(gpr_arena*) (arena.cc:64)
==148285==    by 0xC17879D: release_call(void*, grpc_error*) (call.cc:508)
==148285==    by 0xC16186E: exec_ctx_run(grpc_closure*, grpc_error*) (exec_ctx.cc:38)
==148285==    by 0xC161A76: grpc_core::ExecCtx::Flush() (exec_ctx.cc:125)
==148285==    by 0xC17C47A: ~ExecCtx (exec_ctx.h:85)
==148285==    by 0xC17C47A: grpc_call_unref (call.cc:598)
==148285==    by 0xC110906: __pyx_pf_4grpc_7_cython_6cygrpc_4Call_14__dealloc__ (cygrpc.c:6169)
==148285==    by 0xC110906: __pyx_pw_4grpc_7_cython_6cygrpc_4Call_15__dealloc__ (cygrpc.c:6141)
==148285==    by 0xC110906: __pyx_tp_dealloc_4grpc_7_cython_6cygrpc_Call (cygrpc.c:29590)
==148285==    by 0x5274C4: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==148285==    by 0x55FA14: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==148285==    by 0x4C2EBC: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==148285==    by 0x50B7F9: PyObject_Call (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==148285==    by 0x4B1EB9: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==148285== 
==148285== Invalid read of size 8
==148285==    at 0xC1B3064: send_security_metadata (client_auth_filter.cc:204)
==148285==    by 0xC1B3064: on_host_checked(void*, grpc_error*) (client_auth_filter.cc:221)
==148285==    by 0xC1B3709: auth_start_transport_stream_op_batch(grpc_call_element*, grpc_transport_stream_op_batch*) (client_auth_filter.cc:306)
==148285==    by 0xC1C122C: waiting_for_pick_batches_resume (client_channel.cc:965)
==148285==    by 0xC1C122C: create_subchannel_call_locked (client_channel.cc:1028)
==148285==    by 0xC1C122C: pick_done_locked(grpc_call_element*, grpc_error*) (client_channel.cc:1053)
==148285==    by 0xC154593: grpc_combiner_continue_exec_ctx() (combiner.cc:261)
==148285==    by 0xC161A8E: grpc_core::ExecCtx::Flush() (exec_ctx.cc:128)
==148285==    by 0xC161B5D: run_closures(grpc_closure_list) [clone .isra.5] (executor.cc:80)
==148285==    by 0xC161CFA: executor_thread(void*) (executor.cc:177)
==148285==    by 0xC148D19: thread_body(void*) (thd_posix.cc:67)
==148285==    by 0x4E3F183: start_thread (pthread_create.c:312)
==148285==    by 0x515303C: clone (clone.S:111)
==148285==  Address 0x1b176dc0 is 4,288 bytes inside a block of size 21,047 free'd
==148285==    at 0x4C2BDEC: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==148285==    by 0xC1455FF: gpr_arena_destroy(gpr_arena*) (arena.cc:64)
==148285==    by 0xC17879D: release_call(void*, grpc_error*) (call.cc:508)
==148285==    by 0xC16186E: exec_ctx_run(grpc_closure*, grpc_error*) (exec_ctx.cc:38)
==148285==    by 0xC161A76: grpc_core::ExecCtx::Flush() (exec_ctx.cc:125)
==148285==    by 0xC17C47A: ~ExecCtx (exec_ctx.h:85)
==148285==    by 0xC17C47A: grpc_call_unref (call.cc:598)
==148285==    by 0xC110906: __pyx_pf_4grpc_7_cython_6cygrpc_4Call_14__dealloc__ (cygrpc.c:6169)
==148285==    by 0xC110906: __pyx_pw_4grpc_7_cython_6cygrpc_4Call_15__dealloc__ (cygrpc.c:6141)
==148285==    by 0xC110906: __pyx_tp_dealloc_4grpc_7_cython_6cygrpc_Call (cygrpc.c:29590)
==148285==    by 0x5274C4: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==148285==    by 0x55FA14: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==148285==    by 0x4C2EBC: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==148285==    by 0x50B7F9: PyObject_Call (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==148285==    by 0x4B1EB9: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)

==188462==
==188462== Thread 3 grpc_executor:
==188462== Invalid read of size 8
==188462==    at 0xC1B1502: plugin_get_request_metadata(grpc_call_credentials*, grpc_polling_entity*, grpc_auth_metadata_context, grpc_credentials_mdelem_array*, grpc_closure*, grpc_error**) (plugin_credentials.cc:201)
==188462==    by 0xC1B305B: send_security_metadata (client_auth_filter.cc:197)
==188462==    by 0xC1B305B: on_host_checked(void*, grpc_error*) (client_auth_filter.cc:221)
==188462==    by 0xC1B3709: auth_start_transport_stream_op_batch(grpc_call_element*, grpc_transport_stream_op_batch*) (client_auth_filter.cc:306)
==188462==    by 0xC1C11EC: waiting_for_pick_batches_resume (client_channel.cc:965)
==188462==    by 0xC1C11EC: create_subchannel_call_locked (client_channel.cc:1025)
==188462==    by 0xC1C11EC: pick_done_locked(grpc_call_element*, grpc_error*) (client_channel.cc:1050)
==188462==    by 0xC154593: grpc_combiner_continue_exec_ctx() (combiner.cc:261)
==188462==    by 0xC161A8E: grpc_core::ExecCtx::Flush() (exec_ctx.cc:128)
==188462==    by 0xC161B5D: run_closures(grpc_closure_list) [clone .isra.5] (executor.cc:80)
==188462==    by 0xC161CFA: executor_thread(void*) (executor.cc:177)
==188462==    by 0xC148D19: thread_body(void*) (thd_posix.cc:67)
==188462==    by 0x4E3F183: start_thread (pthread_create.c:312)
==188462==    by 0x515303C: clone (clone.S:111)
==188462==  Address 0x2cf52e50 is 4,384 bytes inside a block of size 20,535 free'd
==188462==    at 0x4C2BDEC: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==188462==    by 0xC1455FF: gpr_arena_destroy(gpr_arena*) (arena.cc:64)
==188462==    by 0xC17879D: release_call(void*, grpc_error*) (call.cc:508)
==188462==    by 0xC16186E: exec_ctx_run(grpc_closure*, grpc_error*) (exec_ctx.cc:38)
==188462==    by 0xC161A76: grpc_core::ExecCtx::Flush() (exec_ctx.cc:125)
==188462==    by 0xC17C47A: ~ExecCtx (exec_ctx.h:85)
==188462==    by 0xC17C47A: grpc_call_unref (call.cc:598)
==188462==    by 0xC110906: __pyx_pf_4grpc_7_cython_6cygrpc_4Call_14__dealloc__ (cygrpc.c:6169)
==188462==    by 0xC110906: __pyx_pw_4grpc_7_cython_6cygrpc_4Call_15__dealloc__ (cygrpc.c:6141)
==188462==    by 0xC110906: __pyx_tp_dealloc_4grpc_7_cython_6cygrpc_Call (cygrpc.c:29590)
==188462==    by 0x5274C4: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x55FA14: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x4C2EBC: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x50B7F9: PyObject_Call (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x4B1EB9: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462== 
==188462== Invalid read of size 8
==188462==    at 0xC1B3064: send_security_metadata (client_auth_filter.cc:204)
==188462==    by 0xC1B3064: on_host_checked(void*, grpc_error*) (client_auth_filter.cc:221)
==188462==    by 0xC1B3709: auth_start_transport_stream_op_batch(grpc_call_element*, grpc_transport_stream_op_batch*) (client_auth_filter.cc:306)
==188462==    by 0xC1C11EC: waiting_for_pick_batches_resume (client_channel.cc:965)
==188462==    by 0xC1C11EC: create_subchannel_call_locked (client_channel.cc:1025)
==188462==    by 0xC1C11EC: pick_done_locked(grpc_call_element*, grpc_error*) (client_channel.cc:1050)
==188462==    by 0xC154593: grpc_combiner_continue_exec_ctx() (combiner.cc:261)
==188462==    by 0xC161A8E: grpc_core::ExecCtx::Flush() (exec_ctx.cc:128)
==188462==    by 0xC161B5D: run_closures(grpc_closure_list) [clone .isra.5] (executor.cc:80)
==188462==    by 0xC161CFA: executor_thread(void*) (executor.cc:177)
==188462==    by 0xC148D19: thread_body(void*) (thd_posix.cc:67)
==188462==    by 0x4E3F183: start_thread (pthread_create.c:312)
==188462==    by 0x515303C: clone (clone.S:111)
==188462==  Address 0x2cf52e50 is 4,384 bytes inside a block of size 20,535 free'd
==188462==    at 0x4C2BDEC: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==188462==    by 0xC1455FF: gpr_arena_destroy(gpr_arena*) (arena.cc:64)
==188462==    by 0xC17879D: release_call(void*, grpc_error*) (call.cc:508)
==188462==    by 0xC16186E: exec_ctx_run(grpc_closure*, grpc_error*) (exec_ctx.cc:38)
==188462==    by 0xC161A76: grpc_core::ExecCtx::Flush() (exec_ctx.cc:125)
==188462==    by 0xC17C47A: ~ExecCtx (exec_ctx.h:85)
==188462==    by 0xC17C47A: grpc_call_unref (call.cc:598)
==188462==    by 0xC110906: __pyx_pf_4grpc_7_cython_6cygrpc_4Call_14__dealloc__ (cygrpc.c:6169)
==188462==    by 0xC110906: __pyx_pw_4grpc_7_cython_6cygrpc_4Call_15__dealloc__ (cygrpc.c:6141)
==188462==    by 0xC110906: __pyx_tp_dealloc_4grpc_7_cython_6cygrpc_Call (cygrpc.c:29590)
==188462==    by 0x5274C4: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x55FA14: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x4C2EBC: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x50B7F9: PyObject_Call (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x4B1EB9: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462== 
==188462== Invalid read of size 8
==188462==    at 0xC148705: gpr_ref_non_zero (sync.cc:90)
==188462==    by 0xC1B307B: send_security_metadata (client_auth_filter.cc:206)
==188462==    by 0xC1B307B: on_host_checked(void*, grpc_error*) (client_auth_filter.cc:221)
==188462==    by 0xC1B3709: auth_start_transport_stream_op_batch(grpc_call_element*, grpc_transport_stream_op_batch*) (client_auth_filter.cc:306)
==188462==    by 0xC1C11EC: waiting_for_pick_batches_resume (client_channel.cc:965)
==188462==    by 0xC1C11EC: create_subchannel_call_locked (client_channel.cc:1025)
==188462==    by 0xC1C11EC: pick_done_locked(grpc_call_element*, grpc_error*) (client_channel.cc:1050)
==188462==    by 0xC154593: grpc_combiner_continue_exec_ctx() (combiner.cc:261)
==188462==    by 0xC161A8E: grpc_core::ExecCtx::Flush() (exec_ctx.cc:128)
==188462==    by 0xC161B5D: run_closures(grpc_closure_list) [clone .isra.5] (executor.cc:80)
==188462==    by 0xC161CFA: executor_thread(void*) (executor.cc:177)
==188462==    by 0xC148D19: thread_body(void*) (thd_posix.cc:67)
==188462==    by 0x4E3F183: start_thread (pthread_create.c:312)
==188462==    by 0x515303C: clone (clone.S:111)
==188462==  Address 0x2cf52d50 is 4,128 bytes inside a block of size 20,535 free'd
==188462==    at 0x4C2BDEC: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==188462==    by 0xC1455FF: gpr_arena_destroy(gpr_arena*) (arena.cc:64)
==188462==    by 0xC17879D: release_call(void*, grpc_error*) (call.cc:508)
==188462==    by 0xC16186E: exec_ctx_run(grpc_closure*, grpc_error*) (exec_ctx.cc:38)
==188462==    by 0xC161A76: grpc_core::ExecCtx::Flush() (exec_ctx.cc:125)
==188462==    by 0xC17C47A: ~ExecCtx (exec_ctx.h:85)
==188462==    by 0xC17C47A: grpc_call_unref (call.cc:598)
==188462==    by 0xC110906: __pyx_pf_4grpc_7_cython_6cygrpc_4Call_14__dealloc__ (cygrpc.c:6169)
==188462==    by 0xC110906: __pyx_pw_4grpc_7_cython_6cygrpc_4Call_15__dealloc__ (cygrpc.c:6141)
==188462==    by 0xC110906: __pyx_tp_dealloc_4grpc_7_cython_6cygrpc_Call (cygrpc.c:29590)
==188462==    by 0x5274C4: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x55FA14: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x4C2EBC: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x50B7F9: PyObject_Call (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)
==188462==    by 0x4B1EB9: ??? (in /usr/local/google/home/kpayson/grpc/py34/bin/python3.4)

@kpayson64
Contributor

We've cut a patch release, 1.9.1, containing the fix. Please reopen if the issue is not resolved.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 29, 2018