Segmentation fault in python google cloud libraries #13327
Comments
Are you able to isolate the problem any further? Can you reproduce the issue in a narrower window of time (less than one hour) and under narrower circumstances (fewer libraries involved)?
Hi, I am trying out the ML Engine API using google-api-python-client==1.6.4 and I am seeing a segmentation fault too, after running a program that performs predictions continuously for a few minutes. I get the following error on OS X and also on Ubuntu 14.04.5. I am using the sample code from the GCP docs. The libraries I am using:
I am seeing the same segmentation fault.
@scotloach: have you any insight into how to simply reproduce the segmentation fault?
I haven't found a way to "simply" reproduce it. I can reproduce it reliably in my pretty complex environment.
I think this might be a multithreading issue. My stack trace is similar.
I'm making thousands of requests in a short window.
I have the same problem and a similar stack trace. I have been able to reproduce the bug with a short snippet (see below).
I have run the following code 39 times and it segfaulted every time, with an average running time of 47 seconds. The subscription I am pulling from contains a few hundred messages.

```python
import os
import random
from datetime import datetime

from google.cloud import pubsub, datastore

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'some-file.json'

subscriber_client = pubsub.SubscriberClient()
datastore_client = datastore.Client()

KEYS = [i for i in range(100)]

def consume(message):
    # Write a small entity to Datastore for every Pub/Sub message received.
    data = {}
    data['key1'] = random.choice(KEYS)
    data['time'] = datetime.now().isoformat()
    key = datastore_client.key('Test', str(data['key1']), 'TestMessage', str(data['time']))
    entity = datastore.Entity(key)
    entity.update(data)
    datastore_client.put(entity)
    # Then read back every entity of that kind.
    query = datastore_client.query(kind='TestMessage')
    list(query.fetch())

subscription = subscriber_client.subscription_path("some-project", "some-subscription")
subscriber_client.subscribe(subscription, consume)

while True:
    pass
```
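(Not part of the original snippet, but useful when triaging this class of bug: the crash happens inside native code, so the Python traceback is normally lost. Enabling the standard-library `faulthandler` prints every thread's Python stack when SIGSEGV arrives.)

```python
import faulthandler

# On SIGSEGV and other fatal signals, dump the Python traceback of
# every thread to stderr before the process dies.
faulthandler.enable()
```

The same effect is available without code changes via `python -X faulthandler script.py`.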
We are seeing the exact same segmentation issue when using the
I'm also seeing this exact problem with 10 worker threads writing to a Google Spanner table, which Google recommends as the way to write bulk data. After some testing, I think my specific problem was that I was sharing the same Spanner instance across all of the threads. Having each thread create its own instance seems to have solved it.
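A minimal sketch of that workaround, assuming the google-cloud-spanner client API of that era; the instance, database, and table names here are made up for illustration:

```python
import threading

from google.cloud import spanner

def worker(rows):
    # Workaround: each thread builds its own client (and therefore its own
    # gRPC channel) instead of sharing one client object across threads.
    client = spanner.Client()
    database = client.instance('my-instance').database('my-database')
    with database.batch() as batch:
        batch.insert(table='Test', columns=('id',), values=rows)

threads = [threading.Thread(target=worker, args=([[i]],)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```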
@t00n: is that as minimal as you've been able to make your reproduction? In particular, is authentication required? Are mutative operations required, or can the problem be observed with only read-only operations? How much server-side setup is required to reproduce the problem?
@t00n: are you able to run with
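The question above is cut off; it most plausibly asks about running with gRPC's debug tracing enabled. `GRPC_VERBOSITY` and `GRPC_TRACE` are real gRPC environment variables, though which flags the commenter meant is a guess:

```python
import os

# Must be set before grpc is first imported, since the C core reads
# these environment variables when it initializes.
os.environ['GRPC_VERBOSITY'] = 'DEBUG'
os.environ['GRPC_TRACE'] = 'all'  # extremely noisy; narrower tracers also exist

from google.cloud import pubsub, datastore  # noqa: E402  (imports grpc under the hood)
```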
Is there any update? In issue #14040 I am also using multiple threads; the call gets blocked and there is an immediate core dump on aarch64 Linux, while the same code runs fine on x86.
@nathanielmanistaatgoogle I managed to reduce the code to this:

```python
from google.cloud import pubsub, datastore

subscriber_client = pubsub.SubscriberClient()
datastore_client = datastore.Client(project='my-project')

def consume(message):
    # A single fixed-key Datastore write per Pub/Sub message is enough to crash.
    key = datastore_client.key('Test', 1)
    entity = datastore.Entity(key)
    datastore_client.put(entity)

subscription = subscriber_client.subscription_path("my-project", "testsegfault")
subscriber_client.subscribe(subscription, consume)

while True:
    pass
```

It needs authentication to access Pub/Sub and to push entities into Datastore; if I remove either of those two, it does not segfault anymore. Write operations in Datastore are required. You need a Pub/Sub subscription containing ~100k unacked messages; I tried with 1,000 and 10,000 and it did not segfault. I ran the script with
EDIT: I am using grpcio 1.8.4 and still having the bug
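To build up the ~100k-message backlog this repro needs, something like the following should work. The topic name is hypothetical, and this assumes the publisher API from the same google-cloud-pubsub release used above:

```python
from google.cloud import pubsub

publisher = pubsub.PublisherClient()
topic = publisher.topic_path('my-project', 'test-topic')

# publish() is asynchronous and returns a future per message; publishing
# ~100k tiny messages and never acking them leaves the backlog in place.
futures = [publisher.publish(topic, data=b'x') for _ in range(100000)]
for f in futures:
    f.result()  # wait for the server to accept each message
```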
I've just run into this issue myself. It does seem to be a multithreading issue; it happens to me when enqueuing too many futures in Pub/Sub. Running the script in GDB shows lots of new threads starting and quitting. Here are the last few lines before the crash:

```
[New Thread 0x7ffe3b86b700 (LWP 4006)]
[Thread 0x7ffe3b86b700 (LWP 4006) exited]
[New Thread 0x7ffdb856e700 (LWP 4007)]
[Thread 0x7ffdb856e700 (LWP 4007) exited]
[New Thread 0x7ffe3b86b700 (LWP 4008)]
[Thread 0x7ffe3b86b700 (LWP 4008) exited]
[New Thread 0x7ffdb856e700 (LWP 4009)]
[Thread 0x7ffdb856e700 (LWP 4009) exited]
[New Thread 0x7ffe3b86b700 (LWP 4010)]
[Thread 0x7ffe3b86b700 (LWP 4010) exited]
[New Thread 0x7ffdb856e700 (LWP 4011)]
[Thread 0x7ffdb856e700 (LWP 4011) exited]
[New Thread 0x7ffe3b86b700 (LWP 4012)]
[Thread 0x7ffe3b86b700 (LWP 4012) exited]
[New Thread 0x7ffdb856e700 (LWP 4013)]
[Thread 0x7ffdb856e700 (LWP 4013) exited]
[New Thread 0x7ffe3b86b700 (LWP 4014)]
[Thread 0x7ffe3b86b700 (LWP 4014) exited]
[New Thread 0x7ffdb856e700 (LWP 4015)]
[Thread 0x7ffdb856e700 (LWP 4015) exited]
[New Thread 0x7ffe3b86b700 (LWP 4016)]
[Thread 0x7ffe3b86b700 (LWP 4016) exited]
[New Thread 0x7ffdb856e700 (LWP 4017)]
[Thread 0x7ffdb856e700 (LWP 4017) exited]
[New Thread 0x7ffe3b86b700 (LWP 4018)]
[Thread 0x7ffe3b86b700 (LWP 4018) exited]
[New Thread 0x7ffdb856e700 (LWP 4019)]
[Thread 0x7ffdb856e700 (LWP 4019) exited]
[New Thread 0x7ffe3b86b700 (LWP 4020)]
[Thread 0x7ffe3b86b700 (LWP 4020) exited]
[New Thread 0x7ffdb856e700 (LWP 4021)]
[Thread 0x7ffdb856e700 (LWP 4021) exited]
[New Thread 0x7ffe3b86b700 (LWP 4022)]
[Thread 0x7ffe3b86b700 (LWP 4022) exited]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffcd2d6700 (LWP 1738)]
gpr_ref_non_zero (r=0x0) at src/core/lib/support/sync.cc:93
93	src/core/lib/support/sync.cc: No such file or directory.
(gdb) bt
#0  gpr_ref_non_zero (r=0x0) at src/core/lib/support/sync.cc:93
#1  0x00007fffec112459 in grpc_connected_subchannel_ref (c=0x0)
    at src/core/ext/filters/client_channel/subchannel.cc:169
#2  0x00007fffec125562 in pf_pick_locked (exec_ctx=<optimized out>, pol=0x7ffd05281380, pick_args=<optimized out>,
    target=0x7ffdb997e5b0, context=<optimized out>, user_data=<optimized out>, on_complete=0x7ffdb997e560)
    at src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc:182
#3  0x00007fffec10a796 in pick_callback_start_locked (exec_ctx=exec_ctx@entry=0x7fffcd2d4670,
    elem=elem@entry=0x7ffdb997e430) at src/core/ext/filters/client_channel/client_channel.cc:1147
#4  0x00007fffec10b472 in start_pick_locked (exec_ctx=0x7fffcd2d4670, arg=0x7ffdb997e430, ignored=<optimized out>)
    at src/core/ext/filters/client_channel/client_channel.cc:1306
#5  0x00007fffec09afc2 in grpc_combiner_continue_exec_ctx (exec_ctx=0x7fffcd2d4670)
    at src/core/lib/iomgr/combiner.cc:260
#6  0x00007fffec0a4982 in grpc_exec_ctx_flush (exec_ctx=0x7fffcd2d4670) at src/core/lib/iomgr/exec_ctx.cc:93
#7  0x00007fffec0a5489 in run_closures (exec_ctx=0x7fffcd2d4670, list=...) at src/core/lib/iomgr/executor.cc:80
#8  executor_thread (arg=arg@entry=0x7fffd8987800) at src/core/lib/iomgr/executor.cc:180
#9  0x00007fffec08db37 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.cc:53
#10 0x00007ffff7752184 in start_thread (arg=0x7fffcd2d6700) at pthread_create.c:312
#11 0x00007ffff747f03d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)
```
I just talked to the core team about this; they believe there have been vast changes to the refcount logic recently, and they strongly suspect this bug is fixed on the 1.9.0 branch. I just pushed 1.9.0rc3 to PyPI, so please let us know if you are still seeing it in that RC.
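Since a few follow-ups below still see the crash on the RC, it's worth confirming which grpcio the failing process actually loads (stale wheels in mixed environments are a common culprit); this only uses the public `grpc` module:

```python
import grpc

# The version string of the grpc extension actually loaded by this process.
print(grpc.__version__)  # expect '1.9.0rc3' (or newer) if the RC is in use
```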
Has the same fix been made for C++ as well?
@subhayu89 Yes, the changes are in C core and that layer is shared across languages (except Java and Go). |
Hi @mehrdada, I tried to use the
In our script we both pull from and push to Pub/Sub, and write to Bigtable and BigQuery. So we have multiple different gcloud clients running at the same time, which might be of interest given @t00n's example script.
```
I0130 16:00:54.914150633 21085 ev_epoll1_linux.cc:114] grpc epoll fd: 6
```

Still facing the same issue with v1.9.0rc3 as well.
Thanks, I have managed to create a new pubsub and datastore setup and repro this issue. |
So far I have managed to eliminate the dependency on Pub/Sub by replacing it with a thread pool. Does anyone happen to know of an older version of gRPC on which they can't reproduce this?
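A sketch of the kind of thread-pool substitution described above, reusing the shared-client pattern from @t00n's reduced repro; the worker and iteration counts are arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor

from google.cloud import datastore

# One client shared by every worker, mirroring the original repro; the
# pool stands in for the Pub/Sub subscriber's internal worker threads.
datastore_client = datastore.Client(project='my-project')

def put_entity(_):
    key = datastore_client.key('Test', 1)
    entity = datastore.Entity(key)
    datastore_client.put(entity)

with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(put_entity, range(100000)))
```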
Yes; for C++ it also crashes with v1.6.0, v1.7.1, and v1.8.4.
Do you have a C++ repro?! That would be really helpful!
(attached: version info, grpc logs, gdb output)
Is it not reproducible for Java and Go?
@subhayu89 I am confused; I think you are talking about the other issue you filed? I don't think it's the same issue at all. Let's keep the discussion in this thread centered on the original post's issue.
OK, let's keep the discussions separate, but I am also getting the refcount issue with C++, @mehrdada, and since you referenced my issue to this one I replied. Anyway, let's not discuss anything apart from Python on this issue. Thanks for your reply.
@subhayu89 Sorry for the confusion. To be clear, if there is a C++ bug with the same symptom (a segfault in client_auth_filter.cc), please do bring it up here.
Here are some valgrind outputs:
We've cut a 1.9.1 patch release containing the fix. Please reopen if the issue is not resolved.
Please answer these questions before submitting your issue.
What version of gRPC and what language are you using?
I'm using Python. We are using several Python gcloud libraries:
```
google-api-core==0.1.1
google-auth==1.1.1
google-cloud==0.29.0
google-cloud-bigquery==0.27.0
google-cloud-bigtable==0.28.1
google-cloud-core==0.27.1
google-cloud-datastore==1.4.0
google-cloud-dns==0.28.0
google-cloud-error-reporting==0.28.0
google-cloud-firestore==0.28.0
google-cloud-language==0.31.0
google-cloud-logging==1.4.0
google-cloud-monitoring==0.28.0
google-cloud-pubsub==0.29.0
google-cloud-resource-manager==0.28.0
google-cloud-runtimeconfig==0.28.0
google-cloud-spanner==0.29.0
google-cloud-speech==0.30.0
google-cloud-storage==1.6.0
google-cloud-trace==0.16.0
google-cloud-translate==1.3.0
google-cloud-videointelligence==0.28.0
google-cloud-vision==0.28.0
google-gax==0.15.15
google-resumable-media==0.3.1
googleapis-common-protos==1.5.3
grpc-google-iam-v1==0.11.4
grpcio==1.7.0
```
What operating system (Linux, Windows, …) and version?
```
(venv) tanakaed@triage-bot:~/server$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.3 LTS
Release:	16.04
Codename:	xenial
```
What runtime / compiler are you using (e.g. python version or version of gcc)
Python info:

```
(venv) tanakaed@triage-bot:~/server$ python --version
Python 3.6.2 :: Anaconda, Inc.
(venv) tanakaed@triage-bot:~/server$ conda info
Current conda install:
...
tanakaed@triage-bot:~/server$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```
What did you do?
We have Python code that both publishes messages to and pulls messages from Pub/Sub. We also have Python code that interfaces with Google Datastore and Google Logging. I don't know which of these is triggering this segmentation fault. My code runs fine for a while, but after ~60 minutes of running and processing some cases, a segfault is raised.
I ran my Python script inside gdb and this is what I got:
Thread 9 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff054a700 (LWP 29398)]
gpr_ref_non_zero (r=0x0) at src/core/lib/support/sync.c:93
93 src/core/lib/support/sync.c: No such file or directory.
(gdb) backtrace
#0 gpr_ref_non_zero (r=0x0) at src/core/lib/support/sync.c:93
#1 0x00007ffff12c8365 in grpc_stream_ref (refcount=) at src/core/lib/transport/transport.c:50
#2 0x00007ffff12f3490 in send_security_metadata (batch=0x7fff8c0820f0, elem=0x7fff8c0821a0, exec_ctx=0x7ffff0549ec0)
at src/core/lib/security/transport/client_auth_filter.c:216
#3 on_host_checked (exec_ctx=exec_ctx@entry=0x7ffff0549ec0, arg=arg@entry=0x7fff8c0820f0, error=)
at src/core/lib/security/transport/client_auth_filter.c:231
#4 0x00007ffff12f396f in auth_start_transport_stream_op_batch (exec_ctx=0x7ffff0549ec0, elem=0x7fff8c0821a0, batch=0x7fff8c0820f0)
at src/core/lib/security/transport/client_auth_filter.c:316
#5 0x00007ffff1300f68 in waiting_for_pick_batches_resume (elem=, elem=, exec_ctx=0x7ffff0549ec0)
at src/core/ext/filters/client_channel/client_channel.c:953
#6 create_subchannel_call_locked (error=0x0, elem=, exec_ctx=0x7ffff0549ec0)
at src/core/ext/filters/client_channel/client_channel.c:1016
#7 pick_done_locked (exec_ctx=0x7ffff0549ec0, elem=, error=0x0) at src/core/ext/filters/client_channel/client_channel.c:1042
#8 0x00007ffff12932f3 in grpc_combiner_continue_exec_ctx (exec_ctx=exec_ctx@entry=0x7ffff0549ec0) at src/core/lib/iomgr/combiner.c:259
#9 0x00007ffff129bdf8 in grpc_exec_ctx_flush (exec_ctx=exec_ctx@entry=0x7ffff0549ec0) at src/core/lib/iomgr/exec_ctx.c:93
#10 0x00007ffff129c3c1 in run_closures (exec_ctx=0x7ffff0549ec0, list=...) at src/core/lib/iomgr/executor.c:81
#11 executor_thread (arg=arg@entry=0x5555565d3e00) at src/core/lib/iomgr/executor.c:181
#12 0x00007ffff1285c37 in thread_body (v=) at src/core/lib/support/thd_posix.c:53
#13 0x00007ffff7bc16ba in start_thread (arg=0x7ffff054a700) at pthread_create.c:333
#14 0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)
What did you expect to see?
No segmentation fault