
Cloud Spanner with Python multiprocessing hangs #4736

Closed
snthibaud opened this issue Jan 11, 2018 · 44 comments
Labels
api: spanner Issues related to the Spanner API. grpc priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

@snthibaud (Author)

Inserting rows turned out to be very slow (80 KB/s), so I am trying multiprocessing to parallelize the process.
I am currently using only one node in Cloud Spanner, the rows are sorted by primary key and there are ~30 batches that have a few hundred rows each.
For this I am using the following code:

from multiprocessing import Pool

from google.cloud import spanner


def _process_batch(instance_name, database_name, table_name, column_names, b):
    spanner_client = spanner.Client(credentials=get_credentials())
    instance = spanner_client.instance(instance_name)
    database = instance.database(database_name)
    with database.batch() as batch:
        batch.insert(table=table_name, columns=tuple(column_names), values=b)


pool = Pool(5)
pool.starmap(_process_batch, batches)

The code hangs when it tries to enter the batch scope (it seems to never succeed in creating a session).
Any ideas?

@chemelnucfin chemelnucfin added api: spanner Issues related to the Spanner API. type: question Request for information or clarification. Not an issue. labels Jan 11, 2018

vkedia commented Jan 11, 2018

cc @jonparrott

@theacodes (Contributor)

@snthibaud can you let us know what versions you're using? The output of pip freeze would be helpful.


snthibaud commented Jan 12, 2018

@vkedia @jonparrott Thanks for looking into it! These are the versions in my virtual environment:

Package | Installed | Latest
-- | -- | --
Flask | 0.12.2 | 0.12.2
Jinja2 | 2.9.6 | 2.10
MarkupSafe | 1.0 | 1.0
PyYAML | 3.12 | 3.12
TA-Lib | 0.4.10 | 0.4.14
Werkzeug | 0.12.2 | 0.14.1
amqp | 2.2.2 | 2.2.2
argparse | 1.4.0 | 1.4.0
billiard | 3.5.0.3 | 3.5.0.3
cachetools | 2.0.1 | 2.0.1
celery | 4.1.0 | 4.1.0
certifi | 2017.11.5 | 2017.11.5
certitude | 1.0.1 | 1.0.1
cffi | 1.11.2 | 1.11.2
chardet | 3.0.4 | 3.0.4
click | 6.7 | 6.7
cycler | 0.10.0 |  
dateutils | 0.6.6 | 0.6.6
dill | 0.2.7.1 | 0.2.7.1
facebookads | 2.11.1 | 2.11.1
future | 0.16.0 | 0.16.0
gapic-google-cloud-datastore-v1 | 0.15.3 | 0.90.4
google-api-core | 0.1.1 | 0.1.3
google-api-python-client | 1.6.4 | 1.6.4
google-auth | 1.2.0 | 1.2.1
google-auth-httplib2 | 0.0.2 | 0.0.3
google-cloud-bigquery | 0.28.0 | 0.29.0
google-cloud-core | 0.28.0 | 0.28.0
google-cloud-datastore | 1.4.0 | 1.4.0
google-cloud-spanner | 0.29.0 | 0.29.0
google-cloud-storage | 1.6.0 | 1.6.0
google-gax | 0.15.16 | 0.15.16
google-resumable-media | 0.3.1 | 0.3.1
googleapis-common-protos | 1.5.3 | 1.5.3
grpc-google-iam-v1 | 0.11.4 | 0.11.4
grpcio | 1.8.3 | 1.8.3
html5lib | 0.999999999 | 1.0.1
httplib2 | 0.10.3 | 0.10.3
idna | 2.6 | 2.6
itsdangerous | 0.24 | 0.24
json-table-schema | 0.2.1 | 0.2.1
kombu | 4.1.0 | 4.1.0
lxml | 4.1.0 | 4.1.1
matplotlib | 2.0.2 | 2.1.1
messytables | 0.15.2 | 0.15.2
microsoftbotframework | 0.3.0 | 0.3.0
mysqlclient | 1.3.12 | 1.3.12
numpy | 1.13.1 | 1.14.0
oauth2client | 3.0.0 | 4.1.2
pandas | 0.20.3 | 0.22.0
pip | 9.0.1 | 9.0.1
ply | 3.8 | 3.10
proto-google-cloud-datastore-v1 | 0.90.4 | 0.90.4
protobuf | 3.4.0 | 3.5.1
pubnub | 4.0.13 | 4.0.13
pyasn1 | 0.3.7 | 0.4.2
pyasn1-modules | 0.1.5 | 0.2.1
pycparser | 2.18 | 2.18
pycryptodomex | 3.4.7 | 3.4.7
pyparsing | 2.2.0 | 2.2.0
python-dateutil | 2.6.0 | 2.6.1
python-magic | 0.4.13 | 0.4.15
pytz | 2017.2 | 2017.3
requests | 2.18.4 | 2.18.4
rsa | 3.4.2 | 3.4.2
scikit-learn | 0.19.1 | 0.19.1
scipy | 1.0.0 | 1.0.0
setuptools | 36.6.0 | 38.4.0
six | 1.11.0 | 1.11.0
ticketpy | 1.1.2 | 1.1.2
uritemplate | 3.0.0 | 3.0.0
urllib3 | 1.22 | 1.22
vine | 1.1.4 | 1.1.4
webencodings | 0.5.1 | 0.5.1
wheel | 0.29.0 | 0.30.0
xlrd | 1.1.0 | 1.1.0
xlwt | 1.3.0 | 1.3.0


theacodes commented Jan 12, 2018

/cc @nathanielmanistaatgoogle

Are you doing any gRPC work, or using the client at all, before starting the pool? Forking after gRPC has been used apparently causes issues.

FYI: if you're using multiprocessing to increase write throughput (and not actually doing anything CPU-intensive), I'd recommend using threading instead, as RPC calls are I/O-bound, not CPU-bound.
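A rough sketch of that suggestion (all names here are made up for illustration; a real `insert_batch` would call `database.batch()` against Spanner rather than return a count):

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(rows, size):
    # Split the full row list into batches of at most `size` rows.
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def insert_batch(batch):
    # Placeholder for the real I/O-bound work; real code would do:
    #   with database.batch() as b:
    #       b.insert(table=..., columns=..., values=batch)
    return len(batch)


rows = list(range(1000))
batches = chunked(rows, 100)

# One shared client/database object across threads; because the work is
# I/O-bound, the threads overlap the RPC round trips.
with ThreadPoolExecutor(max_workers=5) as executor:
    inserted = sum(executor.map(insert_batch, batches))

print(inserted)  # 1000
```

The point is that no forking happens, so the gRPC channel created by the shared client is never inherited across a process boundary.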

@theacodes theacodes added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. grpc priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed type: question Request for information or clarification. Not an issue. labels Jan 12, 2018
@theacodes (Contributor)

@chemelnucfin Since you've been doing a lot of work on Spanner lately, can I task you with putting together a simple reproducible case for this?


dhermes commented Jan 12, 2018

@jonparrott It'd certainly be possible to allow swapping out all threading.Thread and concurrent.futures.ThreadPoolExecutor constructor invocations for something more generic (though there is at least one place in a Cython file where a threading.Thread is created). However, the C core will still use native OS threads (pthreads on Linux).

@theacodes (Contributor)

@dhermes I'm not sure what you mean?


dhermes commented Jan 12, 2018

@jonparrott I was trying to describe how one might implement support in gRPC for users to set the type of concurrency in a transparent way.

@theacodes (Contributor)

Oh, I don't think it's anything in userland. From what I understand, this is just a case of trying to use any gRPC-based client after forking, not an issue with any concurrency that our client libraries use.

@chemelnucfin (Contributor)

@jonparrott I'll take a look.


chemelnucfin commented Jan 12, 2018

Currently getting this:

__________________________ TestSessionAPI.test_multi_processing_transaction ___________________________
Traceback (most recent call last):
  File "~/projects/google-cloud-python/spanner/tests/system/test_system.py", line 626, in test_multi_processing_transaction
    with self._db.batch() as batch:
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 274, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
    put(task)
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "~/projects/google-cloud-python/spanner/.nox/sys-3-6/lib/python3.6/site-packages/google/cloud/client.py", line 138, in __getstate__
    'Clients have non-trivial state that is local and unpickleable.',
_pickle.PicklingError: Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
----------------------------------------- Captured log setup ------------------------------------------
requests.py                117 DEBUG    Making request: POST https://accounts.google.com/o/oauth2/token
connectionpool.py          824 DEBUG    Starting new HTTPS connection (1): accounts.google.com
connectionpool.py          396 DEBUG    https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None
------------------------------------------ Captured log call ------------------------------------------
requests.py                117 DEBUG    Making request: POST https://accounts.google.com/o/oauth2/token
connectionpool.py          824 DEBUG    Starting new HTTPS connection (1): accounts.google.com
connectionpool.py          396 DEBUG    https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None

@theacodes (Contributor)

@chemelnucfin try creating a new client in each process, as in the OP.
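The pickling error above is inherent to the client object: it holds locks and channel state that cannot cross a process boundary. A minimal illustration of both halves of the advice (FakeClient is a stand-in I made up, not the real spanner.Client):

```python
import pickle
import threading


class FakeClient:
    """Stand-in for spanner.Client: it holds unpicklable state (a lock)."""
    def __init__(self):
        self._lock = threading.RLock()


try:
    pickle.dumps(FakeClient())
    client_pickles = True
except TypeError:
    # RLock objects cannot be pickled, so the "client" can't be sent
    # to a worker process.
    client_pickles = False

# Only plain, picklable arguments (names, row data) should be passed to
# workers; each worker then constructs its own client from them.
args = ("my-instance", "my-database", "my-table")

print(client_pickles)                            # False
print(pickle.loads(pickle.dumps(args)) == args)  # True
```

This is why the OP's approach of passing only the instance/database/table names into `_process_batch` and building the client inside the worker is the right shape.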


chemelnucfin commented Jan 13, 2018

@jonparrott So I have chased the multiprocessing failure down to this:

============================================== FAILURES ==============================================
_________________________ TestDatabaseAPI.test_multi_processing_transaction __________________________
Traceback (most recent call last):
  File "~/projects/google-cloud-python/spanner/tests/system/test_system.py", line 517, in test_multi_processing_transaction
    pool.starmap(_process, batches)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 274, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f7aff313278>'. Reason: 'TypeError("can't pickle _thread.RLock objects",)'
----------------------------------------- Captured log setup -----------------------------------------
requests.py                117 DEBUG    Making request: POST https://accounts.google.com/o/oauth2/token
connectionpool.py          824 DEBUG    Starting new HTTPS connection (1): accounts.google.com
connectionpool.py          396 DEBUG    https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None
----------------------------------------- Captured log call ------------------------------------------
requests.py                117 DEBUG    Making request: POST https://accounts.google.com/o/oauth2/token
connectionpool.py          824 DEBUG    Starting new HTTPS connection (1): accounts.google.com
connectionpool.py          396 DEBUG    https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None
======================================== 61 tests deselected =========================================

The culprit is the request_serializer, which is here in spanner_v1/proto/spanner_pb2.py:

      self.Commit = channel.unary_unary(
          '/google.spanner.v1.Spanner/Commit',
          request_serializer=CommitRequest.SerializeToString,
          response_deserializer=CommitResponse.FromString,
      )

which I have chased to more or less here:
https://github.com/google/protobuf/blob/master/python/google/protobuf/message.py

Any pointers from here will be appreciated.

Also, multiprocessing didn't hang, though; snapshots were OK.

@snthibaud (Author)

@jonparrott In answer to your question: I am using a different instance of the Spanner client before starting the multiprocessing pool. This should be OK, right?
@chemelnucfin I am not completely sure the service account had the right permissions, so maybe it hung because of a missing permission, although I did try again after changing the permissions and it was still hanging.


chemelnucfin commented Jan 15, 2018 via email

@snthibaud (Author)

@chemelnucfin What exactly should I try? I am not sure it was related to the retrieval of a snapshot. For me, it was hanging when entering the scope of a batch. Does it retrieve a snapshot at that point?

@theacodes (Contributor)

I am using a different instance of the spanner client before starting the multiprocessing pool. This should be ok, right?

If I remember correctly, that is the core of the issue: gRPC's C core cannot function after a fork, so any work with gRPC must be done after you fork. Correct me if I'm wrong, @kpayson64 @nathanielmanistaatgoogle.

@chemelnucfin (Contributor)

@jonparrott I have to go back and check, but I believe that is what I did, and multiprocessing worked. I just couldn't use the batch.insert function. I'll report back later.


kpayson64 commented Jan 16, 2018 via email

@chemelnucfin (Contributor)

@snthibaud Could you try a pass statement instead of the batch.insert, or try to get a snapshot, and see where it hangs?

From what I understand, it hangs once you step into the context manager; so the code runs OK without the entire with block?

Could you also try some other operations as I did in #4756 and see if they work?

@snthibaud (Author)

@chemelnucfin When I put a pass statement there, it still hangs. I traced the hang back to line 103 of session.py in the Cloud Spanner library. It tries to create a session on this line: session_pb = api.create_session(self._database.name, options=options).

@snthibaud (Author)

@kpayson64 @jonparrott I understand that in general the client cannot be passed to a new process, but in this case, the main process and all forked processes each have their own client (for the forked processes, see the code above). These processes should be completely separated from each other, right? Is the gRPC core still somehow shared between the main process and child processes?

@snthibaud (Author)

@chemelnucfin This traceback from stopping the process might also help (the frames of several worker processes are interleaved):

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/snthibaud/Code/drivemode_batches/job_exports/output/cloud_spanner_target.py", line 100, in _process_batch
    with database.batch() as batch:
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/snthibaud/Code/drivemode_batches/job_exports/output/cloud_spanner_target.py", line 100, in _process_batch
    with database.batch() as batch:
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/database.py", line 401, in __enter__
    session = self._session = self._database._pool.get()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/database.py", line 401, in __enter__
    session = self._session = self._database._pool.get()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/pool.py", line 230, in get
    session.create()
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/pool.py", line 230, in get
    session.create()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/snthibaud/Code/drivemode_batches/job_exports/output/cloud_spanner_target.py", line 100, in _process_batch
    with database.batch() as batch:
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/session.py", line 103, in create
    session_pb = api.create_session(self._database.name, options=options)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/session.py", line 103, in create
    session_pb = api.create_session(self._database.name, options=options)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/database.py", line 401, in __enter__
    session = self._session = self._database._pool.get()
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/gapic/spanner_client.py", line 330, in create_session
    return self._create_session(request, options)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/gapic/spanner_client.py", line 330, in create_session
    return self._create_session(request, options)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 452, in inner
    return api_caller(api_call, this_settings, request)
  File "/Users/snthibaud/Code/drivemode_batches/job_exports/output/cloud_spanner_target.py", line 100, in _process_batch
    with database.batch() as batch:
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/pool.py", line 230, in get
    session.create()
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 452, in inner
    return api_caller(api_call, this_settings, request)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/database.py", line 401, in __enter__
    session = self._session = self._database._pool.get()
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 438, in base_caller
    return api_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/session.py", line 103, in create
    session_pb = api.create_session(self._database.name, options=options)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 438, in base_caller
    return api_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/pool.py", line 230, in get
    session.create()
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 376, in inner
    return a_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/gapic/spanner_client.py", line 330, in create_session
    return self._create_session(request, options)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 376, in inner
    return a_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/session.py", line 103, in create
    session_pb = api.create_session(self._database.name, options=options)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 121, in inner
    return to_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 452, in inner
    return api_caller(api_call, this_settings, request)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 121, in inner
    return to_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 438, in base_caller
    return api_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/gapic/spanner_client.py", line 330, in create_session
    return self._create_session(request, options)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
    return a_func(*updated_args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
    return a_func(*updated_args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 376, in inner
    return a_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 452, in inner
    return api_caller(api_call, this_settings, request)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 483, in __call__
    credentials)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 483, in __call__
    credentials)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 121, in inner
    return to_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 438, in base_caller
    return api_call(*args)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 477, in _blocking
    _handle_event(completion_queue.poll(), state,
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
    return a_func(*updated_args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 477, in _blocking
    _handle_event(completion_queue.poll(), state,
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 376, in inner
    return a_func(*args, **kwargs)
  File "src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi", line 100, in grpc._cython.cygrpc.CompletionQueue.poll
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 483, in __call__
    credentials)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 121, in inner
    return to_call(*args)
  File "src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi", line 100, in grpc._cython.cygrpc.CompletionQueue.poll
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 477, in _blocking
    _handle_event(completion_queue.poll(), state,
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
    return a_func(*updated_args, **kwargs)
KeyboardInterrupt
  File "src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi", line 100, in grpc._cython.cygrpc.CompletionQueue.poll
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 483, in __call__
    credentials)

@snthibaud (Author)

@kpayson64 @jonparrott @chemelnucfin I have tried creating all clients only after forking, and that did work. Perhaps the gRPC core is shared between the processes after all. I still feel it's not quite right that subprocesses hang in this case.
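A minimal sketch of the pattern that worked here (all names illustrative; a real worker would construct spanner.Client() inside the child process). Using the spawn start method, an option not discussed above, makes the separation explicit, since each worker begins in a fresh interpreter and so cannot inherit any gRPC state from the parent:

```python
import multiprocessing as mp


def process_batch(instance_name, batch):
    # A real worker would construct spanner.Client() here, inside the
    # child process, so no gRPC state exists before the process starts.
    return (instance_name, len(batch))


if __name__ == "__main__":
    # "spawn" launches a fresh Python interpreter per worker instead of
    # forking, so nothing created in the parent is inherited.
    ctx = mp.get_context("spawn")
    batches = [("inst", [1, 2]), ("inst", [3, 4, 5])]
    with ctx.Pool(2) as pool:
        results = pool.starmap(process_batch, batches)
    print(results)
```

With the default fork start method the same structure also works, as long as nothing in the parent touches gRPC before the pool is created.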

@chemelnucfin (Contributor)

Glad to hear it.

@jonparrott Now I'm confused why my code did work.

@snthibaud (Author)

@jonparrott @chemelnucfin I am trying to use threads instead of processes (as you suggested) to increase write throughput as follows:

with ThreadPoolExecutor(max_workers=16) as executor:
    for batch_start in range(0, len(data.rows), rows_per_batch):
        executor.submit(
            _process_batch, database, self.config[INSTANCE], self.config[DATABASE],
            self.config[TABLE], data_columns,
            data.rows[batch_start:batch_start + rows_per_batch])

But then I get Process finished with exit code 139 (interrupted by signal 11: SIGSEGV) after a few threads have finished. I am using the same client in all threads. Is that wrong?

@theacodes (Contributor)

@snthibaud

Perhaps the gRPC core is shared between the processes after all.

That is my understanding, so using any gRPC client before forking will cause issues in the forked processes.

I am using the same client in all threads. Is that wrong?

Our client should be thread-safe, and in any case, should not cause a segfault. This seems to point to some native code causing issues. @kpayson64 can you help out here?


kpayson64 commented Jan 17, 2018

Yes, that looks like a core segfault, but without any additional information it's hard to diagnose.

It's possible that you are running into grpc/grpc#13327, which as I understand it occurs when there are many active threads.


vkedia commented Jan 17, 2018

I believe @haih-g also ran into this issue while using spanner client across multiple threads.

@snthibaud (Author)

@kpayson64 Yes, that's almost certainly it. I am also using Python 3.6, like most users in the issue you mention.

@chemelnucfin (Contributor)

@snthibaud Since I believe this is now a different issue, also reported in a different repository, I will close this one. Please feel free to reopen it, or open another issue, if desired.

Thanks.


vkedia commented Jan 19, 2018

Actually, I would like to reopen it. The underlying issue might be in gRPC, but it will definitely impact Spanner customers. At the least, we should have some documentation and recommendations on how customers can safely use the Spanner client across processes and across threads.

@theacodes theacodes reopened this Jan 19, 2018
@theacodes (Contributor)

Agreed; we should have documentation that states explicitly that you cannot use this client in a multiprocessing environment unless you defer all calls until after forking.


chemelnucfin commented Jan 19, 2018

@jonparrott @vkedia I was going to follow up in the PR or another issue regarding the multiprocessing issues. Sorry!


vkedia commented Jan 19, 2018

Thanks. What about multithreading? Does the client not work at all with multithreading, or does it work under certain conditions?

@theacodes (Contributor)

The client should be thread-safe; if there are any issues with that, we should solve them.


vkedia commented Jan 19, 2018

But there seems to be some issue with gRPC which is causing a segmentation fault in the Spanner client when it is used across multiple threads. How do we plan to tackle that?

@theacodes (Contributor)

@vkedia that may already be resolved (see grpc/grpc#13327 (comment)), but we need someone to test. So far it has been difficult to find a reproducible case.

@chemelnucfin (Contributor)

@snthibaud Is multithreading still segfaulting, and multiprocessing still hanging and causing you problems, after the fix @jonparrott mentioned?

@kpayson64 I wrote a test in #4756 which creates the pool after client creation, and it does not hang. However, you and @snthibaud mention that it should hang. Did I write something wrong? Thank you.

@chemelnucfin (Contributor)

@snthibaud From this comment, grpc/grpc#12455 (comment), it seems the problem only occurs on Mac and not Linux. May I ask if you are using a Mac?
Thanks.

@chemelnucfin chemelnucfin added this to To Do in P1 Bugs Jan 31, 2018
@srini100

Folks, gRPC issue 13327 was recently fixed and verified to address some known segfault cases involving multithreading. This fix in gRPC is related to a bug in internal refcount logic and NOT related to fork(). If anyone on this thread is able to reproduce this issue, it is worth retrying with gRPC release 1.9.1. I am curious to know whether there is a way to reproduce the hang/crash when forking before client creation.


tseaver commented Feb 21, 2018

@srini100 FWIW, our system tests for Spanner use multithreading and don't crash.

I just created a multiprocessing testcase which runs cleanly:

$ .nox/sys-3-6/bin/python mp_test.py 
INFO:main:Creating MP pool
INFO:main:Database exists: mp_test
INFO:main:Deleting all items in table: counters
INFO:main:Dispatching to worker processes.
INFO:worker 25167:start
INFO:worker 25168:start
INFO:worker 25169:start
INFO:worker 25170:start
INFO:worker 25171:start
INFO:worker 25167:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25169:batch
INFO:worker 25171:batch
INFO:worker 25168:batch
INFO:worker 25170:batch
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25167:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25168:batch
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25169:batch
INFO:worker 25170:batch
INFO:worker 25171:batch
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25167:batch
INFO:worker 25168:batch
INFO:worker 25170:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25169:batch
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25171:batch
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25167:batch
INFO:worker 25168:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25170:batch
INFO:worker 25169:batch
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25171:batch
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25167:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25168:batch
INFO:worker 25170:batch
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25169:batch
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25171:batch
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25167:batch
INFO:worker 25168:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25170:batch
INFO:worker 25169:batch
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25171:batch
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25169:batch
INFO:worker 25170:batch
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25167:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25168:batch
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25171:batch
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25170:batch
INFO:worker 25170:done
INFO:worker 25167:batch
INFO:worker 25167:done
INFO:worker 25168:batch
INFO:worker 25169:batch
INFO:worker 25168:done
INFO:worker 25169:done
INFO:worker 25171:batch
INFO:worker 25171:done
Result: 25167 250
Result: 25167 250
Result: 25168 250
Result: 25168 250
Result: 25169 250
Result: 25169 250
Result: 25170 250
Result: 25170 250
Result: 25171 250
Result: 25171 250
Result: 25167 250
Result: 25167 250
Result: 25168 250
Result: 25168 250
Result: 25169 250
Result: 25169 250
Result: 25170 250
Result: 25170 250
Result: 25171 250
Result: 25171 250
Result: 25167 250
Result: 25167 250
Result: 25168 250
Result: 25168 250
Result: 25170 250
Result: 25170 250
Result: 25169 250
Result: 25169 250
Result: 25171 250
Result: 25171 250
Result: 25167 250
Result: 25167 250
Result: 25168 250
Result: 25168 250
Result: 25170 250
Result: 25170 250
Result: 25169 250
Result: 25169 250
Result: 25171 250
Result: 25171 250

So, I'm going to close this issue. Feel free to reopen it if you can provide a reproducer that hangs or breaks when using multiprocessing in this way.

@tseaver tseaver closed this as completed Feb 21, 2018
@chemelnucfin (Contributor)

@tseaver the link is dead. Also, just curious: do you have a Mac or Linux?


tseaver commented Feb 21, 2018

I've updated the link. Linux.
