
Cloud Spanner with Python multiprocessing hangs #4736

Closed
snthibaud opened this issue Jan 11, 2018 · 44 comments
Labels
api: spanner Issues related to the Spanner API. grpc priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

@snthibaud (Author)

Inserting rows turned out to be very slow (80 KB/s), so I am trying multiprocessing to parallelize the process.
I am currently using only one node in Cloud Spanner, the rows are sorted by primary key and there are ~30 batches that have a few hundred rows each.
For this I am using the following code:

from multiprocessing import Pool

from google.cloud import spanner


def _process_batch(instance_name, database_name, table_name, column_names, b):
    spanner_client = spanner.Client(credentials=get_credentials())
    instance = spanner_client.instance(instance_name)
    database = instance.database(database_name)
    with database.batch() as batch:
        batch.insert(table=table_name, columns=tuple(column_names), values=b)


pool = Pool(5)
pool.starmap(_process_batch, batches)

The code hangs when it tries to enter the batch scope (it seems to never succeed in creating a session).
Any ideas?

@chemelnucfin chemelnucfin added api: spanner Issues related to the Spanner API. type: question Request for information or clarification. Not an issue. labels Jan 11, 2018

vkedia commented Jan 11, 2018

cc @jonparrott

@theacodes (Contributor)

@snthibaud can you let us know what versions you're using? The output of pip freeze would be helpful.


snthibaud commented Jan 12, 2018

@vkedia @jonparrott Thanks for looking into it! These are the versions in my virtual environment:

Package | Installed | Latest
-- | -- | --
Flask | 0.12.2 | 0.12.2
Jinja2 | 2.9.6 | 2.10
MarkupSafe | 1.0 | 1.0
PyYAML | 3.12 | 3.12
TA-Lib | 0.4.10 | 0.4.14
Werkzeug | 0.12.2 | 0.14.1
amqp | 2.2.2 | 2.2.2
argparse | 1.4.0 | 1.4.0
billiard | 3.5.0.3 | 3.5.0.3
cachetools | 2.0.1 | 2.0.1
celery | 4.1.0 | 4.1.0
certifi | 2017.11.5 | 2017.11.5
certitude | 1.0.1 | 1.0.1
cffi | 1.11.2 | 1.11.2
chardet | 3.0.4 | 3.0.4
click | 6.7 | 6.7
cycler | 0.10.0 |  
dateutils | 0.6.6 | 0.6.6
dill | 0.2.7.1 | 0.2.7.1
facebookads | 2.11.1 | 2.11.1
future | 0.16.0 | 0.16.0
gapic-google-cloud-datastore-v1 | 0.15.3 | 0.90.4
google-api-core | 0.1.1 | 0.1.3
google-api-python-client | 1.6.4 | 1.6.4
google-auth | 1.2.0 | 1.2.1
google-auth-httplib2 | 0.0.2 | 0.0.3
google-cloud-bigquery | 0.28.0 | 0.29.0
google-cloud-core | 0.28.0 | 0.28.0
google-cloud-datastore | 1.4.0 | 1.4.0
google-cloud-spanner | 0.29.0 | 0.29.0
google-cloud-storage | 1.6.0 | 1.6.0
google-gax | 0.15.16 | 0.15.16
google-resumable-media | 0.3.1 | 0.3.1
googleapis-common-protos | 1.5.3 | 1.5.3
grpc-google-iam-v1 | 0.11.4 | 0.11.4
grpcio | 1.8.3 | 1.8.3
html5lib | 0.999999999 | 1.0.1
httplib2 | 0.10.3 | 0.10.3
idna | 2.6 | 2.6
itsdangerous | 0.24 | 0.24
json-table-schema | 0.2.1 | 0.2.1
kombu | 4.1.0 | 4.1.0
lxml | 4.1.0 | 4.1.1
matplotlib | 2.0.2 | 2.1.1
messytables | 0.15.2 | 0.15.2
microsoftbotframework | 0.3.0 | 0.3.0
mysqlclient | 1.3.12 | 1.3.12
numpy | 1.13.1 | 1.14.0
oauth2client | 3.0.0 | 4.1.2
pandas | 0.20.3 | 0.22.0
pip | 9.0.1 | 9.0.1
ply | 3.8 | 3.10
proto-google-cloud-datastore-v1 | 0.90.4 | 0.90.4
protobuf | 3.4.0 | 3.5.1
pubnub | 4.0.13 | 4.0.13
pyasn1 | 0.3.7 | 0.4.2
pyasn1-modules | 0.1.5 | 0.2.1
pycparser | 2.18 | 2.18
pycryptodomex | 3.4.7 | 3.4.7
pyparsing | 2.2.0 | 2.2.0
python-dateutil | 2.6.0 | 2.6.1
python-magic | 0.4.13 | 0.4.15
pytz | 2017.2 | 2017.3
requests | 2.18.4 | 2.18.4
rsa | 3.4.2 | 3.4.2
scikit-learn | 0.19.1 | 0.19.1
scipy | 1.0.0 | 1.0.0
setuptools | 36.6.0 | 38.4.0
six | 1.11.0 | 1.11.0
ticketpy | 1.1.2 | 1.1.2
uritemplate | 3.0.0 | 3.0.0
urllib3 | 1.22 | 1.22
vine | 1.1.4 | 1.1.4
webencodings | 0.5.1 | 0.5.1
wheel | 0.29.0 | 0.30.0
xlrd | 1.1.0 | 1.1.0
xlwt | 1.3.0 | 1.3.0


theacodes commented Jan 12, 2018

/cc @nathanielmanistaatgoogle

Are you doing any gRPC work, or using the client at all, before starting the pool? Forking after gRPC has been used apparently causes issues.

FYI: if you're using multiprocessing to increase write throughput (and not actually doing anything CPU-intensive), I'd recommend using threading instead, as RPC calls are I/O-bound, not CPU-bound.
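A rough sketch of that suggestion (all names here are made up for illustration; a real `insert_batch` would call `database.batch()` against Spanner rather than return a count):

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(rows, size):
    # Split the full row list into batches of at most `size` rows.
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def insert_batch(batch):
    # Placeholder for the real I/O-bound work; real code would do:
    #   with database.batch() as b:
    #       b.insert(table=..., columns=..., values=batch)
    return len(batch)


rows = list(range(1000))
batches = chunked(rows, 100)

# One shared client/database object across threads; because the work is
# I/O-bound, the threads overlap the RPC round trips.
with ThreadPoolExecutor(max_workers=5) as executor:
    inserted = sum(executor.map(insert_batch, batches))

print(inserted)  # 1000
```

The point is that no forking happens, so the gRPC channel created by the shared client is never inherited across a process boundary.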

@theacodes theacodes added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. grpc priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed type: question Request for information or clarification. Not an issue. labels Jan 12, 2018
@theacodes (Contributor)

@chemelnucfin Since you've been doing a lot of work on Spanner lately, can I task you with putting together a simple reproducible case for this?


dhermes commented Jan 12, 2018

@jonparrott It'd certainly be possible to allow swapping out all threading.Thread and concurrent.futures.ThreadPoolExecutor constructor invocations for something more generic (though there is at least one place in a Cython file where a threading.Thread is created). However, the C core will still use native OS threads (pthreads on Linux).

@theacodes (Contributor)

@dhermes I'm not sure what you mean?


dhermes commented Jan 12, 2018

@jonparrott I was trying to describe how one might implement support in gRPC for users to set the type of concurrency in a transparent way.

@theacodes (Contributor)

Oh, I don't think it's anything in userland. From what I understand, this is just a case of trying to use any gRPC-based client after forking, not an issue with any concurrency that our client libraries use.

@chemelnucfin (Contributor)

@jonparrott I'll take a look.


chemelnucfin commented Jan 12, 2018

Currently getting this:

__________________________ TestSessionAPI.test_multi_processing_transaction ___________________________
Traceback (most recent call last):
  File "~/projects/google-cloud-python/spanner/tests/system/test_system.py", line 626, in test_multi_processing_transaction
    with self._db.batch() as batch:
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 274, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
    put(task)
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "~/projects/google-cloud-python/spanner/.nox/sys-3-6/lib/python3.6/site-packages/google/cloud/client.py", line 138, in __getstate__
    'Clients have non-trivial state that is local and unpickleable.',
_pickle.PicklingError: Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
----------------------------------------- Captured log setup ------------------------------------------
requests.py                117 DEBUG    Making request: POST https://accounts.google.com/o/oauth2/token
connectionpool.py          824 DEBUG    Starting new HTTPS connection (1): accounts.google.com
connectionpool.py          396 DEBUG    https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None
------------------------------------------ Captured log call ------------------------------------------
requests.py                117 DEBUG    Making request: POST https://accounts.google.com/o/oauth2/token
connectionpool.py          824 DEBUG    Starting new HTTPS connection (1): accounts.google.com
connectionpool.py          396 DEBUG    https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None

@theacodes (Contributor)

@chemelnucfin try creating a new client in each process, as in the OP.
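The pickling error above is inherent to the client object: it holds locks and channel state that cannot cross a process boundary. A minimal illustration of both halves of the advice (FakeClient is a stand-in I made up, not the real spanner.Client):

```python
import pickle
import threading


class FakeClient:
    """Stand-in for spanner.Client: it holds unpicklable state (a lock)."""
    def __init__(self):
        self._lock = threading.RLock()


try:
    pickle.dumps(FakeClient())
    client_pickles = True
except TypeError:
    # RLock objects cannot be pickled, so the "client" can't be sent
    # to a worker process.
    client_pickles = False

# Only plain, picklable arguments (names, row data) should be passed to
# workers; each worker then constructs its own client from them.
args = ("my-instance", "my-database", "my-table")

print(client_pickles)                            # False
print(pickle.loads(pickle.dumps(args)) == args)  # True
```

This is why the OP's approach of passing only the instance/database/table names into `_process_batch` and building the client inside the worker is the right shape.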


chemelnucfin commented Jan 13, 2018

@jonparrott So I have chased the multiprocessing failure down to this:

============================================== FAILURES ==============================================
_________________________ TestDatabaseAPI.test_multi_processing_transaction __________________________
Traceback (most recent call last):
  File "~/projects/google-cloud-python/spanner/tests/system/test_system.py", line 517, in test_multi_processing_transaction
    pool.starmap(_process, batches)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 274, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f7aff313278>'. Reason: 'TypeError("can't pickle _thread.RLock objects",)'
----------------------------------------- Captured log setup -----------------------------------------
requests.py                117 DEBUG    Making request: POST https://accounts.google.com/o/oauth2/token
connectionpool.py          824 DEBUG    Starting new HTTPS connection (1): accounts.google.com
connectionpool.py          396 DEBUG    https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None
----------------------------------------- Captured log call ------------------------------------------
requests.py                117 DEBUG    Making request: POST https://accounts.google.com/o/oauth2/token
connectionpool.py          824 DEBUG    Starting new HTTPS connection (1): accounts.google.com
connectionpool.py          396 DEBUG    https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None
======================================== 61 tests deselected =========================================

The culprit is the request_serializer, which is here in spanner_v1/proto/spanner_pb2.py:

      self.Commit = channel.unary_unary(
          '/google.spanner.v1.Spanner/Commit',
          request_serializer=CommitRequest.SerializeToString,
          response_deserializer=CommitResponse.FromString,
      )

which I have chased to more or less here:
https://github.com/google/protobuf/blob/master/python/google/protobuf/message.py

Any pointers from here will be appreciated.

Also, multiprocessing didn't hang, though; snapshots were OK.

@snthibaud (Author)

@jonparrott In answer to your question: I am using a different instance of the Spanner client before starting the multiprocessing pool. This should be OK, right?
@chemelnucfin I am not completely sure the service account had the right permissions, so maybe it hung because of a missing permission, although I did try again after changing the permissions and it was still hanging.


chemelnucfin commented Jan 15, 2018 via email

@snthibaud (Author)

@chemelnucfin What exactly should I try? I am not sure it was related to the retrieval of a snapshot. For me, it was hanging when entering the scope of a batch. Does it retrieve a snapshot at that point?

@theacodes (Contributor)

I am using a different instance of the spanner client before starting the multiprocessing pool. This should be ok, right?

If I remember correctly, that is the core of the issue: gRPC's C core cannot function after a fork, so any work with gRPC must be done after you fork. Correct me if I'm wrong, @kpayson64 @nathanielmanistaatgoogle.

@chemelnucfin (Contributor)

@jonparrott I have to go back and check, but I believe that is what I did, and multiprocessing worked. I just couldn't use the batch.insert function. I'll report back later.


kpayson64 commented Jan 16, 2018 via email

@chemelnucfin (Contributor)

@snthibaud Could you try a pass statement instead of the batch.insert, or try to get a snapshot, and see where it hangs?

From what I understand, it hangs once you step into the context manager; so the code runs OK without the entire with block?

Could you also try some other operations as I did in #4756 and see if they work?

@snthibaud (Author)

@chemelnucfin When I put a pass statement there, it still hangs. I traced the hang back to line 103 of session.py in the Cloud Spanner library. It tries to create a session on this line: session_pb = api.create_session(self._database.name, options=options).

@snthibaud (Author)

@kpayson64 @jonparrott I understand that in general the client cannot be passed to a new process, but in this case, the main process and all forked processes each have their own client (for the forked processes, see the code above). These processes should be completely separated from each other, right? Is the gRPC core still somehow shared between the main process and child processes?

@snthibaud (Author)

@chemelnucfin This traceback from stopping the process might also help (the frames of several worker processes are interleaved):

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/snthibaud/Code/drivemode_batches/job_exports/output/cloud_spanner_target.py", line 100, in _process_batch
    with database.batch() as batch:
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/snthibaud/Code/drivemode_batches/job_exports/output/cloud_spanner_target.py", line 100, in _process_batch
    with database.batch() as batch:
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/database.py", line 401, in __enter__
    session = self._session = self._database._pool.get()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/database.py", line 401, in __enter__
    session = self._session = self._database._pool.get()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/pool.py", line 230, in get
    session.create()
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/pool.py", line 230, in get
    session.create()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/snthibaud/Code/drivemode_batches/job_exports/output/cloud_spanner_target.py", line 100, in _process_batch
    with database.batch() as batch:
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/session.py", line 103, in create
    session_pb = api.create_session(self._database.name, options=options)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/session.py", line 103, in create
    session_pb = api.create_session(self._database.name, options=options)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/database.py", line 401, in __enter__
    session = self._session = self._database._pool.get()
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/gapic/spanner_client.py", line 330, in create_session
    return self._create_session(request, options)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/gapic/spanner_client.py", line 330, in create_session
    return self._create_session(request, options)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 452, in inner
    return api_caller(api_call, this_settings, request)
  File "/Users/snthibaud/Code/drivemode_batches/job_exports/output/cloud_spanner_target.py", line 100, in _process_batch
    with database.batch() as batch:
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/pool.py", line 230, in get
    session.create()
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 452, in inner
    return api_caller(api_call, this_settings, request)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/database.py", line 401, in __enter__
    session = self._session = self._database._pool.get()
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 438, in base_caller
    return api_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/session.py", line 103, in create
    session_pb = api.create_session(self._database.name, options=options)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 438, in base_caller
    return api_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/pool.py", line 230, in get
    session.create()
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 376, in inner
    return a_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/gapic/spanner_client.py", line 330, in create_session
    return self._create_session(request, options)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 376, in inner
    return a_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/session.py", line 103, in create
    session_pb = api.create_session(self._database.name, options=options)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 121, in inner
    return to_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 452, in inner
    return api_caller(api_call, this_settings, request)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 121, in inner
    return to_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 438, in base_caller
    return api_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/spanner_v1/gapic/spanner_client.py", line 330, in create_session
    return self._create_session(request, options)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
    return a_func(*updated_args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
    return a_func(*updated_args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 376, in inner
    return a_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 452, in inner
    return api_caller(api_call, this_settings, request)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 483, in __call__
    credentials)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 483, in __call__
    credentials)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 121, in inner
    return to_call(*args)
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 438, in base_caller
    return api_call(*args)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 477, in _blocking
    _handle_event(completion_queue.poll(), state,
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
    return a_func(*updated_args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 477, in _blocking
    _handle_event(completion_queue.poll(), state,
  File "/usr/local/lib/python3.6/site-packages/google/gax/api_callable.py", line 376, in inner
    return a_func(*args, **kwargs)
  File "src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi", line 100, in grpc._cython.cygrpc.CompletionQueue.poll
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 483, in __call__
    credentials)
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 121, in inner
    return to_call(*args)
  File "src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi", line 100, in grpc._cython.cygrpc.CompletionQueue.poll
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 477, in _blocking
    _handle_event(completion_queue.poll(), state,
  File "/usr/local/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
    return a_func(*updated_args, **kwargs)
KeyboardInterrupt
  File "src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi", line 100, in grpc._cython.cygrpc.CompletionQueue.poll
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 483, in __call__
    credentials)

@snthibaud (Author)

@kpayson64 @jonparrott @chemelnucfin I have tried creating all clients only after forking, and that did work. Perhaps the gRPC core is shared between the processes after all. I still feel it's not quite right that subprocesses hang in this case.
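A minimal sketch of the pattern that worked here (all names illustrative; a real worker would construct spanner.Client() inside the child process). Using the spawn start method, an option not discussed above, makes the separation explicit, since each worker begins in a fresh interpreter and so cannot inherit any gRPC state from the parent:

```python
import multiprocessing as mp


def process_batch(instance_name, batch):
    # A real worker would construct spanner.Client() here, inside the
    # child process, so no gRPC state exists before the process starts.
    return (instance_name, len(batch))


if __name__ == "__main__":
    # "spawn" launches a fresh Python interpreter per worker instead of
    # forking, so nothing created in the parent is inherited.
    ctx = mp.get_context("spawn")
    batches = [("inst", [1, 2]), ("inst", [3, 4, 5])]
    with ctx.Pool(2) as pool:
        results = pool.starmap(process_batch, batches)
    print(results)
```

With the default fork start method the same structure also works, as long as nothing in the parent touches gRPC before the pool is created.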

@chemelnucfin (Contributor)

Glad to hear it.

@jonparrott Now I'm confused why my code did work.

@snthibaud (Author)

@jonparrott @chemelnucfin I am trying to use threads instead of processes (as you suggested) to increase write throughput as follows:

with ThreadPoolExecutor(max_workers=16) as executor:
    for batch_start in range(0, len(data.rows), rows_per_batch):
        executor.submit(
            _process_batch, database, self.config[INSTANCE], self.config[DATABASE],
            self.config[TABLE], data_columns,
            data.rows[batch_start:batch_start + rows_per_batch])

But then I get Process finished with exit code 139 (interrupted by signal 11: SIGSEGV) after a few threads have finished. I am using the same client in all threads. Is that wrong?

@theacodes (Contributor)

@snthibaud

Perhaps the gRPC core is shared between the processes after all.

That is my understanding, so using any gRPC client before forking will cause issues in the forked processes.

I am using the same client in all threads. Is that wrong?

Our client should be thread-safe, and in any case, should not cause a segfault. This seems to point to some native code causing issues. @kpayson64 can you help out here?


kpayson64 commented Jan 17, 2018

Yes, that looks like a core segfault, but without any additional information it's hard to diagnose.

It's possible that you are running into grpc/grpc#13327, which as I understand it occurs when there are many active threads.


vkedia commented Jan 17, 2018

I believe @haih-g also ran into this issue while using spanner client across multiple threads.

@snthibaud (Author)

@kpayson64 Yes, that's almost certainly it. I am also using Python 3.6, like most users in the issue you mention.

@chemelnucfin (Contributor)

@snthibaud Since I believe this is now a different issue, also reported in a different repository, I will close this one. Please feel free to reopen it, or open another issue, if desired.

Thanks.


vkedia commented Jan 19, 2018

Actually, I would like to reopen it. The underlying issue might be in gRPC, but it will definitely impact Spanner customers. At the least, we should have some documentation and recommendations on how customers can safely use the Spanner client across processes and across threads.

@theacodes theacodes reopened this Jan 19, 2018
@theacodes (Contributor)

Agreed; we should have documentation that states explicitly that you cannot use this client in a multiprocessing environment unless you defer all calls until after forking.


chemelnucfin commented Jan 19, 2018

@jonparrott @vkedia I was going to follow up in the PR or another issue regarding the multiprocessing issues. Sorry!


vkedia commented Jan 19, 2018

Thanks. What about multithreading? Does the client not work at all with multithreading, or does it work under certain conditions?

@theacodes (Contributor)

The client should be thread-safe; if there are any issues with that, we should solve them.


vkedia commented Jan 19, 2018

But there seems to be some issue with gRPC which is causing a segmentation fault in the Spanner client when it is used across multiple threads. How do we plan to tackle that?

@theacodes (Contributor)

@vkedia that may already be resolved (see grpc/grpc#13327 (comment)), but we need someone to test. So far it has been difficult to find a reproducible case.

@chemelnucfin (Contributor)

@snthibaud Is multithreading still segfaulting, and multiprocessing still hanging and causing you problems, after the fix @jonparrott mentioned?

@kpayson64 I wrote a test in #4756 which creates the pool after client creation, and it does not hang. However, you and @snthibaud mention that it should hang. Did I write something wrong? Thank you.

@chemelnucfin (Contributor)

@snthibaud From this comment, grpc/grpc#12455 (comment), it seems the problem only occurs on Mac and not Linux. May I ask if you are using a Mac?
Thanks.

@chemelnucfin chemelnucfin added this to To Do in P1 Bugs Jan 31, 2018
@srini100

Folks, gRPC issue 13327 was recently fixed and verified to address some known segfault cases involving multithreading. This fix in gRPC is related to a bug in internal refcount logic and NOT related to fork(). If anyone on this thread is able to reproduce this issue, it is worth retrying with gRPC release 1.9.1. I am curious to know whether there is a way to reproduce the hang/crash when forking before client creation.


tseaver commented Feb 21, 2018

@srini100 FWIW, our system tests for Spanner use multithreading and don't crash.

I just created a multiprocessing testcase which runs cleanly:

$ .nox/sys-3-6/bin/python mp_test.py 
INFO:main:Creating MP pool
INFO:main:Database exists: mp_test
INFO:main:Deleting all items in table: counters
INFO:main:Dispatching to worker processes.
INFO:worker 25167:start
INFO:worker 25168:start
INFO:worker 25169:start
INFO:worker 25170:start
INFO:worker 25171:start
INFO:worker 25167:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25169:batch
INFO:worker 25171:batch
INFO:worker 25168:batch
INFO:worker 25170:batch
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25167:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25168:batch
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25169:batch
INFO:worker 25170:batch
INFO:worker 25171:batch
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25167:batch
INFO:worker 25168:batch
INFO:worker 25170:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25169:batch
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25171:batch
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25167:batch
INFO:worker 25168:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25170:batch
INFO:worker 25169:batch
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25171:batch
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25167:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25168:batch
INFO:worker 25170:batch
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25169:batch
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25171:batch
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25167:batch
INFO:worker 25168:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25170:batch
INFO:worker 25169:batch
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25171:batch
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25169:batch
INFO:worker 25170:batch
INFO:worker 25170:done
INFO:worker 25170:start
INFO:worker 25169:done
INFO:worker 25169:start
INFO:worker 25167:batch
INFO:worker 25167:done
INFO:worker 25167:start
INFO:worker 25168:batch
INFO:worker 25168:done
INFO:worker 25168:start
INFO:worker 25171:batch
INFO:worker 25171:done
INFO:worker 25171:start
INFO:worker 25170:batch
INFO:worker 25170:done
INFO:worker 25167:batch
INFO:worker 25167:done
INFO:worker 25168:batch
INFO:worker 25169:batch
INFO:worker 25168:done
INFO:worker 25169:done
INFO:worker 25171:batch
INFO:worker 25171:done
Result: 25167 250
Result: 25167 250
Result: 25168 250
Result: 25168 250
Result: 25169 250
Result: 25169 250
Result: 25170 250
Result: 25170 250
Result: 25171 250
Result: 25171 250
Result: 25167 250
Result: 25167 250
Result: 25168 250
Result: 25168 250
Result: 25169 250
Result: 25169 250
Result: 25170 250
Result: 25170 250
Result: 25171 250
Result: 25171 250
Result: 25167 250
Result: 25167 250
Result: 25168 250
Result: 25168 250
Result: 25170 250
Result: 25170 250
Result: 25169 250
Result: 25169 250
Result: 25171 250
Result: 25171 250
Result: 25167 250
Result: 25167 250
Result: 25168 250
Result: 25168 250
Result: 25170 250
Result: 25170 250
Result: 25169 250
Result: 25169 250
Result: 25171 250
Result: 25171 250

So, I'm going to close this issue. Feel free to reopen it if you can provide a reproducer that hangs or breaks when using multiprocessing in this way.

@tseaver tseaver closed this as completed Feb 21, 2018
@chemelnucfin (Contributor)

@tseaver the link is dead. Also, just curious: do you have a Mac or Linux?


tseaver commented Feb 21, 2018

I've updated the link. Linux.
