RabitTracker fails to start on IPv6 Spark environment #9118

Closed

dacort opened this issue May 3, 2023 · 8 comments
@dacort commented May 3, 2023

When running a pyspark.ml.Pipeline fit on a SparkXGBRegressor, the pyspark code fails with the following error:

OSError: [Errno 99] Cannot assign requested address
Stack trace:
Traceback (most recent call last):
  File "/tmp/spark-6c8806ce-ad16-4a32-a70a-979b4627b866/booster-test.py", line 71, in <module>
    pipelineModel = pipeline.fit(train)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 205, in fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 134, in _fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 205, in fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/tuning.py", line 847, in _fit
  File "/home/hadoop/environment/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
  File "/home/hadoop/environment/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/tuning.py", line 847, in <lambda>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 351, in wrapped
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/tuning.py", line 113, in singleTask
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 98, in __next__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 156, in fitSingleModel
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 203, in fit
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/spark/core.py", line 864, in _fit
    (config, booster) = _run_job()
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/spark/core.py", line 860, in _run_job
    .collect()[0]
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1197, in collect
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(13, 0) finished unsuccessfully.
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/spark/core.py", line 817, in _train_booster
    _rabit_args = _get_rabit_args(context, num_workers)
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/spark/utils.py", line 78, in _get_rabit_args
    env = _start_tracker(context, n_workers)
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/spark/utils.py", line 64, in _start_tracker
    rabit_context = RabitTracker(host_ip=host, n_workers=n_workers)
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/tracker.py", line 209, in __init__
    sock.bind((host_ip, port))

When testing _get_host_ip in spark/utils.py, it seems that context.getTaskInfos() returns IPv6 addresses when running on a dual-stack IPv4/IPv6 network. Additionally, _get_host_ip splits the address on ":", so for an IPv6 address it returns only the first group of the address.
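A toy illustration of that split-on-colon behavior (plain Python, independent of xgboost; the addresses here are made up):

```python
# Stripping a ":port" suffix with split(":") works for IPv4 but truncates
# an IPv6 address to its first group.
addr_v4 = "10.0.0.5:34567"
addr_v6 = "2600:1f13:4d9:e611::1:34567"

print(addr_v4.split(":")[0])      # 10.0.0.5 -- correct
print(addr_v6.split(":")[0])      # 2600 -- truncated to the first group
print(addr_v6.rsplit(":", 1)[0])  # 2600:1f13:4d9:e611::1 -- rsplit keeps the host
```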

I know that there's some partial support for IPv6, but #7725 is still open. I tried fixing _get_host_ip to return the full IPv6 address, but still got the same error.

I was able to get it to work by instead using get_host_ip from xgboost.tracker, which returns IPv4 addresses, but I don't know if that's the "right" approach.
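A minimal sketch of that workaround (assuming xgboost.tracker.get_host_ip resolves a local IPv4 address, as described above; the replacement body is hypothetical, not the exact change from the branch linked in a later comment):

```python
from xgboost.tracker import get_host_ip

def _get_host_ip(context):
    # Resolve a local IPv4 address on the worker instead of parsing the
    # (possibly IPv6) executor address from context.getTaskInfos().
    return get_host_ip()
```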

@trivialfis (Member)

At the moment, only the dask interface supports IPv6.

@dacort (Author) commented May 3, 2023

@trivialfis Any idea if get_host_ip from xgboost.tracker is an adequate replacement for _get_host_ip? The latter unfortunately returns IPv6 addresses even when the host also has IPv4 addresses.
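To make the "prefer IPv4 when both families are present" idea concrete, a hypothetical helper (pick_ipv4 is my name for illustration, not an xgboost API):

```python
import socket

def pick_ipv4(addresses):
    """Return the first IPv4 literal from a mixed list, else the first entry."""
    for addr in addresses:
        try:
            socket.inet_pton(socket.AF_INET, addr)  # raises OSError if not IPv4
            return addr
        except OSError:
            continue
    return addresses[0]

print(pick_ipv4(["2600:1f13:4d9:e611::1", "10.0.0.5"]))  # 10.0.0.5
```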

@dacort (Author) commented May 3, 2023

Relevant changes staged here: master...dacort:xgboost:fix/spark-hostip

@trivialfis (Member)

cc @WeichenXu123

@trivialfis (Member)

Maybe we should add a customization option, like the one currently in the dask interface, so that users can pick the IP version.
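For reference, a sketch of the dask-side customization this likely refers to: the dask tutorial in the xgboost docs describes pinning the tracker address via dask's config system (treat the exact key name as an assumption if your version differs):

```python
import dask

# Pin the address the Rabit tracker binds to; an IPv6 literal can be
# supplied on IPv6 clusters.
dask.config.set({"xgboost.scheduler_address": "10.0.0.1"})
```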

@WeichenXu123 (Contributor)

Hmm, can we make PySpark XGBoost support IPv6, similar to the dask solution? I am not familiar with the related Rabit interface, though.

@dacort (Author) commented May 5, 2023

That would probably be ideal, though I'm also unsure how much effort that would be. Another short-term option might be to use only IPv4 until IPv6 support can be added?

@trivialfis (Member)

I will close this issue in favor of the original IPv6 support feature request: #7725. Feel free to continue the discussion there.
