RabitTracker fails to start on IPv6 Spark environment #9118

Closed

dacort opened this issue May 3, 2023 · 8 comments
@dacort commented May 3, 2023

When running a pyspark.ml.Pipeline fit on a SparkXGBRegressor, the pyspark code fails with the following error:

OSError: [Errno 99] Cannot assign requested address
Stack trace:
Traceback (most recent call last):
  File "/tmp/spark-6c8806ce-ad16-4a32-a70a-979b4627b866/booster-test.py", line 71, in <module>
    pipelineModel = pipeline.fit(train)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 205, in fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 134, in _fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 205, in fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/tuning.py", line 847, in _fit
  File "/home/hadoop/environment/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
  File "/home/hadoop/environment/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/tuning.py", line 847, in <lambda>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 351, in wrapped
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/tuning.py", line 113, in singleTask
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 98, in __next__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 156, in fitSingleModel
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 203, in fit
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/spark/core.py", line 864, in _fit
    (config, booster) = _run_job()
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/spark/core.py", line 860, in _run_job
    .collect()[0]
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1197, in collect
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(13, 0) finished unsuccessfully.
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/spark/core.py", line 817, in _train_booster
    _rabit_args = _get_rabit_args(context, num_workers)
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/spark/utils.py", line 78, in _get_rabit_args
    env = _start_tracker(context, n_workers)
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/spark/utils.py", line 64, in _start_tracker
    rabit_context = RabitTracker(host_ip=host, n_workers=n_workers)
  File "/home/hadoop/environment/lib/python3.10/site-packages/xgboost/tracker.py", line 209, in __init__
    sock.bind((host_ip, port))

When testing _get_host_ip in spark/utils.py, it seems that context.getTaskInfos() returns IPv6 addresses when running on a dual-stack IPv4/IPv6 network. Additionally, _get_host_ip splits the address on ":", so for an IPv6 address it returns only the first group of the address.
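A toy illustration of that split-on-colon behavior (plain Python, independent of xgboost; the addresses here are made up):

```python
# Stripping a ":port" suffix with split(":") works for IPv4 but truncates
# an IPv6 address to its first group.
addr_v4 = "10.0.0.5:34567"
addr_v6 = "2600:1f13:4d9:e611::1:34567"

print(addr_v4.split(":")[0])      # 10.0.0.5 -- correct
print(addr_v6.split(":")[0])      # 2600 -- truncated to the first group
print(addr_v6.rsplit(":", 1)[0])  # 2600:1f13:4d9:e611::1 -- rsplit keeps the host
```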

I know that there's some partial support for IPv6, but #7725 is still open. I tried fixing _get_host_ip to return the full IPv6 address, but still got the same error.

I was able to get it to work by instead using get_host_ip from xgboost.tracker, which returns IPv4 addresses, but I don't know if that's the "right" approach.
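A minimal sketch of that workaround (assuming xgboost.tracker.get_host_ip resolves a local IPv4 address, as described above; the replacement body is hypothetical, not the exact change from the branch linked in a later comment):

```python
from xgboost.tracker import get_host_ip

def _get_host_ip(context):
    # Resolve a local IPv4 address on the worker instead of parsing the
    # (possibly IPv6) executor address from context.getTaskInfos().
    return get_host_ip()
```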

@trivialfis (Member)

At the moment, only the dask interface supports IPv6.

@dacort (Author) commented May 3, 2023

@trivialfis Any idea if get_host_ip from xgboost.tracker is an adequate replacement for _get_host_ip? The latter unfortunately returns IPv6 addresses even when the host also has IPv4 addresses.
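To make the "prefer IPv4 when both families are present" idea concrete, a hypothetical helper (pick_ipv4 is my name for illustration, not an xgboost API):

```python
import socket

def pick_ipv4(addresses):
    """Return the first IPv4 literal from a mixed list, else the first entry."""
    for addr in addresses:
        try:
            socket.inet_pton(socket.AF_INET, addr)  # raises OSError if not IPv4
            return addr
        except OSError:
            continue
    return addresses[0]

print(pick_ipv4(["2600:1f13:4d9:e611::1", "10.0.0.5"]))  # 10.0.0.5
```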

@dacort (Author) commented May 3, 2023

Relevant changes staged here: master...dacort:xgboost:fix/spark-hostip

@trivialfis (Member)

cc @WeichenXu123

@trivialfis (Member)

Maybe we should add a customization option, like the one currently in the dask interface, so that users can pick the IP version.
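For reference, a sketch of the dask-side customization this likely refers to: the dask tutorial in the xgboost docs describes pinning the tracker address via dask's config system (treat the exact key name as an assumption if your version differs):

```python
import dask

# Pin the address the Rabit tracker binds to; an IPv6 literal can be
# supplied on IPv6 clusters.
dask.config.set({"xgboost.scheduler_address": "10.0.0.1"})
```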

@WeichenXu123 (Contributor)

Hmm, can we make PySpark XGBoost support IPv6, similar to the dask solution? I am not familiar with the related Rabit interface, though.

@dacort (Author) commented May 5, 2023

That would probably be ideal, though I'm also unsure how much effort that would be. Another short-term option might be to use only IPv4 until IPv6 support can be added?

@trivialfis (Member)

I will close this issue in favor of the original IPv6 support feature request: #7725. Feel free to continue the discussion there.
