
[SPARK-22640][PYSPARK][YARN]switch python exec on executor side #19840

Closed
wants to merge 1 commit into from

Conversation

yaooqinn
Member

What changes were proposed in this pull request?

PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python \
bin/spark-submit --master yarn --deploy-mode client \ 
--archives ~/anaconda3/envs/py3.zip \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python  \
/home/hadoop/data/apache-spark/spark-2.1.2-bin-hadoop2.7/examples/src/main/python/mllib/correlations_example.py

In the case above, I created a python environment, delivered it via --archives, and then referenced it on the executor nodes via spark.executorEnv.PYSPARK_PYTHON.
But the executors seemed to use PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python instead of spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python, and the application ended with an IOException.

This PR aims to switch the python exec when the user specifies it.

How was this patch tested?

Manually verified with the case above.
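For illustration, the executor-side choice this PR proposes could be sketched as follows. This is a hypothetical Python model, not the actual patch; the function and the dict stand in for the executor's SparkConf lookup:

```python
# Hypothetical sketch: prefer spark.executorEnv.PYSPARK_PYTHON from the
# executor's SparkConf over the pythonExec that the driver captured and
# shipped inside the task closure.

def resolve_executor_python(conf, shipped_exec):
    """conf: dict standing in for the executor's SparkConf;
    shipped_exec: the python exec serialized from the driver."""
    return conf.get("spark.executorEnv.PYSPARK_PYTHON", shipped_exec)

# The driver was launched with a client-side anaconda python ...
shipped = "/home/user/anaconda3/envs/py3/bin/python"
# ... but the archive-relative exec should win on the executor.
conf = {"spark.executorEnv.PYSPARK_PYTHON": "py3.zip/py3/bin/python"}
print(resolve_executor_python(conf, shipped))  # py3.zip/py3/bin/python
```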

@yaooqinn yaooqinn changed the title [SPARK-22640][PYSPARK][YARN]switch python exec in executor side [SPARK-22640][PYSPARK][YARN]switch python exec on executor side Nov 29, 2017
@SparkQA

SparkQA commented Nov 29, 2017

Test build #84280 has finished for PR 19840 at commit 8ff5663.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// All the Python functions should have the same exec, version and envvars.
protected val envVars = funcs.head.funcs.head.envVars
protected val pythonExec = funcs.head.funcs.head.pythonExec
protected val pythonExec = conf.getOption("spark.executorEnv.PYSPARK_DRIVER_PYTHON")
Contributor


Shouldn't this be handled in deploy/PythonRunner?

Member


Yeah, PYSPARK_DRIVER_PYTHON should be handled in deploy/PythonRunner.
This PythonRunner is for executors.

@jiangxb1987
Contributor

cc @cloud-fan @ueshin

@ueshin
Member

ueshin commented Nov 30, 2017

Should we set the pythonExec during the initialization of SparkContext at context.py#L191 instead of in PythonRunner? @jiangxb1987 @cloud-fan

Btw, just for confirmation, is spark.yarn.appMasterEnv.PYSPARK_PYTHON working? @yaooqinn

@yaooqinn
Member Author

yaooqinn commented Dec 1, 2017

It happens when specifying PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python before bin/spark-submit, such as:

PYSPARK_PYTHON=/path/to/a/client/side/bin/python \
bin/spark-submit --master yarn --deploy-mode client \ 
--archives /path/to/a/client/side/python.zip \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python.zip/somedir/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=python.zip/somedir/bin/python  \
exec.py

Whether in yarn client or cluster mode, it fails with:

 Lost task 0.0 in stage 0.0 (TID 0, hzadg-hadoop-dev1.server.163.org, executor 2): java.io.IOException: Cannot run program "/home/hadoop/anaconda2/envs/py2env_take2/bin/python": error=2, No such file or directory


While when not specifying PYSPARK_PYTHON=/path/to/a/client/side/bin/python:

 bin/spark-submit --master yarn --deploy-mode client \ 
--archives /path/to/a/client/side/python.zip \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python.zip/somedir/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=python.zip/somedir/bin/python  \
exec.py

Cluster mode succeeds, but client mode fails because the required modules are not installed for the system default python on the client node.

@jerryshao
Contributor

I think in YARN we have several different ways to set PYSPARK_PYTHON; I guess your issue is which one should take priority?

Can you please:

  1. Define a consistent ordering for such envs (which one should take priority, spark.yarn.appMasterEnv.XXX or XXX), and document it.
  2. Check that it works as expected for spark.yarn.appMasterEnv.PYSPARK_PYTHON and spark.executorEnv.PYSPARK_PYTHON.

I guess other envs will also suffer from this problem, am I right?

@yaooqinn
Member Author

yaooqinn commented Dec 1, 2017


@jerryshao The envs are set correctly by YARN, but the pythonExec is generated in PythonRDD on the driver side and delivered to the executor side within a closure.

@jerryshao
Contributor

jerryshao commented Dec 1, 2017

Oh, I see. You're running in client mode, so this one, --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python, is useless.

So I guess the behavior is expected: the driver will honor the PYSPARK_PYTHON env and ship it to the executors, so the cluster will use the same python executables.

With your above test, /path/to/python is now different for the driver and executors; will that bring in issues? The driver uses PYSPARK_PYTHON and the executors use spark.executorEnv.PYSPARK_PYTHON, which point to different paths.

Normally I think we guarantee that the driver and executors honor the same python path, and that the executables exist across the whole cluster, so we don't need to set PYSPARK_PYTHON on the executor side.

Please correct me if I'm wrong.

@yaooqinn
Member Author

yaooqinn commented Dec 1, 2017

Yes, you are right, we should use the same python executables. But I guess "the same" might mean binary-identical, not just the same path. In client mode the driver runs in the user-side environment, which is flexible, but the default cluster setting might be opaque to users. Since we can use --archives to deliver a whole python environment for executor-side use, why not use it?

@jerryshao
Contributor

I'm a little concerned about such changes: they may be misconfigured and introduce a discrepancy between the driver python and the executor python. At least we should honor the configuration "spark.executorEnv.PYSPARK_PYTHON" unless the executable of the PYSPARK_PYTHON sent by the driver does not exist. (Just my two cents.)

ping @zjffdu, any thoughts?

@vanzin
Contributor

vanzin commented Dec 1, 2017

Instead of setting PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python, what happens if you set PYSPARK_DRIVER_PYTHON=~/anaconda3/envs/py3/bin/python?

(Maybe it doesn't change anything because of PythonFunction.pythonExec, but doesn't hurt to check.)

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

@vanzin PYSPARK_DRIVER_PYTHON won't work because context.py#L191 doesn't deal with it.

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

Tested with spark-2.2.0-bin-hadoop2.7 and numpy, using examples/src/main/python/mllib/correlations_example.py:

case 1

| key | value |
| --- | --- |
| PYSPARK_DRIVER_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | client |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| failure | Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set. |

case 2

| key | value |
| --- | --- |
| PYSPARK_DRIVER_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | cluster |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| failure | java.io.IOException: Cannot run program "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or directory at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:91) |

case 3 & 4

| key | value |
| --- | --- |
| PYSPARK_DRIVER_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | cluster (3) / client (4) |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| failure | Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set. |

case 5 & 6

| key | value |
| --- | --- |
| PYSPARK_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | client (5) / cluster (6) |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| failure | java.io.IOException: Cannot run program "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or directory [executor side PythonRunner] |

case 7

| key | value |
| --- | --- |
| PYSPARK_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | cluster |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| success | -- |

case 8

| key | value |
| --- | --- |
| PYSPARK_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | client |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| failure | java.io.IOException: Cannot run program "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or directory [executor side PythonRunner] |

case 9

| key | value |
| --- | --- |
| PYSPARK_[DRIVER]_PYTHON | not set |
| deploy-mode | client |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| failure | ImportError: No module named numpy |

case 10

| key | value |
| --- | --- |
| PYSPARK_[DRIVER]_PYTHON | not set |
| deploy-mode | cluster |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| success | -- |

My humble opinions:

  1. spark.executorEnv.PYSPARK_* has no effect on the executor-side pythonExec, which is determined by the driver.
  2. If PYSPARK_PYTHON is specified, then spark.yarn.appMasterEnv. should be suffixed with PYSPARK_PYTHON, not PYSPARK_DRIVER_PYTHON. This may be a priority problem.
  3. Specifying PYSPARK_DRIVER_PYTHON fails in all the cases; it may be because https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L191 only deals with PYSPARK_PYTHON. This should be fixed too.
  4. yarn client mode fails in all the cases above, which means I would have to manually install my client-side python on all cluster nodes to ensure they are all the same.

PS: Why are there two python home variables, PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON?

@ueshin
Copy link
Member

ueshin commented Dec 4, 2017

@yaooqinn What's the difference between case 7 and case 8? They look like the same configuration but with different results.

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

@ueshin case 8 should be client deploy mode; excuse me for the copy mistake, fixed.

@ueshin
Member

ueshin commented Dec 4, 2017

@yaooqinn OK, I see the situation.

In client mode, I think we can't use spark.yarn.appMasterEnv.XXX, which is for cluster mode. So we should use the environment variables PYSPARK_PYTHON or PYSPARK_DRIVER_PYTHON, or the corresponding spark confs, spark.pyspark.python and spark.pyspark.driver.python.

In cluster mode, we can use spark.yarn.appMasterEnv.XXX, and if spark.yarn.appMasterEnv.PYSPARK_PYTHON or spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON exists, it overwrites the original environment variable.

Btw, PYSPARK_DRIVER_PYTHON is only for the driver, not the executors, so we should handle only PYSPARK_PYTHON in the executor, and the priority of PYSPARK_DRIVER_PYTHON is higher than that of PYSPARK_PYTHON in the driver.

Currently we handle only the environment variable but not spark.executorEnv.PYSPARK_PYTHON for the executor, so we should handle it in api/python/PythonRunner as you do now, or in context.py#L191.
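The precedence described here could be modeled roughly like this. It is a sketch of the intent only, with plain dicts standing in for the real env and SparkConf, not Spark's actual code:

```python
# Sketch of the described precedence (assumptions, not Spark source).

def driver_python(env):
    # On the driver, PYSPARK_DRIVER_PYTHON wins over PYSPARK_PYTHON.
    return env.get("PYSPARK_DRIVER_PYTHON") or env.get("PYSPARK_PYTHON") or "python"

def executor_python(conf, shipped_exec):
    # Executors should consider only PYSPARK_PYTHON;
    # spark.executorEnv.PYSPARK_DRIVER_PYTHON is deliberately ignored.
    return conf.get("spark.executorEnv.PYSPARK_PYTHON", shipped_exec)

print(driver_python({"PYSPARK_DRIVER_PYTHON": "py3", "PYSPARK_PYTHON": "py2"}))  # py3
print(executor_python({}, "py2"))  # py2
```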

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

@ueshin I can see spark.executorEnv.PYSPARK_PYTHON in the SparkConf on the executor side, because it is set at context.py#L156 by conf.py#L153.

@ueshin
Member

ueshin commented Dec 4, 2017

@yaooqinn I meant it is not used for pythonExec.

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

@ueshin I see.

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

@ueshin Does context.py#L191 set it for both the driver and executors?

@ueshin
Member

ueshin commented Dec 4, 2017

@yaooqinn It is used for executors.

@vanzin
Contributor

vanzin commented Dec 4, 2017

I'm trying to understand what https://github.com/apache/spark/blob/master/python/pyspark/context.py#L191 is really achieving. It seems pretty broken to me, and it feels like the whole pythonExec tracking in the various places should be removed.

It causes this problem because it forces the executor to use the driver's python even if it's been set to a different path by the user.

It uses python instead of sys.executable as the default value.

And it ignores the spark.pyspark.python config value if it's set.

Instead, shouldn't the logic at https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L304 be used in PythonRunner (except for the driver python config) to find out the executor's python to use?
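A rough Python model of the launcher-side lookup referred to here might look like the following. The resolution order is assumed from the description above (confs before env vars, driver-specific entries skipped for executors); it is not a copy of the Java code:

```python
def first_non_empty(*values):
    # Return the first truthy value, mimicking a firstNonEmpty helper.
    return next((v for v in values if v), None)

def launcher_python(conf, env, for_driver=True):
    # Assumed resolution order; the two driver-only entries would be
    # dropped when resolving the executor's python.
    candidates = []
    if for_driver:
        candidates.append(conf.get("spark.pyspark.driver.python"))
    candidates.append(conf.get("spark.pyspark.python"))
    if for_driver:
        candidates.append(env.get("PYSPARK_DRIVER_PYTHON"))
    candidates.append(env.get("PYSPARK_PYTHON"))
    return first_non_empty(*candidates) or "python"

print(launcher_python({}, {}))  # python
print(launcher_python({"spark.pyspark.python": "py3.zip/py3/bin/python"},
                      {"PYSPARK_PYTHON": "/usr/bin/python"}, for_driver=False))
```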

@yaooqinn
Member Author

yaooqinn commented Dec 5, 2017

@vanzin
According to @ueshin's explanation, PYSPARK_DRIVER_PYTHON is only for the driver; if the executor follows the order in SparkSubmitCommandBuilder.java#L304, we may not need so many configs, right?

@vanzin
Contributor

vanzin commented Dec 5, 2017

That's what I said in my comment ("except for the driver python config").

@vanzin
Contributor

vanzin commented Dec 12, 2017

So, any updates here?

@vanzin
Contributor

vanzin commented May 14, 2018

@yaooqinn do you plan to update this PR?

@yaooqinn
Member Author

@vanzin I am not very familiar with the python part (context.py#L191), so I handled it in api/python/PythonRunner as I did in this PR.

Maybe someone else could help; sorry for the delay.

@AmplabJenkins

Build finished. Test FAILed.

@tgravescs
Contributor

I didn't read the entire thread here, but what you want is this:

--archives hdfs:///python36/python36.tgz#python36 \
--conf spark.pyspark.python=./python36/bin/python3.6 \
--conf spark.executorEnv.LD_LIBRARY_PATH=./python36/lib \
--driver-library-path /opt/python36/lib \
--conf spark.pyspark.driver.python=/opt/python36/bin/python3.6

@vanzin
Contributor

vanzin commented Dec 10, 2018

@yaooqinn we should probably close this if you're not planning to look at the root of the problem, which seems to be on the python side.

@yaooqinn yaooqinn closed this Dec 11, 2018