
[SPARK-22640][PYSPARK][YARN]switch python exec on executor side #19840

Closed
wants to merge 1 commit into from

Conversation

yaooqinn
Member

What changes were proposed in this pull request?

PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python \
bin/spark-submit --master yarn --deploy-mode client \ 
--archives ~/anaconda3/envs/py3.zip \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python  \
/home/hadoop/data/apache-spark/spark-2.1.2-bin-hadoop2.7/examples/src/main/python/mllib/correlations_example.py

In the case above, I created a python environment, delivered it via --archives, and then referenced it on the executor nodes via spark.executorEnv.PYSPARK_PYTHON.
But the executors seemed to use PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python instead of spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python, and the application ended with an IOException.

This PR aims to switch the python exec when the user specifies it.

How was this patch tested?

Manually verified with the case above.
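For illustration, the executor-side choice this PR proposes could be sketched as follows. This is a hypothetical Python model, not the actual patch; the function and the dict stand in for the executor's SparkConf lookup:

```python
# Hypothetical sketch: prefer spark.executorEnv.PYSPARK_PYTHON from the
# executor's SparkConf over the pythonExec that the driver captured and
# shipped inside the task closure.

def resolve_executor_python(conf, shipped_exec):
    """conf: dict standing in for the executor's SparkConf;
    shipped_exec: the python exec serialized from the driver."""
    return conf.get("spark.executorEnv.PYSPARK_PYTHON", shipped_exec)

# The driver was launched with a client-side anaconda python ...
shipped = "/home/user/anaconda3/envs/py3/bin/python"
# ... but the archive-relative exec should win on the executor.
conf = {"spark.executorEnv.PYSPARK_PYTHON": "py3.zip/py3/bin/python"}
print(resolve_executor_python(conf, shipped))  # py3.zip/py3/bin/python
```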

@yaooqinn yaooqinn changed the title [SPARK-22640][PYSPARK][YARN]switch python exec in executor side [SPARK-22640][PYSPARK][YARN]switch python exec on executor side Nov 29, 2017
@SparkQA

SparkQA commented Nov 29, 2017

Test build #84280 has finished for PR 19840 at commit 8ff5663.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// All the Python functions should have the same exec, version and envvars.
protected val envVars = funcs.head.funcs.head.envVars
protected val pythonExec = funcs.head.funcs.head.pythonExec
protected val pythonExec = conf.getOption("spark.executorEnv.PYSPARK_DRIVER_PYTHON")
Contributor


Shouldn't this be handled in deploy/PythonRunner?

Member


Yeah, PYSPARK_DRIVER_PYTHON should be handled in deploy/PythonRunner.
This PythonRunner is for executors.

@jiangxb1987
Contributor

cc @cloud-fan @ueshin

@ueshin
Member

ueshin commented Nov 30, 2017

Should we set the pythonExec during the initialization of SparkContext at context.py#L191 instead of in PythonRunner? @jiangxb1987 @cloud-fan

Btw, just for confirmation, is spark.yarn.appMasterEnv.PYSPARK_PYTHON working? @yaooqinn

@yaooqinn
Member Author

yaooqinn commented Dec 1, 2017

It happens when specifying PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python before bin/spark-submit, such as:

PYSPARK_PYTHON=/path/to/a/client/side/bin/python \
bin/spark-submit --master yarn --deploy-mode client \ 
--archives /path/to/a/client/side/python.zip \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python.zip/somedir/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=python.zip/somedir/bin/python  \
exec.py

Whether in yarn client or cluster mode, it fails with:

 Lost task 0.0 in stage 0.0 (TID 0, hzadg-hadoop-dev1.server.163.org, executor 2): java.io.IOException: Cannot run program "/home/hadoop/anaconda2/envs/py2env_take2/bin/python": error=2, No such file or directory


While when not specifying PYSPARK_PYTHON=/path/to/a/client/side/bin/python:

 bin/spark-submit --master yarn --deploy-mode client \ 
--archives /path/to/a/client/side/python.zip \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python.zip/somedir/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=python.zip/somedir/bin/python  \
exec.py

Cluster mode succeeds, but client mode fails because the required modules are not installed for the system default python on the client node.

@jerryshao
Contributor

I think in YARN we have several different ways to set PYSPARK_PYTHON; I guess your issue is which one should take priority?

Can you please:

  1. Define a consistent ordering for such envs (which one should take priority, spark.yarn.appMasterEnv.XXX or XXX), and document it.
  2. Check that it works as expected for spark.yarn.appMasterEnv.PYSPARK_PYTHON and spark.executorEnv.PYSPARK_PYTHON.

I guess other envs will also suffer from this problem, am I right?

@yaooqinn
Member Author

yaooqinn commented Dec 1, 2017


@jerryshao The envs are set correctly by YARN, but the pythonExec is generated in PythonRDD on the driver side and delivered to the executor side within a closure.

@jerryshao
Contributor

jerryshao commented Dec 1, 2017

Oh, I see. You're running in client mode, so this one, --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python, is useless.

So I guess the behavior is expected: the driver will honor the PYSPARK_PYTHON env and ship it to the executors, so the cluster will use the same python executables.

With your above test, /path/to/python is now different for the driver and executors; will that bring in issues? The driver uses PYSPARK_PYTHON and the executors use spark.executorEnv.PYSPARK_PYTHON, which point to different paths.

Normally I think we guarantee that the driver and executors honor the same python path, and that the executables exist across the whole cluster, so we don't need to set PYSPARK_PYTHON on the executor side.

Please correct me if I'm wrong.

@yaooqinn
Member Author

yaooqinn commented Dec 1, 2017

Yes, you are right, we should use the same python executables. But I guess "the same" might mean binary-identical, not just the same path. In client mode the driver runs in the user-side environment, which is flexible, but the default cluster setting might be opaque to users. Since we can use --archives to deliver a whole python environment for executor-side use, why not use it?

@jerryshao
Contributor

I'm a little concerned about such changes: they may be misconfigured and introduce a discrepancy between the driver python and the executor python. At least we should honor the configuration "spark.executorEnv.PYSPARK_PYTHON" unless the executable of the PYSPARK_PYTHON sent by the driver does not exist. (Just my two cents.)

ping @zjffdu, any thoughts?

@vanzin
Contributor

vanzin commented Dec 1, 2017

Instead of setting PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python, what happens if you set PYSPARK_DRIVER_PYTHON=~/anaconda3/envs/py3/bin/python?

(Maybe it doesn't change anything because of PythonFunction.pythonExec, but doesn't hurt to check.)

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

@vanzin PYSPARK_DRIVER_PYTHON won't work because context.py#L191 doesn't deal with it.

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

Tested with spark-2.2.0-bin-hadoop2.7 and numpy, using examples/src/main/python/mllib/correlations_example.py:

case 1

| key | value |
| --- | --- |
| PYSPARK_DRIVER_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | client |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| failure | Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set. |

case 2

| key | value |
| --- | --- |
| PYSPARK_DRIVER_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | cluster |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| failure | java.io.IOException: Cannot run program "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or directory at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:91) |

case 3 & 4

| key | value |
| --- | --- |
| PYSPARK_DRIVER_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | cluster (3) / client (4) |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| failure | Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set. |

case 5 & 6

| key | value |
| --- | --- |
| PYSPARK_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | client (5) / cluster (6) |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| failure | java.io.IOException: Cannot run program "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or directory [executor side PythonRunner] |

case 7

| key | value |
| --- | --- |
| PYSPARK_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | cluster |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| success | -- |

case 8

| key | value |
| --- | --- |
| PYSPARK_PYTHON | ~/anaconda3/envs/py3/bin/python |
| deploy-mode | client |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| failure | java.io.IOException: Cannot run program "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or directory [executor side PythonRunner] |

case 9

| key | value |
| --- | --- |
| PYSPARK_[DRIVER]_PYTHON | not set |
| deploy-mode | client |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| failure | ImportError: No module named numpy |

case 10

| key | value |
| --- | --- |
| PYSPARK_[DRIVER]_PYTHON | not set |
| deploy-mode | cluster |
| --archives | ~/anaconda3/envs/py3.zip |
| spark.yarn.appMasterEnv.PYSPARK_PYTHON | py3.zip/py3/bin/python |
| spark.executorEnv.PYSPARK_DRIVER_PYTHON | py3.zip/py3/bin/python |
| success | -- |

My humble opinions:

  1. spark.executorEnv.PYSPARK_* has no effect on the executor-side pythonExec, which is determined by the driver.
  2. If PYSPARK_PYTHON is specified, then spark.yarn.appMasterEnv. should be suffixed with PYSPARK_PYTHON, not PYSPARK_DRIVER_PYTHON. This may be a priority problem.
  3. Specifying PYSPARK_DRIVER_PYTHON fails in all the cases; it may be because https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L191 only deals with PYSPARK_PYTHON. This should be fixed too.
  4. yarn client mode fails in all the cases above, which means I would have to manually install my client-side python on all cluster nodes to ensure they are all the same.

PS: Why are there two python home variables, PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON?

@ueshin
Copy link
Member

ueshin commented Dec 4, 2017

@yaooqinn What's the difference between case 7 and case 8? They look like the same configuration but with different results.

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

@ueshin case 8 should be client deploy mode; excuse me for the copy mistake, fixed.

@ueshin
Member

ueshin commented Dec 4, 2017

@yaooqinn OK, I see the situation.

In client mode, I think we can't use spark.yarn.appMasterEnv.XXX, which is for cluster mode. So we should use the environment variables PYSPARK_PYTHON or PYSPARK_DRIVER_PYTHON, or the corresponding spark confs, spark.pyspark.python and spark.pyspark.driver.python.

In cluster mode, we can use spark.yarn.appMasterEnv.XXX, and if spark.yarn.appMasterEnv.PYSPARK_PYTHON or spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON exists, it overwrites the original environment variable.

Btw, PYSPARK_DRIVER_PYTHON is only for the driver, not the executors, so we should handle only PYSPARK_PYTHON in the executor, and the priority of PYSPARK_DRIVER_PYTHON is higher than that of PYSPARK_PYTHON in the driver.

Currently we handle only the environment variable but not spark.executorEnv.PYSPARK_PYTHON for the executor, so we should handle it in api/python/PythonRunner as you do now, or in context.py#L191.
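The precedence described here could be modeled roughly like this. It is a sketch of the intent only, with plain dicts standing in for the real env and SparkConf, not Spark's actual code:

```python
# Sketch of the described precedence (assumptions, not Spark source).

def driver_python(env):
    # On the driver, PYSPARK_DRIVER_PYTHON wins over PYSPARK_PYTHON.
    return env.get("PYSPARK_DRIVER_PYTHON") or env.get("PYSPARK_PYTHON") or "python"

def executor_python(conf, shipped_exec):
    # Executors should consider only PYSPARK_PYTHON;
    # spark.executorEnv.PYSPARK_DRIVER_PYTHON is deliberately ignored.
    return conf.get("spark.executorEnv.PYSPARK_PYTHON", shipped_exec)

print(driver_python({"PYSPARK_DRIVER_PYTHON": "py3", "PYSPARK_PYTHON": "py2"}))  # py3
print(executor_python({}, "py2"))  # py2
```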

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

@ueshin I can see spark.executorEnv.PYSPARK_PYTHON in the SparkConf on the executor side, because it is set at context.py#L156 by conf.py#L153.

@ueshin
Member

ueshin commented Dec 4, 2017

@yaooqinn I meant it is not used for pythonExec.

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

@ueshin I see.

@yaooqinn
Member Author

yaooqinn commented Dec 4, 2017

@ueshin Does context.py#L191 set it for both the driver and executors?

@ueshin
Member

ueshin commented Dec 4, 2017

@yaooqinn It is used for executors.

@vanzin
Contributor

vanzin commented Dec 4, 2017

I'm trying to understand what https://github.com/apache/spark/blob/master/python/pyspark/context.py#L191 is really achieving. It seems pretty broken to me, and it feels like the whole pythonExec tracking in the various places should be removed.

It causes this problem because it forces the executor to use the driver's python even if it's been set to a different path by the user.

It uses python instead of sys.executable as the default value.

And it ignores the spark.pyspark.python config value if it's set.

Instead, shouldn't the logic at https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L304 be used in PythonRunner (except for the driver python config) to find out the executor's python to use?
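A rough Python model of the launcher-side lookup referred to here might look like the following. The resolution order is assumed from the description above (confs before env vars, driver-specific entries skipped for executors); it is not a copy of the Java code:

```python
def first_non_empty(*values):
    # Return the first truthy value, mimicking a firstNonEmpty helper.
    return next((v for v in values if v), None)

def launcher_python(conf, env, for_driver=True):
    # Assumed resolution order; the two driver-only entries would be
    # dropped when resolving the executor's python.
    candidates = []
    if for_driver:
        candidates.append(conf.get("spark.pyspark.driver.python"))
    candidates.append(conf.get("spark.pyspark.python"))
    if for_driver:
        candidates.append(env.get("PYSPARK_DRIVER_PYTHON"))
    candidates.append(env.get("PYSPARK_PYTHON"))
    return first_non_empty(*candidates) or "python"

print(launcher_python({}, {}))  # python
print(launcher_python({"spark.pyspark.python": "py3.zip/py3/bin/python"},
                      {"PYSPARK_PYTHON": "/usr/bin/python"}, for_driver=False))
```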

@yaooqinn
Member Author

yaooqinn commented Dec 5, 2017

@vanzin
According to @ueshin's explanation, PYSPARK_DRIVER_PYTHON is only for the driver; if the executor follows the order in SparkSubmitCommandBuilder.java#L304, we may not need so many configs, right?

@vanzin
Contributor

vanzin commented Dec 5, 2017

That's what I said in my comment ("except for the driver python config").

@vanzin
Contributor

vanzin commented Dec 12, 2017

So, any updates here?

@vanzin
Contributor

vanzin commented May 14, 2018

@yaooqinn do you plan to update this PR?

@yaooqinn
Member Author

@vanzin I am not very familiar with the python part (context.py#L191), so I handled it in api/python/PythonRunner as I did in this PR.

Maybe someone else could help; sorry for the delay.

@AmplabJenkins

Build finished. Test FAILed.

@tgravescs
Contributor

I didn't read the entire thread here, but what you want is this:

--archives hdfs:///python36/python36.tgz#python36 \
--conf spark.pyspark.python=./python36/bin/python3.6 \
--conf spark.executorEnv.LD_LIBRARY_PATH=./python36/lib \
--driver-library-path /opt/python36/lib \
--conf spark.pyspark.driver.python=/opt/python36/bin/python3.6

@vanzin
Contributor

vanzin commented Dec 10, 2018

@yaooqinn we should probably close this if you're not planning to look at the root of the problem, which seems to be on the python side.

@yaooqinn yaooqinn closed this Dec 11, 2018