Load external libraries inside pyspark code

Problem description

I have a Spark cluster that I use in local mode. I want to read a CSV file with the Databricks external library spark-csv. I start my application as follows:

import os
import sys

os.environ["SPARK_HOME"] = "/home/mebuddy/Programs/spark-1.6.0-bin-hadoop2.6"

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

from pyspark import SparkContext, SparkConf, SQLContext

try:
    sc
except NameError:
    print('initializing SparkContext...')
    sc=SparkContext()
sq = SQLContext(sc)
df = sq.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("/my/path/to/my/file.csv")

When I run it, I get the following error:

java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.

My question: how can I load the databricks spark-csv library INSIDE my Python code? I don't want to load it from outside (using --packages), for instance.

I tried to add the following line, but it did not work:

os.environ["SPARK_CLASSPATH"] = '/home/mebuddy/Programs/spark_lib/spark-csv_2.11-1.3.0.jar'

Solution

If you create the SparkContext from scratch you can, for example, set PYSPARK_SUBMIT_ARGS before the SparkContext is initialized:

os.environ["PYSPARK_SUBMIT_ARGS"] = (
  "--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)

sc = SparkContext()
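
For illustration, here is a minimal end-to-end sketch of that idea applied to the question's CSV read. It reuses the file path and read options from the question, assumes the spark-csv 2.11/1.3.0 package coordinates from the original snippet, and assumes that no SparkContext (and therefore no JVM) has been started yet in the process:

import os

# Must be set before the first SparkContext is created, because the driver JVM
# is launched with these arguments; 'pyspark-shell' has to stay at the end.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)

from pyspark import SparkContext, SQLContext

sc = SparkContext()
sq = SQLContext(sc)

# Same read as in the question; the data source should now resolve.
df = (sq.read.format('com.databricks.spark.csv')
        .options(header='true', inferschema='true')
        .load("/my/path/to/my/file.csv"))
df.printSchema()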

If for some reason you expect that the SparkContext has already been initialized, as your code suggests, this won't work. In local mode you could try to use the Py4J gateway and URLClassLoader, but it doesn't look like a good idea and won't work in cluster mode.
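
For completeness, a rough sketch of what that URLClassLoader hack could look like in local mode. The jar path is the one from the question; this only illustrates the mechanism the answer warns against, and there is no guarantee that Spark's data-source lookup will actually see classes loaded this way:

from pyspark import SparkContext

sc = SparkContext()
jvm = sc._jvm  # Py4J view of the driver JVM

# Wrap the local jar (path taken from the question) in a java.net.URL array.
jar_url = jvm.java.net.URL(
    "file:///home/mebuddy/Programs/spark_lib/spark-csv_2.11-1.3.0.jar")
urls = sc._gateway.new_array(jvm.java.net.URL, 1)
urls[0] = jar_url

# Chain a URLClassLoader in front of the current context class loader.
current = jvm.java.lang.Thread.currentThread().getContextClassLoader()
loader = jvm.java.net.URLClassLoader(urls, current)
jvm.java.lang.Thread.currentThread().setContextClassLoader(loader)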
