SparkR and Packages
Question
How does one call packages from Spark so they can be used for data operations with R?
For example, I am trying to access my test.csv in HDFS as below:
Sys.setenv(SPARK_HOME="/opt/spark14")
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext,"hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv","com.databricks.spark.csv", header="true")
but I am getting the following error:
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
I tried loading the CSV package with the option below:
Sys.setenv('SPARKR_SUBMIT_ARGS'='--packages com.databricks:spark-csv_2.10:1.0.3')
but get the following error when initializing the sqlContext:
Launching java with spark-submit command /opt/spark14/bin/spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 /tmp/RtmpuvwOky /backend_port95332e5267b
Error: Cannot load main class from JAR file:/tmp/RtmpuvwOky/backend_port95332e5267b
Any help will be highly appreciated.
Answer
So it looks like by setting SPARKR_SUBMIT_ARGS you are overriding the default value, which is sparkr-shell. You could probably do the same thing and just append sparkr-shell to the end of your SPARKR_SUBMIT_ARGS. This seems unnecessarily complex compared to depending on jars, so I've created a JIRA to track this issue (and I'll try to write a fix if the SparkR people agree with me): https://issues.apache.org/jira/browse/SPARK-8506
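Putting that suggestion together, a minimal sketch (assuming the Spark 1.4 install at /opt/spark14 and the spark-csv_2.10:1.0.3 artifact from the question; this needs a working Spark installation to actually run) would look like:

```r
# Sketch only: keep "sparkr-shell" at the END of SPARKR_SUBMIT_ARGS so the
# default main class is still launched alongside the extra --packages flag.
Sys.setenv(SPARK_HOME = "/opt/spark14")
Sys.setenv("SPARKR_SUBMIT_ARGS" = "--packages com.databricks:spark-csv_2.10:1.0.3 sparkr-shell")

library(SparkR)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# With the package on the classpath, the CSV data source should now resolve:
flights <- read.df(sqlContext,
                   "hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv",
                   "com.databricks.spark.csv", header = "true")
```

The key detail is the trailing sparkr-shell token: without it, spark-submit treats the backend port file as the application JAR, which produces the "Cannot load main class from JAR" error shown above.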
Note: another option would be to use the sparkr command with --packages com.databricks:spark-csv_2.10:1.0.3, since that should work.
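As a sketch of that alternative (assuming the /opt/spark14 install path from the question; requires a Spark installation to run), the package can be supplied directly on the command line when launching the SparkR shell:

```shell
# Sketch only: the sparkR launcher forwards --packages to spark-submit,
# so the default sparkr-shell main class is preserved automatically.
/opt/spark14/bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
```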