Data access Spark EC2
Question
After following the instructions to install a cluster via the EC2 script, I'm not able to launch my .jar correctly because it doesn't find the data file I put in /root/persistent-hdfs/ on the master and slave nodes. I read in another post that I need to prefix the file location with file://, but it doesn't change anything... I get this error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file://root/persistent-hdfs/data/ds_1.csv
To launch the job I used ./bin/spark-submit on the master node. Is that correct?
Thank you in advance for your support.
Answer
- The default configuration uses the ephemeral HDFS, so you need to turn that off:
  $ /root/ephemeral-hdfs/bin/stop-all.sh
  and turn the persistent one on:
  $ /root/persistent-hdfs/bin/start-all.sh
- Put your file into the persistent HDFS root directory for simplicity:
  $ /root/persistent-hdfs/bin/hadoop fs -put /root/ds_1.csv /ds_1.csv
  Now check that it is actually there:
  $ /root/persistent-hdfs/bin/hadoop fs -ls
- Finally, edit Spark's configuration files in /root/spark/conf/spark-defaults.conf and /root/spark/conf/spark-env.sh, and change everything that says ephemeral to persistent.
Assuming you put your csv in the root directory of the persistent HDFS (as we did in step 2), you can access it in Spark using val rawData = sc.textFile("/ds_1.csv").
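Putting the steps together, a minimal job reading the file from persistent HDFS might look like the sketch below. This is an illustration, not code from the original post: the object name DataAccessExample is a placeholder, and it assumes the persistent HDFS from the steps above is running and configured as Spark's default filesystem, so a bare path resolves against it with no file:// prefix.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: read ds_1.csv from the persistent HDFS root.
// Assumes persistent HDFS is started and Spark's configs point at it.
object DataAccessExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DataAccessExample")
    val sc   = new SparkContext(conf)

    // Bare path: resolved against the cluster's default filesystem
    // (persistent HDFS after the config change), not the local disk.
    val rawData = sc.textFile("/ds_1.csv")
    println(s"Line count: ${rawData.count()}")

    sc.stop()
  }
}
```

You would then package this into your .jar and launch it from the master node with ./bin/spark-submit, passing the placeholder class name via --class.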
Have fun!