Data access Spark EC2

Problem description

After following the instructions to install a cluster via the EC2 script, I am not able to launch my .jar correctly, because it does not find the data file that I put in /root/persistent-hdfs/ on the master and slave nodes. I read in another post that I need to prefix the file location with file://, but it doesn't change anything... I have this error:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file://root/persistent-hdfs/data/ds_1.csv
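
For what it's worth, the two slashes are likely the root cause: in a Hadoop URI, file://root/... parses "root" as a host name rather than a directory, so a local absolute path needs three slashes. A minimal sketch in the Spark (Scala) shell, assuming the file actually exists at that local path on every node:

    // file://root/... makes Hadoop treat "root" as a host, hence the
    // "Input path does not exist" error. Three slashes mean an empty
    // host plus an absolute local path. This only works if the file is
    // present at this path on every node of the cluster:
    val local = sc.textFile("file:///root/persistent-hdfs/data/ds_1.csv")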

To launch the job I used ./bin/spark-submit on the master node, am I correct?

Thank you in advance for your support.

Recommended answer

There are a few things you need to do:

  1. The default configuration uses the ephemeral HDFS, so you need to turn that off with $ /root/ephemeral-hdfs/bin/stop-all.sh and turn the persistent one on with $ /root/persistent-hdfs/bin/start-all.sh.
  2. Put your file into the persistent HDFS root directory for simplicity: $ /root/persistent-hdfs/bin/hadoop fs -put /root/ds_1.csv /ds_1.csv. Now check that it is actually there: $ /root/persistent-hdfs/bin/hadoop fs -ls.
  3. Finally, edit Spark's configuration files, /root/spark/conf/spark-defaults.conf and /root/spark/conf/spark-env.sh, and change everything that says ephemeral to persistent (a quick way to verify the switch is sketched after this list).
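
As a sanity check (a hypothetical snippet, not part of the original answer), you can ask Spark which filesystem it now considers the default; after the edit and a restart, it should point at the persistent HDFS NameNode rather than the ephemeral one:

    import org.apache.hadoop.fs.FileSystem

    // Reads the Hadoop configuration the running SparkContext was built with.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    println(fs.getUri) // e.g. an hdfs:// URI naming the persistent NameNode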

Assuming you put your csv in the root directory of the persistent HDFS (as we did in step 2), you can access it in Spark using val rawData = sc.textFile("/ds_1.csv").
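
A short follow-up to sanity-check the load (the comma delimiter and the naive split are illustrative assumptions, not from the original answer):

    val rawData = sc.textFile("/ds_1.csv") // resolved against the default FS
    println(rawData.count())               // line count proves the read worked
    val rows = rawData.map(_.split(","))   // naive CSV split; ignores quoting
    rows.take(3).foreach(r => println(r.mkString(" | ")))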

Have fun!
