Spark Standalone cluster cannot read files in the local filesystem

Problem description

I have a Spark standalone cluster with 2 worker nodes and 1 master node.

Using spark-shell, I was able to read data from a file on the local filesystem, then did some transformations and saved the final RDD in /home/output (let's say). The RDD got saved successfully, but only on one worker node; on the master node only the _SUCCESS file was there.
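
A minimal spark-shell sketch of the steps described above (the input path and the transformation are illustrative, not taken from the question):

// read a file from the local filesystem
val input = sc.textFile("file:///home/input.txt")

// some transformation (placeholder)
val transformed = input.map(_.toUpperCase)

// each executor writes its own partitions to /home/output on the node it
// runs on, and the driver writes _SUCCESS locally, which is why the output
// ends up scattered across machines
transformed.saveAsTextFile("file:///home/output")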

Now, if I want to read this output data back from /home/output, I do not get any data: it finds 0 records on the master, and I am assuming that it does not check the other worker nodes for it.

It would be great if someone could throw some light on why Spark is not reading from all the worker nodes, or what mechanism Spark uses to read data from the worker nodes.

scala> sc.wholeTextFiles("/home/output/")
res7: org.apache.spark.rdd.RDD[(String, String)] = /home/output/ MapPartitionsRDD[5] at wholeTextFiles at <console>:25

scala> res7.count
res8: Long = 0

Recommended answer

You should put the file on all worker machines, at the same path and with the same name. When Spark reads or writes a local (file://) path on a standalone cluster, each executor only sees its own node's filesystem, which is why the output was scattered across workers and why reading /home/output back returns nothing.
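
A sketch of both options, assuming the copy has already been done (hostnames and paths other than /home/output are illustrative):

// Option 1: after copying /home/output (same path, same contents) to every
// worker and the master, e.g. with scp or rsync, the read works:
sc.wholeTextFiles("file:///home/output/").count

// Option 2: avoid the manual copy by writing to storage every node can see,
// such as HDFS (the namenode address below is a placeholder):
// transformed.saveAsTextFile("hdfs://namenode:9000/user/spark/output")
// sc.textFile("hdfs://namenode:9000/user/spark/output").count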
