Why is locality level ANY for dataset in HDFS?
Question
I ran a Spark cluster of 12 nodes (8 GB memory and 8 cores each) for some tests.
I'm trying to figure out why the data locality of a simple wordcount app in the "map" stage is all "Any". The 14 GB dataset is stored in HDFS.
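For reference, such a test could be launched with the wordcount example bundled with Spark; the class name, jar path, master URL, and HDFS path below are illustrative assumptions, not details from the original question:

spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master spark://your-cluster-public-dns:7077 \
  $SPARK/lib/spark-examples-*.jar \
  hdfs:///path/to/14gb-dataset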
Answer
I have run into the same problem, and in my case it was a problem with the configuration. I was running on EC2 and had a hostname mismatch. Maybe the same thing happened to you.
When you check how HDFS sees your cluster, it should look something along these lines:
hdfs dfsadmin -printTopology
Rack: /default-rack
   172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
   172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
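As a quick sanity check, a hedged sketch: on each worker you can compare the name the node resolves for itself against the private name HDFS reports above; a mismatch here is exactly the kind of problem described in this answer.

# Run on an EC2 worker: both should agree with the internal
# (ip-172-31-...) name and private address shown by printTopology.
hostname -f
getent hosts "$(hostname -f)"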
And the same addresses should be seen for the executors in the web UI (by default it's http://your-cluster-public-dns:8080/).
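The standalone master also serves its cluster state as JSON, which can be easier to inspect from a shell than the HTML page; this is a sketch, with the hostname as a placeholder and the field name assumed:

# List the addresses the registered workers reported to the master.
curl -s http://your-cluster-public-dns:8080/json/ | grep -i '"host"'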
In my case I was using the public hostnames for the Spark slaves. I changed SPARK_LOCAL_IP in $SPARK/conf/spark-env.sh to use the private name as well, and after that change I get NODE_LOCAL most of the time.
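A minimal sketch of that change in $SPARK/conf/spark-env.sh; deriving the private address from the EC2 instance metadata endpoint is an assumption about the setup, and you could just as well hard-code each node's private IP:

# $SPARK/conf/spark-env.sh (on every worker)
# Bind Spark to this node's private address so it matches what HDFS reports.
export SPARK_LOCAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)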