Why is locality level ANY for dataset in HDFS?


Question

I ran a Spark cluster of 12 nodes (8 GB of memory and 8 cores each) for some tests.

I'm trying to figure out why the data locality of a simple wordcount app in the "map" stage is all "Any". The 14 GB dataset is stored in HDFS.

(Spark UI screenshots showing the locality level of the map-stage tasks.)
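For context, here is a minimal sketch in Scala of the kind of wordcount job described; the HDFS paths and the app name are assumptions, not taken from the original post:

// Minimal wordcount sketch; hdfs:///data/corpus.txt and the output path are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))
    val counts = sc.textFile("hdfs:///data/corpus.txt")   // map stage: reads HDFS blocks
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                  // shuffle/reduce stage
    counts.saveAsTextFile("hdfs:///data/wordcount-out")
    sc.stop()
  }
}

The locality level reported in the UI refers to the tasks of the map stage, i.e. the `textFile`/`flatMap`/`map` part that reads blocks from HDFS.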

Answer

I ran into the same problem, and in my case it was a configuration issue. I was running on EC2 and had a name mismatch. Maybe the same thing happened to you.

When you check how HDFS sees your cluster, it should be something along these lines:

hdfs dfsadmin -printTopology
Rack: /default-rack
   172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
   172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)

And the same addresses should be seen for the executors in the UI (by default it's http://your-cluster-public-dns:8080/).

In my case I was using the public hostnames for the Spark slaves. I changed SPARK_LOCAL_IP in $SPARK/conf/spark-env.sh to use the private name as well, and after that change I get NODE_LOCAL most of the time.
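For reference, the change would look roughly like this in $SPARK/conf/spark-env.sh on each node (a sketch; the address is a placeholder for the node's own private IP, which should match what hdfs dfsadmin -printTopology reports):

# spark-env.sh (sketch) -- advertise the private address so Spark and HDFS agree on names
# 172.31.xx.xx is a placeholder; use this node's own private IP
export SPARK_LOCAL_IP=172.31.xx.xx

Once Spark and HDFS report the same addresses for each node, the scheduler can match task locations to block locations and the locality level shows up as NODE_LOCAL instead of ANY.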
