Why is locality level ANY for dataset in HDFS?
Question
I ran a Spark cluster of 12 nodes (8 GB memory and 8 cores each) for some tests.
I'm trying to figure out why the data locality of a simple wordcount app in the "map" stage is all "Any". The 14 GB dataset is stored in HDFS.
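For reference, such a test could be launched with the wordcount example bundled with Spark; the class name, jar path, master URL, and HDFS path below are illustrative assumptions, not details from the original question:

spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master spark://your-cluster-public-dns:7077 \
  $SPARK/lib/spark-examples-*.jar \
  hdfs:///path/to/14gb-dataset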
Answer
I have run into the same problem, and in my case it was a problem with the configuration. I was running on EC2 and had a hostname mismatch. Maybe the same thing happened to you.
When you check how HDFS sees your cluster, it should look something along these lines:
hdfs dfsadmin -printTopology
Rack: /default-rack
   172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
   172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
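As a quick sanity check, a hedged sketch: on each worker you can compare the name the node resolves for itself against the private name HDFS reports above; a mismatch here is exactly the kind of problem described in this answer.

# Run on an EC2 worker: both should agree with the internal
# (ip-172-31-...) name and private address shown by printTopology.
hostname -f
getent hosts "$(hostname -f)"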
And the same addresses should be seen for the executors in the web UI (by default it's http://your-cluster-public-dns:8080/).
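The standalone master also serves its cluster state as JSON, which can be easier to inspect from a shell than the HTML page; this is a sketch, with the hostname as a placeholder and the field name assumed:

# List the addresses the registered workers reported to the master.
curl -s http://your-cluster-public-dns:8080/json/ | grep -i '"host"'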
In my case I was using the public hostnames for the Spark slaves. I changed SPARK_LOCAL_IP in $SPARK/conf/spark-env.sh to use the private name as well, and after that change I get NODE_LOCAL most of the time.
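A minimal sketch of that change in $SPARK/conf/spark-env.sh; deriving the private address from the EC2 instance metadata endpoint is an assumption about the setup, and you could just as well hard-code each node's private IP:

# $SPARK/conf/spark-env.sh (on every worker)
# Bind Spark to this node's private address so it matches what HDFS reports.
export SPARK_LOCAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)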