Why is Spark fast at word count?

Question

Test case: word count over 6 GB of data in 20+ seconds with Spark.
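
For reference, the canonical Spark word count looks roughly like this (a minimal sketch; the app name and the input/output paths are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        sc.textFile("hdfs:///data/input-6g.txt")        // input path is hypothetical
          .flatMap(_.split("\\s+"))                     // split each line into words
          .map(word => (word, 1))                       // pair each word with a count of 1
          .reduceByKey(_ + _)                           // sum counts per word (one shuffle)
          .saveAsTextFile("hdfs:///data/word-counts")   // output path is hypothetical
        sc.stop()
      }
    }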

I understand the MapReduce, FP, and stream programming models, but couldn't figure out why the word count is so amazingly fast.

I think this is an I/O-intensive computation, and it seems impossible to scan 6 GB of files in 20+ seconds. I guess some index is built before the word count, like Lucene does. The magic should be in the RDD (Resilient Distributed Dataset) design, which I don't understand well enough.

I would appreciate it if anyone could explain RDDs for the word-counting case. Thanks!

Answer

First is startup time. Starting a Hadoop MapReduce job requires spinning up a number of separate JVMs, which is not fast. Starting a Spark job (on an existing Spark cluster) causes an existing JVM to fork new task threads, which is many times faster than starting a JVM.

Next, there is no indexing and no magic. The 6 GB file is stored as 47 blocks of 128 MB each. Imagine a Hadoop cluster big enough that all 47 of these HDFS blocks reside on different JBOD HDDs. Each of them would deliver a 70 MB/sec scan rate, which means you can read this data in about 2 seconds. With a 10GbE network in your cluster, you can transfer all of this data from one machine to another in just 7 seconds.
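
A quick back-of-the-envelope check of those figures (the block size and rates come from the answer itself; the ~70% effective network throughput is an added assumption):

    // Sanity-check the answer's numbers; all rates are assumptions from the text.
    val blockMB  = 128.0                 // HDFS block size
    val diskMBps = 70.0                  // per-disk sequential scan rate
    val scanSec  = blockMB / diskMBps    // ~1.8 s per block; with all 47 blocks on
                                         // different disks they are scanned in parallel,
                                         // so the whole file is read in ~2 s
    val fileMB   = 47 * blockMB          // ~6 GB laid out as 47 blocks
    val netMBps  = 10000.0 / 8 * 0.7     // 10GbE at ~70% effective throughput (assumption)
    val xferSec  = fileMB / netMBps      // ~6.9 s, matching the answer's ~7 s
    println(f"scan: $scanSec%.1f s, transfer: $xferSec%.1f s")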

Lastly, Hadoop writes intermediate data to disk a number of times. It writes map output to disk at least once (more if the map output is big and on-disk merges happen). It writes the data to disk again on the reduce side, before the reduce itself executes. Spark writes the data to disk only once, during the shuffle phase, and the reference Spark setup recommends increasing the filesystem write cache so that this 'shuffle' data never actually hits the disks.
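
For context, that single disk write happens at the reduceByKey shuffle boundary in the word count above. A hedged sketch of the shuffle-related knobs involved (these configuration keys exist in Spark; the values and path here are illustrative only):

    import org.apache.spark.SparkConf

    // Shuffle output is written once to local disk under spark.local.dir;
    // compressing it shrinks what has to hit (or stay in) the write cache.
    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.local.dir", "/mnt/ssd/spark-tmp")   // shuffle spill location (path is hypothetical)
      .set("spark.shuffle.compress", "true")          // compress map-side shuffle files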

All of this gives Spark a big performance boost compared to Hadoop. There is no magic in Spark RDDs as far as this question is concerned.
