Why does an RDD count take so much time?


Question

(English is not my first language, so please excuse any mistakes.)

I use SparkSQL to read 4.7 TB of data from a Hive table and perform a count operation, which takes about 1.6 hours. Reading the same data directly from the HDFS text files and counting takes only 10 minutes. Both jobs use the same resources and parallelism. Why does the RDD count take so much longer?

The Hive table has about 3000 thousand columns, so serialization may be costly. I checked the Spark UI: each task read about 240 MB of data and took about 3.6 minutes to execute. I can't believe the serialization overhead is that expensive.

Reading from Hive (takes 1.6 hours):

val sql = s"SELECT * FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd
val count = hiveData.count()

Reading from HDFS (takes 10 minutes):

val inputPath = s"/path/to/above/hivetable"
val hdfsData = sc.textFile(inputPath)
val count = hdfsData.count()

Even using a SQL COUNT, it still takes 5 minutes:

val sql = s"SELECT COUNT(*) FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd
// Collect the single result row to the driver before printing;
// println inside foreach runs on the executors, so its output
// would only appear in the executor logs, not on the driver.
hiveData.collect().foreach(println)

Answer

Your first method is querying the data instead of fetching it. Big difference.

val sql = s"SELECT * FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd

We can look at the above code as programmers and think, "yes, this is how we grab all of the data". But the data is being grabbed via a query rather than read from a file. Basically, the following steps occur:

  • Read from file into temporary storage
  • A query engine processes the query against the temporary storage and produces results
  • The results are read into an RDD

That's a lot of steps! More than what occurs with the following:

val inputPath = s"/path/to/above/hivetable"
val hdfsData = sc.textFile(inputPath)

Here, we have just one step:

  • Read from file into an RDD

See, that's only one third of the steps. Even though it is a simple query, there is still a lot of overhead and processing involved in getting the data into that RDD. Once it's in the RDD, though, processing is much easier, as shown by your code:

val count = hdfsData.count()
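A related point worth noting (an editorial addition, not part of the original answer): calling `.rdd` on the query result forces every row of this very wide table to be deserialized into JVM row objects before `count()` runs, whereas counting through the DataFrame API keeps the work inside the SQL engine, which can plan the count without materializing all the columns. A minimal sketch, assuming the same Spark 1.x `sqlContext` and table name as in the question:

```scala
// Sketch: count without dropping down to the RDD API.
// Assumes a Spark 1.x SQLContext named sqlContext, as in the question.
// sqlContext.table returns a DataFrame for the Hive table;
// DataFrame.count() is planned by the SQL engine and avoids
// deserializing every column of every row the way rdd.count() does.
val df = sqlContext.table("xxxtable")
val count: Long = df.count()
```

The same applies to the `SELECT COUNT(*)` variant: keeping the result as a DataFrame and reading the single value on the driver avoids the RDD conversion entirely.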
