Spark: How to speed up foreachRDD?


Problem description

We have a Spark Streaming application that ingests data at about 10,000 records/sec. We use the foreachRDD operation on our DStream (since Spark doesn't execute anything unless it finds an output operation on the DStream).

So we have to use the foreachRDD output operation like this. It takes up to 3 hours to write a single batch of data (10,000 records), which is slow.

Code snippet 1:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest

requestsWithState.foreachRDD { rdd =>
  rdd.foreach {
    case (topicsTableName, hashKeyTemp, attributeValueUpdate) =>
      // Note: a new DynamoDB client is constructed for every single record
      val client = new AmazonDynamoDBClient()
      val request = new UpdateItemRequest(topicsTableName, hashKeyTemp, attributeValueUpdate)
      try client.updateItem(request)
      catch {
        case se: Exception => println(s"Error executing updateItem!\nTable $topicsTableName: $se")
      }
    case null =>
  }
}
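
For reference, a common way to cut per-record overhead in a loop like this is to create the DynamoDB client once per partition with foreachPartition instead of once per record. A minimal sketch, assuming the same tuple structure as above:

requestsWithState.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One client per partition instead of one per record
    val client = new AmazonDynamoDBClient()
    partition.foreach {
      case (topicsTableName, hashKeyTemp, attributeValueUpdate) =>
        val request = new UpdateItemRequest(topicsTableName, hashKeyTemp, attributeValueUpdate)
        try client.updateItem(request)
        catch {
          case se: Exception => println(s"Error executing updateItem!\nTable $topicsTableName: $se")
        }
      case null =>
    }
  }
}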

So I thought the code inside foreachRDD might be the problem and commented it out to see how much time it would take. To my surprise, even with no code inside the foreachRDD, it still runs for 3 hours.

Code snippet 2:

requestsWithState.foreachRDD { rdd =>
  rdd.foreach { _ =>
    // No code here, still takes a lot of time
    // (there used to be code, but it was removed to see if it's any faster without it)
  }
}

Please let us know if we are missing anything, or an alternative way to do this, since as I understand it a Spark Streaming application will not run without an output operation on the DStream. At this time I can't use other output operations.

Note: To pin down the problem and make sure the DynamoDB code wasn't at fault, I ran the empty loop. It looks like foreachRDD is slow simply iterating over the huge incoming record set (arriving at @10,000/sec), rather than the dynamo code being slow, since the empty foreachRDD and the one with the dynamo code took the same amount of time.

Screenshot showing all the stages that are executed and the time taken by foreachRDD, even though it's just looping with no code inside:

Screenshot: time taken by the empty foreachRDD loop:

Screenshot: task distribution of the long-running tasks across the 9 worker nodes for the empty foreachRDD loop:

Recommended answer

I know it is late, but if you'd like to hear them, I have some guesses that may give you some insight.

It is not the code inside rdd.foreach that takes a long time, but the code before rdd.foreach, the code that generates the RDD. Transformations are lazy: Spark does not compute them until you use the result. When the code runs in rdd.foreach, Spark does the computation and generates the data rows; the code in the rdd.foreach loop only manipulates the result. You can check this by commenting out rdd.foreach:

requestsWithState.foreachRDD { rdd =>
  // rdd.foreach {
  //   // No code here, still takes a lot of time
  //   // (there used to be code, but it was removed to see if it's any faster without it)
  // }
}

I guess it will be extremely fast, because no computation happens. Or you can change the transformations to a very simple one, and it will be fast too. This does not solve your problem, but if I'm right, it will help you locate it.
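
To make the check concrete, here is a minimal sketch (the timing code is mine, not from the original answer) that forces the upstream transformations with a cheap action and measures only that cost:

requestsWithState.foreachRDD { rdd =>
  val start = System.nanoTime()
  val n = rdd.count() // cheap action that forces the lazy upstream transformations to run
  println(s"Materializing $n records took ${(System.nanoTime() - start) / 1e9} s")
}

If this alone takes close to the 3 hours, the bottleneck is whatever builds requestsWithState upstream (judging by the name, likely a stateful transformation such as updateStateByKey), not the DynamoDB writes.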
