In Spark, is counting the records in an RDD an expensive task?


Problem description

In Hadoop, when I use an InputFormat reader, the job-level logs report how many records were read, and they also show the byte count, etc.

In Spark, when I use the same InputFormat reader, I get none of those metrics.

So I'm thinking of using the InputFormat reader to populate the RDD, and then just publishing the number of records in the RDD (its size).

I know that rdd.count() returns the size of the RDD.

However, the cost of using count() is not clear to me. For example:

  • Is it a distributed function? Will each partition report its count, with the counts then summed and reported? Or is the entire RDD brought into the driver and counted there?

  • After executing count(), will the RDD still remain in memory, or do I have to explicitly cache it?

  • Is there a better way to do what I want, namely count the records before operating on them?
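
For illustration, a minimal sketch of the plan described above (Scala, spark-shell style; the input path and the choice of TextInputFormat are assumptions, not part of the original setup):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Populate the RDD through a Hadoop InputFormat reader (placeholder path).
    val rdd = sc.newAPIHadoopFile(
      "hdfs:///data/input",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // Publish the record count before doing anything else with the data.
    println(s"records read: ${rdd.count()}")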


Solution


    Is it a distributed function? Will each partition report its count, with the counts then summed and reported? Or is the entire RDD brought into the driver and counted there?

Count is distributed. In Spark nomenclature, count is an "Action", and all actions are distributed. Really, only a handful of operations bring all of the data to the driver node, and they are generally well documented (e.g. take, collect, etc.).
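
As a rough illustration (Scala, as you might type into spark-shell, where sc is already available):

    // A small RDD spread over several partitions, purely for illustration.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

    // count() is an action: each partition counts its own records on the
    // executors, and only the per-partition totals travel to the driver,
    // where they are summed into a single Long.
    val n: Long = rdd.count()

    // collect() is one of the handful of operations that really does pull
    // every record back to the driver; avoid it on anything large.
    val all: Array[Int] = rdd.collect()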

    After executing count(), will the RDD still remain in memory, or do I have to explicitly cache it?

No, the data will not be in memory. If you want it to be, you need to explicitly cache it before counting. Spark's lazy evaluation performs no computation until an action is taken, and no data is kept in memory after an action unless there was a cache call.


    Is there a better way to do what I want to do, namely count the records before operating on them?

Cache, count, then operate looks like a solid plan.
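
In code, that order might look roughly like this (again spark-shell style Scala; the input path and the map over line lengths are placeholders, not part of the question):

    // 1. Build the RDD; nothing is read yet because evaluation is lazy.
    val records = sc.textFile("hdfs:///data/input")   // placeholder path

    // 2. Mark it for caching *before* the first action, so the data read
    //    during count() is kept in memory for the later transformations.
    records.cache()

    // 3. count() triggers the actual read and materialises the cache.
    println(s"records read: ${records.count()}")

    // 4. Operate on the now-cached data without re-reading the input.
    val totalChars = records.map(_.length).sum()
    println(s"total characters: $totalChars")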



