Flink:DataSet.count() 是瓶颈 - 如何并行计数? [英] Flink: DataSet.count() is bottleneck - How to count parallel?

查看:20
本文介绍了Flink:DataSet.count() 是瓶颈 - 如何并行计数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 Flink 学习 Map-Reduce,并且有一个关于如何有效计算 DataSet 中元素的问题.到目前为止,我所拥有的是:

I am learning Map-Reduce using Flink and have a question about how to efficiently count elements in a DataSet. What I have so far is this:

DataSet<MyClass> ds = ...;
long num = ds.count();

执行此操作时,在我的 flink 日志中显示

When executing this, in my flink log it says

12/03/2016 19:47:27 DataSink (count())(1/1) 切换到 RUNNING

12/03/2016 19:47:27 DataSink (count())(1/1) switched to RUNNING

所以只使用了一个 CPU(我有四个和其他命令,如 reduce 使用所有这些).

So there is only one CPU used (i have four and other commands like reduce use all of them).

我认为 count() 在内部从所有四个 CPU 收集数据集并按顺序计算它们,而不是让每个 CPU 计算它的部分然后总结它.是真的吗?

I think count() internally collects the DataSet from all four CPUs and counts them sequentially instead of having each CPU count its part and then sum it up. Is that true?

如果是,我如何利用我所有的 CPU?首先将我的 DataSet 映射到包含原始值作为第一项和长值 1 作为第二项的 2 元组,然后使用 SUM 函数对其进行聚合是个好主意吗?

If yes, how can I take advantage of all my CPUs? Would it be a good idea to first map my DataSet to a 2-tuple that contains the original value as first item and the long value 1 as second item and then aggregate it using the SUM function?

例如,DataSet 将映射到 DataSet>,其中 Long 始终为 1.因此,当我对所有项目求和时,元组的第二个值的总和将是正确的计数值.

For example, the DataSet would be mapped to DataSet> where the Long would always be 1. So when I sum up all items the sum of the second value of the tuple would be the correct count value.

对数据集中的项目进行计数的最佳做法是什么?

What is the best practice to count items in a DataSet?

问候西蒙

推荐答案

DataSet#count() 是非并行操作,因此只能使用单线程.

DataSet#count() is a non-parallel operation and thus can only use a single thread.

您将进行逐键计数以获得并行化,并对您的键计数应用最终总和以获得总计数以加快计算速度.

You would do a count-by-key to get parallelization and apply a final sum over you key counts to get to overall count to speed up you computation.

这篇关于Flink:DataSet.count() 是瓶颈 - 如何并行计数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆