Flink: DataSet.count() is bottleneck - How to count parallel?

Question

I am learning Map-Reduce using Flink and have a question about how to efficiently count elements in a DataSet. What I have so far is this:

DataSet<MyClass> ds = ...;
long num = ds.count();

When I execute this, my Flink log says:

12/03/2016 19:47:27 DataSink (count())(1/1) switched to RUNNING

So only one CPU is used (I have four, and other operations such as reduce use all of them).

I think count() internally collects the DataSet from all four CPUs and counts them sequentially instead of having each CPU count its part and then sum it up. Is that true?

If yes, how can I take advantage of all my CPUs? Would it be a good idea to first map my DataSet to a 2-tuple that contains the original value as the first item and the long value 1 as the second item, and then aggregate it using the SUM function?

For example, the DataSet<MyClass> would be mapped to DataSet<Tuple2<MyClass, Long>>, where the Long would always be 1. So when I sum up all items, the sum of the second tuple field would be the correct count.

What is the best practice to count items in a DataSet?

Regards, Simon

Recommended answer

DataSet#count() is a non-parallel operation and thus can only use a single thread.

You could do a count-by-key to parallelize the work and then apply a final sum over the per-key counts to get the overall count, which speeds up the computation.
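
The "partial counts, then a final sum" idea can be sketched as follows. Note that this is a variation rather than the count-by-key itself: it counts once per parallel partition with mapPartition, which avoids having to pick a key. The class name ParallelCount and the helper parallelCount are hypothetical names chosen for illustration, and the sketch assumes the Flink DataSet API (flink-java and flink-clients) is on the classpath.

```java
import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

// Hypothetical helper class, not part of Flink.
public class ParallelCount {

    // Counts the elements of ds by emitting one partial count per parallel
    // partition and then summing those partial counts.
    public static <T> long parallelCount(DataSet<T> ds) throws Exception {
        // Step 1: runs with the parallelism of ds; each parallel instance
        // counts only the elements of its own partition.
        DataSet<Long> partialCounts = ds.mapPartition(
            new MapPartitionFunction<T, Long>() {
                @Override
                public void mapPartition(Iterable<T> values, Collector<Long> out) {
                    long n = 0;
                    for (T ignored : values) {
                        n++;
                    }
                    out.collect(n);
                }
            });

        // Step 2: sum the partial counts. There is only one value per
        // partition, so this final step is cheap.
        DataSet<Long> total = partialCounts.reduce(new ReduceFunction<Long>() {
            @Override
            public Long reduce(Long a, Long b) {
                return a + b;
            }
        });

        // collect() triggers execution and returns the single summed value.
        return total.collect().get(0);
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Long> ds = env.generateSequence(1, 1000); // 1..1000 inclusive
        System.out.println(parallelCount(ds)); // prints 1000
    }
}
```

The expensive pass over the data happens in step 1 across all task slots; only the handful of partial counts reaches the final sum. Whether this is measurably faster than count() depends on the job and the Flink version, so it is worth checking the log to confirm that the mapPartition step actually runs with the expected parallelism.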
