Flink: calculating the median on a stream


Question

I need to calculate the median of many parameters received from a Kafka stream over a 15-minute time window.

I couldn't find a built-in function for this, but I have found a way to do it with a custom WindowFunction.

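My questions are:

  1. Is this a heavy task for Flink? The data may be very large.
  2. If the data reaches gigabytes, will Flink keep everything in memory until the end of the time window? (One of the parameters of the apply method of a WindowFunction implementation is an Iterable, a collection of all the data that arrived during the time window.)

Thanks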

Recommended answer

Your question contains several aspects, but let me answer the most fundamental one:

Is this a heavy task for Flink, and why is it not a standard example?

Yes, the median is a hard metric to compute, because the only way to determine it exactly is to keep the full data.

Many statistics don't need the full data to be calculated. For instance:

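  • If you have a sum, you can take the previous sum and add the latest observation.
  • If you have a count, you add 1 and have the new count.
  • If you have an average, you can keep track of the sum and the count in the background and compute the new average at any time as observations arrive.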
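The three cases above can be sketched in a few lines. This is a minimal, framework-independent illustration in plain Python, not Flink API code; the class name `RunningStats` and the sample values are assumptions made up for the example.

```python
class RunningStats:
    """Keeps only a sum and a count, never the observations themselves."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, value):
        self.total += value  # new sum = previous sum + latest observation
        self.count += 1      # new count = previous count + 1

    def mean(self):
        # The average is derived on demand from the two tracked numbers.
        if self.count == 0:
            raise ValueError("no observations yet")
        return self.total / self.count


stats = RunningStats()
for x in [4.0, 8.0, 6.0]:  # hypothetical observations
    stats.add(x)
```

The state stays at two numbers regardless of how many observations arrive, which is why these metrics are cheap to compute on a stream.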

This can even be done with more complicated metrics, like the standard deviation.
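One well-known way to do this for the standard deviation is Welford's online algorithm, which the answer does not name; it is shown here only as an example of such an incremental update. The class name `RunningStd` and the sample values are assumptions for illustration.

```python
import math


class RunningStd:
    """Welford's online algorithm: constant state, one pass over the data."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Uses the deviation from the old mean and from the updated mean.
        self.m2 += delta * (value - self.mean)

    def std(self):
        """Population standard deviation of the values seen so far."""
        if self.count == 0:
            raise ValueError("no observations yet")
        return math.sqrt(self.m2 / self.count)


acc = RunningStd()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical observations
    acc.add(x)
```

Only three numbers are stored, so the memory cost does not grow with the window size.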

However, there is no shortcut for determining the median. The only way to know what the median is after adding a new observation is to look at all observations and then figure out which one is in the middle.
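To make the contrast concrete, here is a minimal sketch of an exact median over one buffered window, in plain Python rather than Flink code. It mirrors what a window function that receives all elements of the window would have to do; the function name `window_median` and the sample values are assumptions for illustration.

```python
def window_median(values):
    """Exact median: needs every observation of the window at once."""
    ordered = sorted(values)  # the whole window must be held in memory
    n = len(ordered)
    if n == 0:
        raise ValueError("empty window")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


window = [7, 1, 5, 3]                   # hypothetical observations
before = window_median(window)          # median of the first four values
after = window_median(window + [100])   # median after one more observation
```

Knowing `before` and the new observation alone is not enough to obtain `after`; the stored values decide where the middle moves, which is why the state grows with the window.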

As such, the median is a challenging metric, and you will need to handle the volume of incoming data. As mentioned, estimation approaches may be in the works, such as this one: https://issues.apache.org/jira/browse/FLINK-2147

Alternatively, you could look at how your data is distributed, and perhaps estimate the median from metrics like the mean, skewness, and kurtosis.

A final solution I could come up with, if you only need to know approximately what the value should be, is to pick a few 'candidates' and count the fraction of observations below each of them. The candidate whose fraction is closest to 50% would then be a reasonable estimate.
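A rough sketch of this candidate approach, in plain Python and independent of Flink, could look as follows. The class name `CandidateMedianEstimator`, the candidate values, and the input data are all assumptions chosen for illustration; in practice the candidates would have to come from prior knowledge of the data range.

```python
class CandidateMedianEstimator:
    """Keeps one counter per candidate instead of the observations."""

    def __init__(self, candidates):
        if not candidates:
            raise ValueError("at least one candidate is required")
        self.candidates = list(candidates)
        self.below = [0] * len(self.candidates)  # observations below each candidate
        self.count = 0

    def add(self, value):
        self.count += 1
        for i, candidate in enumerate(self.candidates):
            if value < candidate:
                self.below[i] += 1

    def estimate(self):
        if self.count == 0:
            raise ValueError("no observations yet")
        fractions = [b / self.count for b in self.below]
        # Pick the candidate whose "fraction below" is closest to 50%.
        best = min(range(len(self.candidates)),
                   key=lambda i: abs(fractions[i] - 0.5))
        return self.candidates[best]


est = CandidateMedianEstimator([10, 20, 30, 40])  # hypothetical candidates
for x in range(1, 51):                            # hypothetical observations 1..50
    est.add(x)
```

The state is one counter per candidate, so memory is independent of the window size, but the estimate can never be more precise than the spacing between candidates.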
