Flink: calculating the median on a stream

Published: 2021/4/8 18:34:18 · Tag: apache-flink

Question

I need to calculate the median of many parameters received from a Kafka stream over a 15-minute time window.

I couldn't find any built-in function for that, but I have found a way using a custom WindowFunction.

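My questions are:

Is this a difficult task for Flink? The data might be very large.
If the data reaches gigabytes, will Flink store everything in memory until the end of the time window? (One of the parameters of the apply WindowFunction implementation is an Iterable, a collection of all the data that arrived during the time window.)

Thanks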

Recommended answer

Your question contains several aspects, but let me answer the most fundamental one:

Is this a hard task for Flink, and why is it not a standard example?

Yes, the median is a hard concept, because the only way to determine it exactly is to keep the full data.

Many statistics don't need the full data to be calculated. For instance:

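If you have the sum, you can take the previous sum and add the latest observation.
If you have the count, you add 1 and get the new count.
If you have the mean, you can keep track of the sum and the count behind the scenes and compute the new mean whenever an observation arrives.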

This can even be done with more complicated metrics, like the standard deviation.
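To make the incremental statistics above concrete, here is a minimal sketch in plain Python. It is not Flink API: the class name `RunningStats` and its methods are hypothetical, and the variance update uses Welford's method, which is one common choice and not something the answer prescribes. The point is only that the state stays constant in size no matter how many observations arrive.

```python
import math


class RunningStats:
    """Count, sum, mean and standard deviation in constant memory.

    Hypothetical helper for illustration; uses Welford's update for variance.
    """

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        # Each update touches only a handful of numbers, never past data.
        self.count += 1
        self.total += x
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    def stddev(self):
        # Population standard deviation; 0.0 when fewer than two values.
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / self.count)
```

In a windowed stream, an accumulator like this could be updated once per element and emitted when the window closes, so the individual elements never need to be retained.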

However, there is no shortcut for determining the median. The only way to know what the median is after adding a new observation is to look at all observations and figure out which one is in the middle.
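For contrast, a minimal sketch of the exact computation in plain Python (again not Flink API; `exact_median` is a hypothetical name). It takes any iterable, much like the Iterable handed to a window function, and has to materialize and sort every value before it can answer, which is why memory grows with the window size.

```python
def exact_median(values):
    """Exact median of a finite window; needs every value in the window."""
    ordered = sorted(values)  # copies all observations: O(n) memory
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty window is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: the single middle element
    # Even count: average of the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2
```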

As such, it is a challenging metric, and the size of the incoming data will need to be handled. As mentioned, approximate estimators may be in the works, such as this one: https://issues.apache.org/jira/browse/FLINK-2147

Alternatively, you could look at how your data is distributed and perhaps estimate the median from metrics like the mean, skewness, and kurtosis.

A final solution I could come up with, if you only need to know approximately what the value is, is to pick a few 'candidates' and count the fraction of observations below each of them. The one closest to 50% would then be a reasonable estimate.
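The candidate idea can be sketched as follows, in plain Python and not as Flink API. The class name `CandidateMedianEstimator` and the way candidates are supplied are assumptions made for illustration; choosing good candidates (for example from domain knowledge or a previous window) is left open, and the estimate can only be as accurate as the candidate grid is fine.

```python
class CandidateMedianEstimator:
    """Estimates the median using one counter per candidate value.

    Hypothetical helper: memory depends on the number of candidates,
    not on the number of observations.
    """

    def __init__(self, candidates):
        if not candidates:
            raise ValueError("at least one candidate is required")
        self.candidates = sorted(candidates)
        self.below = [0] * len(self.candidates)  # observations below each one
        self.count = 0

    def add(self, x):
        self.count += 1
        for i, candidate in enumerate(self.candidates):
            if x < candidate:
                self.below[i] += 1

    def estimate(self):
        if self.count == 0:
            raise ValueError("no observations yet")
        # Pick the candidate whose 'fraction below' is closest to 50%.
        best = min(
            range(len(self.candidates)),
            key=lambda i: abs(self.below[i] / self.count - 0.5),
        )
        return self.candidates[best]
```

For example, with candidates 10, 50 and 90 and the observations 1 through 100, the fractions below are 9%, 49% and 89%, so the estimate is 50, close to the true median of 50.5.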
