Flink: calculating the median on a stream

Question

I need to calculate the median of many parameters received from a Kafka stream over a 15-minute time window.

I couldn't find a built-in function for this, but I did find a way to do it with a custom WindowFunction.

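My questions are:

  1. Is this a heavy task for Flink? The data can be very large.
  2. If the data reaches gigabytes, will Flink keep everything in memory until the end of the time window? (One of the parameters of the apply method of a WindowFunction is an Iterable, the collection of all data that arrived within the time window.)

Thanks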

Recommended answer

Your question contains several aspects, but let me answer the most fundamental one:

Is this a heavy task for Flink, and why is it not a standard example?

Yes, the median is a hard statistic to compute on a stream, because the only way to determine it exactly is to keep the full data.

Many statistics don't need the full data to be calculated. For instance:

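  • If you have the sum, you can add the latest observation to the previous sum.
  • If you have the count, you add 1 and get the new count.
  • If you have the mean, you can track the sum and the count in the background and compute the new mean at any time as observations arrive.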
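
The three cases above can be sketched in a few lines. This is plain Python rather than Flink code, and the class name `RunningStats` is made up for illustration; the point is only that the state stays the same size no matter how many observations arrive.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-size state: only the count and the sum are kept."""
    count: int = 0
    total: float = 0.0

    def add(self, value: float) -> None:
        # Each observation updates the state in O(1); no history is stored.
        self.count += 1
        self.total += value

    @property
    def mean(self) -> float:
        # The mean is derived on demand from the two tracked values.
        if self.count == 0:
            raise ValueError("no observations yet")
        return self.total / self.count


stats = RunningStats()
for x in [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]:
    stats.add(x)
```

After the loop, `stats` holds two numbers regardless of how long the input was.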

This can even be done with more complicated metrics, like the standard deviation.
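
The answer does not say how the standard deviation would be tracked. One well-known option is Welford's online algorithm, sketched here in plain Python as an illustration (the class name `RunningVariance` is hypothetical). It keeps three numbers and never revisits old observations.

```python
import math


class RunningVariance:
    """Welford's online algorithm for the mean and standard deviation."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation before and after the mean update.
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        # Population standard deviation (divides by n, not n - 1).
        if self.n == 0:
            raise ValueError("no observations yet")
        return math.sqrt(self.m2 / self.n)


rv = RunningVariance()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    rv.add(x)
```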

However, there is no such shortcut for the median. The only way to know what the median is after adding a new observation is to look at all observations and then figure out which one is in the middle.
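
To make the cost concrete, here is a minimal sketch of an exact median over the contents of one window. It is plain Python, not Flink API, and `window_median` is a hypothetical helper; it stands in for what a custom window function would have to do with the collection of window elements it receives.

```python
def window_median(values):
    """Exact median of one window; `values` is any iterable of numbers."""
    ordered = sorted(values)  # materializes the whole window: O(n) memory
    n = len(ordered)
    if n == 0:
        raise ValueError("empty window")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even number of elements: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_window = window_median([5, 1, 3])
even_window = window_median(iter([4, 1, 3, 2]))
```

Unlike the sum or the count, nothing smaller than the full sorted list is enough to answer the next query exactly, which is why the memory needed grows with the window size.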

As such, it is a challenging metric, and you will need to deal with the size of the data that comes in. Estimation-based approaches may be in the works, such as this one: https://issues.apache.org/jira/browse/FLINK-2147

Alternatively, you could look at how your data is distributed and perhaps estimate the median from metrics such as the mean, skew, and kurtosis.

A final option I can think of, if you only need to know approximately what the value should be, is to pick a few "candidates" and count the fraction of observations below each of them. The candidate whose fraction is closest to 50% would then be a reasonable estimate.
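
The candidate idea could look roughly like the sketch below. It is plain Python, the class name `CandidateMedian` is invented for illustration, and it assumes the candidates are chosen up front (for example from domain knowledge or from an earlier window), which the answer does not specify.

```python
class CandidateMedian:
    """Approximate median: one counter per candidate, no stored data."""

    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.below = [0] * len(self.candidates)  # observations < candidate
        self.total = 0

    def add(self, x) -> None:
        self.total += 1
        for i, c in enumerate(self.candidates):
            if x < c:
                self.below[i] += 1

    def estimate(self):
        if self.total == 0:
            raise ValueError("no observations yet")
        # Pick the candidate whose 'fraction below' is closest to 50%.
        best = min(
            range(len(self.candidates)),
            key=lambda i: abs(self.below[i] / self.total - 0.5),
        )
        return self.candidates[best]


est = CandidateMedian([10, 30, 50, 70, 90])
for x in range(1, 101):  # observations 1..100, true median 50.5
    est.add(x)
```

The state is one counter per candidate, so memory does not depend on the window size. The estimate can only be as good as the candidate grid: if no candidate lies near the true median, the result will be off by at least the distance to the nearest one.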
