Apache Flink: Count window with timeout
Question
Here is a simple code example to illustrate my question:
import org.apache.flink.streaming.api.scala._
case class Record( key: String, value: Int )
object Job extends App
{
val env = StreamExecutionEnvironment.getExecutionEnvironment
val data = env.fromElements( Record("01",1), Record("02",2), Record("03",3), Record("04",4), Record("05",5) )
val step1 = data.filter( record => record.value % 3 != 0 ) // introduces some data loss
val step2 = data.map( r => Record( r.key, r.value * 2 ) )
val step3 = data.map( r => Record( r.key, r.value * 3 ) )
val merged = step1.union( step2, step3 )
val keyed = merged.keyBy(0)
val windowed = keyed.countWindow( 3 )
val summed = windowed.sum( 1 )
summed.print()
env.execute("test")
}
This produces the following results:
Record(01,6)
Record(02,12)
Record(04,24)
Record(05,30)
As expected, no result is produced for key "03" because the count window expects 3 elements and only two are present in the stream.
What I would like is some kind of count window with timeout so that, after a certain timeout, if the number of elements expected by the count window is not reached, a partial result is produced with the existing elements.
With this behavior, in my example, a Record(03,15) would be produced when the timeout is reached.
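To make the expected output concrete, here is a minimal plain-Scala sketch of the semantics, with no Flink dependency. It only models the grouping: the timeout is reduced to a flag, `emitPartial`, that stands for "the timeout has elapsed". The names `CountWindowSketch`, `mergedInput`, `countWindowSum` and `emitPartial` are made up for this illustration, and the output is sorted by key, which Flink does not guarantee.

```scala
object CountWindowSketch {
  case class Record(key: String, value: Int)

  // Rebuilds the merged stream of the example: step1 (filtered), step2 (x2), step3 (x3).
  def mergedInput: Seq[Record] = {
    val data  = (1 to 5).map(i => Record(f"$i%02d", i))
    val step1 = data.filter(_.value % 3 != 0)
    val step2 = data.map(r => Record(r.key, r.value * 2))
    val step3 = data.map(r => Record(r.key, r.value * 3))
    step1 ++ step2 ++ step3
  }

  // Sums each group of `size` elements per key. Incomplete groups are dropped,
  // unless `emitPartial` is set, which models the timeout having elapsed.
  def countWindowSum(records: Seq[Record], size: Int, emitPartial: Boolean): Seq[Record] =
    records.groupBy(_.key).toSeq.sortBy(_._1).flatMap { case (key, rs) =>
      rs.grouped(size)
        .filter(group => emitPartial || group.size == size)
        .map(group => Record(key, group.map(_.value).sum))
    }

  def main(args: Array[String]): Unit = {
    println("Plain count window:")
    countWindowSum(mergedInput, 3, emitPartial = false).foreach(println)
    println("Count window with timeout:")
    countWindowSum(mergedInput, 3, emitPartial = true).foreach(println)
  }
}
```

With `emitPartial = false` the sketch yields the four records listed above; with `emitPartial = true` it additionally yields Record(03,15).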
Recommended answer
I have followed both David's and NIrav's approaches and here are the results.
1) Using a custom trigger:
Here I have reversed my initial logic. Instead of using a 'count window', I use a 'time window' whose duration corresponds to the timeout, combined with a trigger that fires as soon as the expected number of elements has been processed.
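import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
case class Record( key: String, value: Int )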
object Job extends App
{
val env = StreamExecutionEnvironment.getExecutionEnvironment
val data = env.fromElements( Record("01",1), Record("02",2), Record("03",3), Record("04",4), Record("05",5) )
val step1 = data.filter( record => record.value % 3 != 0 ) // introduces some data loss
val step2 = data.map( r => Record( r.key, r.value * 2 ) )
val step3 = data.map( r => Record( r.key, r.value * 3 ) )
val merged = step1.union( step2, step3 )
val keyed = merged.keyBy(0)
val windowed = keyed.timeWindow( Time.milliseconds( 50 ) )
val triggered = windowed.trigger( new CountTriggerWithTimeout( 3, env.getStreamTimeCharacteristic ) )
val summed = triggered.sum( 1 )
summed.print()
env.execute("test")
}
Here is the trigger code:
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.common.state.ReducingState
import org.apache.flink.api.common.state.ReducingStateDescriptor
import org.apache.flink.api.common.typeutils.base.LongSerializer
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.windowing.triggers._
import org.apache.flink.streaming.api.windowing.triggers.Trigger.TriggerContext
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
/**
* A trigger that fires when the count of elements in a pane reaches the given count or a
* timeout is reached, whichever happens first.
*/
class CountTriggerWithTimeout[W <: TimeWindow](maxCount: Long, timeCharacteristic: TimeCharacteristic) extends Trigger[Object,W]
{
private val countState: ReducingStateDescriptor[java.lang.Long] = new ReducingStateDescriptor[java.lang.Long]( "count", new Sum(), LongSerializer.INSTANCE)
override def onElement(element: Object, timestamp: Long, window: W, ctx: TriggerContext): TriggerResult =
{
val count: ReducingState[java.lang.Long] = ctx.getPartitionedState(countState)
count.add( 1L )
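if ( count.get >= maxCount || timestamp >= window.getEnd )
{
count.clear() // reset so that the next batch in this window is counted from zero, as Flink's built-in CountTrigger does
TriggerResult.FIRE_AND_PURGE
}
else TriggerResult.CONTINUE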
}
override def onProcessingTime(time: Long, window: W, ctx: TriggerContext): TriggerResult =
{
if (timeCharacteristic == TimeCharacteristic.EventTime) TriggerResult.CONTINUE else
{
if ( time >= window.getEnd ) TriggerResult.CONTINUE else TriggerResult.FIRE_AND_PURGE
}
}
override def onEventTime(time: Long, window: W, ctx: TriggerContext): TriggerResult =
{
if (timeCharacteristic == TimeCharacteristic.ProcessingTime) TriggerResult.CONTINUE else
{
if ( time >= window.getEnd ) TriggerResult.CONTINUE else TriggerResult.FIRE_AND_PURGE
}
}
override def clear(window: W, ctx: TriggerContext): Unit =
{
ctx.getPartitionedState( countState ).clear
}
class Sum extends ReduceFunction[java.lang.Long]
{
def reduce(value1: java.lang.Long, value2: java.lang.Long): java.lang.Long = value1 + value2
}
}
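One property of this approach is easy to miss: the timeout is the end of the time window, and tumbling windows are aligned to multiples of the window size, not to the first element of a key. The following is a rough plain-Scala model of that behavior, with no Flink dependency. `TumblingTimeoutSketch`, `Event`, `windowStart` and `run` are hypothetical names, timestamps are assumed to be non-negative, and the model assumes the counter restarts after each count-based firing.

```scala
object TumblingTimeoutSketch {
  case class Event(time: Long, key: String, value: Int)

  // Start of the tumbling window that contains `time`.
  def windowStart(time: Long, size: Long): Long = time - (time % size)

  // Groups events per (key, window). Inside a window, a sum is emitted every
  // `maxCount` elements; whatever is left over is emitted when the window ends.
  // Returns (emission time, key, sum), ordered by emission time and key.
  def run(events: Seq[Event], size: Long, maxCount: Int): Seq[(Long, String, Int)] =
    events.sortBy(_.time)
      .groupBy(e => (e.key, windowStart(e.time, size)))
      .toSeq
      .flatMap { case ((key, start), es) =>
        es.grouped(maxCount).map { pane =>
          // A full pane fires on its last element, a partial one at the window end.
          val firedAt = if (pane.size == maxCount) pane.last.time else start + size
          (firedAt, key, pane.map(_.value).sum)
        }
      }
      .sortBy(r => (r._1, r._2))

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(10, "a", 1), Event(20, "a", 2), Event(30, "a", 3), // full pane
      Event(45, "b", 6), Event(48, "b", 9),                    // partial pane
      Event(60, "c", 1))                                       // falls into the next window
    run(events, size = 50, maxCount = 3).foreach(println)
  }
}
```

In this model, key "b" first appears at 45 ms but its partial sum is emitted at 50 ms, so the effective wait is anywhere between 0 and the window size depending on where the first element falls within the window.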
2) Using a process function:
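import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
case class Record( key: String, value: Int )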
object Job extends App
{
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic( TimeCharacteristic.IngestionTime )
val data = env.fromElements( Record("01",1), Record("02",2), Record("03",3), Record("04",4), Record("05",5) )
val step1 = data.filter( record => record.value % 3 != 0 ) // introduces some data loss
val step2 = data.map( r => Record( r.key, r.value * 2 ) )
val step3 = data.map( r => Record( r.key, r.value * 3 ) )
val merged = step1.union( step2, step3 )
val keyed = merged.keyBy(0)
val processed = keyed.process( new TimeCountWindowProcessFunction( 3, 100 ) )
processed.print()
env.execute("test")
}
With all the logic (i.e., windowing, triggering, and summing) going into the function:
import org.apache.flink.streaming.api.functions._
import org.apache.flink.util._
import org.apache.flink.api.common.state._
case class Status( count: Int, key: String, value: Int ) // value is an Int so that it can be written back into Record.value
class TimeCountWindowProcessFunction( count: Long, windowSize: Long ) extends ProcessFunction[Record,Record]
{
lazy val state: ValueState[Status] = getRuntimeContext
.getState(new ValueStateDescriptor[Status]("state", classOf[Status]))
override def processElement( input: Record, ctx: ProcessFunction[Record,Record]#Context, out: Collector[Record] ): Unit =
{
val updated: Status = Option( state.value ) match {
case None => {
ctx.timerService().registerEventTimeTimer( ctx.timestamp + windowSize )
Status( 1, input.key, input.value )
}
case Some( current ) => Status( current.count + 1, input.key, input.value + current.value )
}
if ( updated.count == count )
{
out.collect( Record( input.key, updated.value ) )
state.clear
}
else
{
state.update( updated )
}
}
override def onTimer( timestamp: Long, ctx: ProcessFunction[Record,Record]#OnTimerContext, out: Collector[Record] ): Unit =
{
Option( state.value ) match {
case None => // ignore
case Some( status ) => {
out.collect( Record( status.key, status.value ) )
state.clear
}
}
}
}
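For comparison, here the timeout runs from the first element of each group, not from a window boundary. The sketch below is a plain-Scala model of that state machine with no Flink dependency; `KeyedTimeoutSketch`, `Event`, `Pending` and `run` are hypothetical names. It assumes events are handled in timestamp order and that a timer due at or before an event's timestamp fires before that event is processed.

```scala
object KeyedTimeoutSketch {
  case class Event(time: Long, key: String, value: Int)
  // Per-key accumulator: elements seen, running sum, and the time at which it expires.
  case class Pending(count: Int, sum: Int, deadline: Long)

  // Returns (emission time, key, sum) for every full or timed-out group.
  def run(events: Seq[Event], maxCount: Int, timeout: Long): Seq[(Long, String, Int)] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[(Long, String, Int)]
    var pending = Map.empty[String, Pending]

    // Emits and removes every accumulator whose deadline is at or before `now`.
    def fireTimers(now: Long): Unit = {
      val (expired, alive) = pending.partition { case (_, p) => p.deadline <= now }
      expired.toSeq.sortBy { case (k, p) => (p.deadline, k) }
        .foreach { case (k, p) => out += ((p.deadline, k, p.sum)) }
      pending = alive
    }

    events.sortBy(_.time).foreach { e =>
      fireTimers(e.time)
      val updated = pending.get(e.key) match {
        case None    => Pending(1, e.value, e.time + timeout) // first element starts the clock
        case Some(p) => p.copy(count = p.count + 1, sum = p.sum + e.value)
      }
      if (updated.count == maxCount) {
        out += ((e.time, e.key, updated.sum)) // count reached: emit and reset
        pending -= e.key
      } else pending += (e.key -> updated)
    }
    fireTimers(Long.MaxValue) // end of input: let the remaining timers fire
    out.toSeq
  }
}
```

For example, with `maxCount = 3` and `timeout = 100`, events for key "a" at 0, 10 and 20 are summed and emitted at 20, while two events for key "b" at 5 and 30 are emitted as a partial sum at 105. One difference from the process function above: in this model the deadline is stored with the accumulator, so a group completed by count leaves no timer behind, whereas the function above never removes the timer it registered for such a group.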