Apache 光束:状态规范中的 TTL [英] Apache beam: TTL in State Spec

查看:29
本文介绍了Apache 光束:状态规范中的 TTL的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们正在读取 Kinesis 并写入 parquet,我们使用 StateSpec> 来避免在从最后一个保存点正常停止和重新启动我们的管道后重复处理记录.>

我们看到一些记录被复制,因为它们最终会在后续重新启动时落在不同的任务管理器上,我们使用 StateSpec> 来存储有关已处理记录的状态信息并避免重复.

我们正在处理如何每隔特定时间清除状态,而不会丢失最近处理的记录(如果在即将到来的停止中需要它们).(即,我们需要一个类似于该类的 TTL 的东西).

我们想过一个定时器,它每隔一定时间清除一次状态,但不符合我们的要求,因为我们需要保留最近处理过的记录.

我们阅读 此处 使用事件时间处理会在窗口到期后自动清除状态信息,我们想知道这是否符合我们使用 StateSpec 类的要求.

否则,是否有一个存储状态的类具有一种 TTL 来实现此功能?

我们现在拥有的是这段代码,用于检查元素是否已经处理过,以及每隔一定时间清除状态的方法

 @StateId("keyPreserved")private final StateSpec>keyPreserved = StateSpecs.value(BooleanCoder.of());@TimerId("resetStateTimer")private final TimerSpec resetStateTimer = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);public void processElement(ProcessContext context,@TimerId("resetStateTimer") 定时器 resetStateTimer,@StateId(keyPreserved") ValueState<Boolean>keyPreservedState) {if (!firstNonNull(keyPreservedState.read(), false)) {T 消息 = context.element().getValue();//这里处理元素keyPreservedState.write(true);}}@OnTimer("resetStateTimer")public void onResetStateTimer(OnTimerContext 上下文,@StateId(keyPreserved") ValueState<Boolean>keyPreservedState) {keyPreservedState.clear();}

解决方案

每次调用 keyPreservedState.write(true); 时设置计时器就足够了.当计时器到期时 keyPreservedState.clear(); 只清除上下文中的元素,而不是整个状态.

We are reading from Kinesis and writing to parquet and we use StateSpec<ValueState<Boolean>> to avoid duplicated processing of records after gracefully stopping and relaunching our pipeline from the last savepoint.

We saw that some records were duplicated because they end up falling on a different task manager on subsequent relaunches, and we use StateSpec<ValueState<Boolean>> to store stateful information about the processed records and avoid duplicates.

We are dealing with how to clear the state every certain time without the risk of losing the most recent processed records if they are needed in an upcoming stop. (i.e, we need a something like a TTL on that class).

We thought about a timer that clears the state every certain time but that doesn't meet our requirements because we need to keep the most recent processed records.

We read here that using event time processing automatically clears State information after a window expires and we would like to know if that fits with our requirement using the StateSpec class.

Otherwise, is there a class to store state that has a kind of TTL to implement this feature?

What we have right now is this piece of code that checks if the element has already processed and a method that clears the state every certain time

    @StateId("keyPreserved")
    private final StateSpec<ValueState<Boolean>> keyPreserved = StateSpecs.value(BooleanCoder.of());
    @TimerId("resetStateTimer")
    private final TimerSpec resetStateTimer = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

    public void processElement(ProcessContext context,
        @TimerId("resetStateTimer") Timer resetStateTimer,
        @StateId("keyPreserved") ValueState<Boolean> keyPreservedState) {
        if (!firstNonNull(keyPreservedState.read(), false)) {
            T message = context.element().getValue();

            //Process element here

            keyPreservedState.write(true);
        }
    }


    @OnTimer("resetStateTimer")
    public void onResetStateTimer(OnTimerContext context,
        @StateId("keyPreserved") ValueState<Boolean> keyPreservedState) {
        keyPreservedState.clear();
    }

解决方案

Setting the timer every time we call keyPreservedState.write(true); was enough. When the timer expires keyPreservedState.clear(); only clears the element in the contexts, not the whole state.

这篇关于Apache 光束:状态规范中的 TTL的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆