使用大量内存管理状态 - 从存储中查询 [英] Manage state with huge memory usage - querying from storage

查看:24
本文介绍了使用大量内存管理状态 - 从存储中查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如果这听起来很愚蠢,请道歉!我们正在使用 flink 进行异步 IO 调用.很多时候,IO 调用是重复的(相同的参数集)并且我们调用的大约 80% 的 API 对相同的参数返回相同的响应.因此,我们希望避免再次拨打电话.我们认为我们可以使用 state 来存储以前的响应并再次使用它们.问题是,虽然我们的响应可以再次使用,但此类响应的数量很大,因此需要大量内存.有没有办法在需要时将其持久化以驱动和查询?

Apologies if this sounds dumb! We are working with flink to make async IO calls. A lot of the times, the IO calls are repeated (same set of parameters) and about 80% of the APIs that we call return the same response for the same parameters. So, we would like to avoid making the calls again. We thought we could use state to store previous responses and use them again. The issue is that though our responses can be used again, the number of such responses is huge and therefore requires a lot of memory. Is there a way to persist this to drive and query as and when required?

推荐答案

这根本不是一个愚蠢的问题!

Not a dumb question at all!

一些事实揭示了为什么这并不简单:

A few facts reveal why this isn't straightforward:

  1. Flink 状态对于单个操作员来说是严格本地的.您无法在其他操作员中访问状态.
  2. Flink 提供了一种可以溢出到磁盘的状态后端,即 RocksDB.RocksDB 中仅存储键状态 - 非键状态始终存在于堆中.
  3. 异步 i/o 运算符不能用于键控流 - 它只能在非键控上下文中使用.
  4. 将迭代(作业图中的循环连接)与 DataStream API 结合使用是一个非常糟糕的主意(因为它会破坏检查点).

当然,可能不需要缓存处于 Flink 的托管状态.

Of course, it may not be necessary for the cache be in Flink's managed state.

一些选项:

  • 不要对缓存使用键控状态.您可以为缓存使用一个单独的 RocksDB 实例之类的东西,并直接在 async i/o 操作符中实现缓存.如果缓存适合内存,我建议使用 Guava.
  • 不要使用异步 i/o.按照@YuvalItzchakov 的建议,在 ProcessFunction 中自行执行获取和缓存操作.
  • 您可以改用有状态函数.这是一个位于 Flink 之上的新库和 API,克服了上面列出的一些限制.
  • 您可以构建如下图所示的内容.这里缓存在 CoProcessFunction 中保持键控状态.如果缓存未命中,则使用下游异步 i/o 运算符来获取丢失的数据.然后必须使用外部队列(例如 Kafka、Kinesis 或 Pulsar)将其循环回缓存.
  • Don't use keyed state for the cache. You could use something like a separate RocksDB instance for the cache, and implement the caching directly in the async i/o operator. If the cache would fit in memory, I'd suggest Guava.
  • Don't use async i/o. Do the fetching and caching yourself in a ProcessFunction, as proposed by @YuvalItzchakov.
  • You could use Stateful Functions instead. This is a new library and API that sits on top of Flink and overcomes some of the limitations listed above.
  • You could build something like the diagram below. Here the cache is held in keyed state in a CoProcessFunction. If the cache misses, a downstream async i/o operator is used to fetch the missing data. This then has to be looped back to the cache using an external queue, such as Kafka, or Kinesis, or Pulsar.
                    +---------------------+                                       +------+
                    |                     +--results from cache+---------------^--> SINK |
+--requests+------> |  CoProcessFunction  |                                    |  +------+
                    |                     |                                    |
+--cache misses+--> |  cache in RocksDB   |                    +-----------+   |
                    |                     +--side output:      | fetch via +---+-> loop back
     SOURCES        +---------------------+  cache misses+---> | async i/o |       as 2nd input
                                                               +-----------+       to fill cache

这篇关于使用大量内存管理状态 - 从存储中查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆