如何在不同的 Flink 算子中访问相同的变量 [英] How to access a same variable in different Flink operators

查看:40
本文介绍了如何在不同的 Flink 算子中访问相同的变量的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个收藏,例如val m = ConcurrentMap(),通常我可以使用一个以它为参数的方法,不同的线程可以通过相同的m调用该方法.

I have a collection, e.g. val m = ConcurrentMap(), normally I can use a method taking it as parameter and different threads can call the method passing the same m.

在flink中可能是

val s = StreamExecutionEnvironment.getExecutionEnvironment
s.addSource(new MySource(m))
      .map(new MyMap(m))
      .addSink(new MySink(m))

这些参数会被序列化到不同的机器上,似乎不能被不同的操作员共享.我发现 ColocationGroup 可能接近解决方案.这样对吗?怎么做?

These params would be serialized to different machines and seems that it could not be shared by different operators. I found that ColocationGroup maybe close to the solution. Is it right? How to do it?

推荐答案

没有办法在运算符之间共享内存中的数据结构,甚至在 相同 运算符的并行子任务之间,因为操作符的每个实例都可以在单独的 JVM 中运行.

There's no way to share an in-memory data structure between operators, or even between parallelized sub-tasks for the same operator, as each instance of an operator can be running in a separate JVM.

通常,您会弄清楚如何设计工作流程以避免需要共享数据,因为这通常会导致并发性和可扩展性问题.

Normally you'd figure out how to design your workflow to avoid needing to share data, as that will often lead to concurrency and scalability issues.

如果你不能使用数据分区来消除每个子任务看到所有数据的要求,你可以使用广播流来确保一个算子的每个子任务获得相同的数据.

You can use broadcast streams to ensure every sub-task of an operator gets the same data, if you can't use partitioning of the data to remove the requirement that every sub-task sees all the data.

最坏的情况是,您使用一些共享数据存储(Cassandra、HBase 等)来存储此数据映射,但几乎总是可以通过重新设计工作流程来避免这种情况.

Worst case, you use some shared data store (Cassandra, HBase, etc) for this map of data, but almost always you can avoid that by redesigning your workflow.

这篇关于如何在不同的 Flink 算子中访问相同的变量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆