Google Dataflow Pipeline with Instance Local Cache + External REST API calls


Question

We want to build a Cloud Dataflow Streaming pipeline which ingests events from Pubsub and performs multiple ETL-like operations on each individual event. One of these operations is that each event has a device-id which needs to be transformed to a different value (let's call it mapped-id); the device-id->mapped-id mapping is provided by an external service over a REST API. The same device-id might be repeated across multiple events, so these device-id->mapped-id mappings can be cached and re-used. Since we might be dealing with as many as 3M events per second at peak through the pipeline, calls to the REST API need to be reduced as much as possible and also optimized when they are actually needed.

With this setup in mind, I have the following questions.

  • To optimize the REST API calls, does Dataflow provide any built-in optimizations like connection pooling? Or, if we decide to use our own such mechanisms, are there any restrictions/limitations we need to keep in mind?

  • We are looking at some of the in-memory cache options to locally cache the mappings, some of which are backed by local files as well. So, are there any restrictions on how much memory (as a fraction of the total instance memory) these caches can use without affecting the regular Dataflow operations in the workers? If we use a file-backed cache, is there a path on each worker which is safe for the application itself to use for building these files?

  • The number of unique device-ids could be in the order of many millions, so not all of those mappings can be stored in each instance. To be able to utilize the local cache better, we need some affinity between a device-id and the worker where it is processed. I can do a group-by based on the device-id before the stage where this transform happens. If I do that, is there any guarantee that the same device-id will more or less be processed by the same worker? If there is some reasonable guarantee, then I would not have to hit the external REST API most of the time, other than for the first call, which should be fine. Or is there a better way to ensure such affinity between the ids and the workers?

Thanks

Solution

Here are a few things you can do:

  • Your DoFns can have instance variables, and you can put the cache there.
  • It's also OK to use regular Java static variables for a cache local to the VM, as long as you properly manage multithreaded access to it. Guava's CacheBuilder can be really helpful here (see the sketch after this list).
  • Using regular Java APIs for temp files on a worker is safe (but again, be mindful of multithreaded/multiprocess access to your files, and make sure to clean them up - you may find the DoFn @Setup and @Teardown methods useful).
  • You can do a GroupByKey by the device id; then, most of the time, at least with the Cloud Dataflow runner, the same key will be processed by the same worker (key assignments can change while the pipeline runs, though usually not too frequently). You'll probably want to set a windowing/triggering strategy with immediate triggering, as in the sketch below.
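
To make this concrete, here is a minimal sketch, assuming the Apache Beam Java SDK and Guava. It wires a GroupByKey on device-id with immediate triggering into a DoFn that keeps the mappings in a VM-local static LoadingCache, with @Setup/@Teardown hooks for worker-local temp files. Event, RestMappingClient, and fetchMappedId are hypothetical stand-ins for your own event type and REST client, not real APIs.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class MappedIdExample {

  // Hypothetical stand-ins for your own event type and REST client.
  public interface Event {
    String getDeviceId();
    Event withMappedId(String mappedId);
  }

  static class RestMappingClient {
    static String fetchMappedId(String deviceId) {
      return "mapped-" + deviceId; // call the external mapping service here
    }
  }

  // A VM-local cache in a static variable, shared by all DoFn instances on
  // the worker. Guava's LoadingCache is thread-safe, and the loader runs
  // only on a miss, so each device-id hits the REST API at most once per
  // worker while its entry stays cached.
  static class MapDeviceIdsFn extends DoFn<KV<String, Iterable<Event>>, Event> {
    private static final LoadingCache<String, String> MAPPED_IDS =
        CacheBuilder.newBuilder()
            .maximumSize(1_000_000) // bound memory; tune to the instance size
            .build(CacheLoader.from(RestMappingClient::fetchMappedId));

    private transient Path spillDir;

    @Setup
    public void setup() throws IOException {
      // Regular Java temp-file APIs are safe on workers, e.g. for a
      // file-backed cache; just remember to clean up in @Teardown.
      spillDir = Files.createTempDirectory("mapped-id-cache");
    }

    @Teardown
    public void teardown() throws IOException {
      Files.deleteIfExists(spillDir); // empty in this sketch
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      String mappedId = MAPPED_IDS.getUnchecked(c.element().getKey());
      for (Event event : c.element().getValue()) {
        c.output(event.withMappedId(mappedId));
      }
    }
  }

  // Key by device-id and group, so the runner routes a given id to (mostly)
  // the same worker; immediate triggering keeps GroupByKey from buffering
  // the stream.
  static PCollection<Event> mapDeviceIds(PCollection<Event> events) {
    return events
        .apply(WithKeys.of(Event::getDeviceId)
            .withKeyType(TypeDescriptors.strings()))
        .apply(Window.<KV<String, Event>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply(GroupByKey.<String, Event>create())
        .apply(ParDo.of(new MapDeviceIdsFn()));
  }
}

Note the maximumSize bound on the cache: with unique device-ids in the millions, each worker keeps only its hottest entries and evicts the rest, while the key affinity from the GroupByKey keeps the hit rate high. Only the first event for a device-id on a given worker (or one whose entry was evicted) pays for a REST call.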

