Google Dataflow Pipeline with Instance Local Cache + External REST API calls


Problem Description


We want to build a Cloud Dataflow streaming pipeline which ingests events from Pubsub and performs multiple ETL-like operations on each individual event. One of these operations is that each event has a device-id which needs to be transformed to a different value (let's call it mapped-id), the device-id->mapped-id mapping being provided by an external service over a REST API. The same device-id might be repeated across multiple events, so these device-id->mapped-id mappings can be cached and reused. Since we might be dealing with as many as 3M events per second at peak through the pipeline, the calls to the REST API need to be reduced as much as possible, and also optimized when the calls are actually needed.

With this setup in mind, I have the following questions.

  • To optimize the REST API calls, does Dataflow provide any built-in optimizations like connection pooling? Or, if we decide to use our own such mechanisms, are there any restrictions/limitations we need to keep in mind?

  • We are looking at some of the in-memory cache options to locally cache the mappings, some of which are backed by local files as well. So, are there any restrictions on how much memory (as a fraction of the total instance memory) these caches can use without affecting the regular Dataflow operations in the workers? If we use a file-backed cache, is there a path on each worker which is safe for the application itself to build these files on?

  • The number of unique device-ids could be in the order of many millions, so not all those mappings can be stored in each instance. So, to be able to utilize the local cache better, we need some affinity between a device-id and the worker where it is processed. I can do a group-by based on the device-id before the stage where this transform happens. If I do that, is there any guarantee that the same device-id will more or less be processed by the same worker? If there is some reasonable guarantee, then I would not have to hit the external REST API most of the time other than the first call, which should be fine. Or is there a better way to ensure such affinity between the ids and the workers?

Thanks

Solution

Here are a few things you can do:

  • Your DoFns can have instance variables, and you can put the cache there.
  • It's also OK to use regular Java static variables for a cache local to the VM, as long as you properly manage multithreaded access to it. Guava's CacheBuilder might be really helpful here.
  • Using regular Java APIs for temp files on a worker is safe (but again, be mindful of multithreaded / multiprocess access to your files, and make sure to clean them up - you may find the DoFn @Setup and @Teardown methods useful).
  • You can do a GroupByKey by the device id; then, most of the time, at least with the Cloud Dataflow runner, the same key will be processed by the same worker (though key assignments can change while the pipeline runs, but not too frequently usually). You'll probably want to set a windowing/triggering strategy with immediate triggering though.
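A minimal sketch of the VM-local cache idea from the bullets above. For self-containedness it uses a plain `ConcurrentHashMap` instead of Guava's CacheBuilder (which would additionally give you size limits and expiry), and `lookupMappedId` is a hypothetical stand-in for the real REST call:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DeviceIdMapper {
    // Static map shared by all DoFn instances (and threads) on the same worker VM.
    // ConcurrentHashMap handles the multithreaded access mentioned in the answer.
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    // Hypothetical external lookup; in a real pipeline this would be an
    // HTTP client call to the device-id -> mapped-id REST service.
    static String lookupMappedId(String deviceId) {
        return "mapped-" + deviceId; // stand-in for the REST response
    }

    // computeIfAbsent makes the check-then-fetch atomic per key, so
    // concurrent bundles don't issue duplicate REST calls for the same id.
    public static String mappedIdFor(String deviceId) {
        return CACHE.computeIfAbsent(deviceId, DeviceIdMapper::lookupMappedId);
    }
}
```

In a real pipeline, `mappedIdFor` would be called from a DoFn's `processElement`, ideally downstream of a GroupByKey on device-id so that repeated ids tend to land on the same worker and hit the warm cache.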

