Deduplicate pubsub messages back to pubsub with dataflow possible?

Problem description

I have an application writing data to Google Cloud Pub/Sub, and per the Pub/Sub documentation, duplicates caused by the retry mechanism can happen once in a while. There is also the issue of message ordering, which Pub/Sub likewise does not guarantee.

Also per the documentation, it is possible to use Google Cloud Dataflow to deduplicate these messages.

I want to make those messages available in a messaging queue (meaning Cloud Pub/Sub) for services to consume, and Cloud Dataflow does seem to have a PubsubIO writer. However, wouldn't you then be back to exactly the same problem, where writing to Pub/Sub can create duplicates? And wouldn't the same apply to ordering? How can I stream messages in order using Pub/Sub (or any other system, for that matter)?

Is it possible to use Cloud Dataflow to read from one Pub/Sub topic and write to another Pub/Sub topic with a guarantee of no duplicates? If not, how else would you do this in a way that supports streaming for a relatively small amount of data?

Also, I am very new to Apache Beam / Cloud Dataflow. What would such a simple use case look like? I suppose I can deduplicate using the ID generated by Pub/Sub itself, since I let the Pub/Sub library do its internal retries rather than retrying myself, so the ID should be the same across retries.
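
For illustration, here is a minimal, untested sketch in Java of what such a topic-to-topic pipeline might look like. The project, subscription, topic names, and the messageUniqueId attribute are placeholders: the assumption is that the publisher stamps each message with its own unique ID as an attribute, and PubsubIO's withIdAttribute then asks the runner to deduplicate on that value within its deduplication window.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;

public class PubsubToPubsub {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromSource",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/source-sub")
                // Hypothetical attribute set by the publisher; used here to deduplicate on read.
                .withIdAttribute("messageUniqueId"))
     .apply("WriteToSink",
            PubsubIO.writeStrings()
                .to("projects/my-project/topics/deduped-topic")
                // Propagate the same attribute so downstream readers can deduplicate as well.
                .withIdAttribute("messageUniqueId"));

    p.run();
  }
}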

Recommended answer

Cloud Dataflow / Apache Beam are Mack trucks. They are designed for parallelizing large data sources and streams. You can send huge amounts of data to Pub/Sub, but detecting duplicates is not a job for Beam, because that task needs to be serialized.

Reading from Pub/Sub and then writing to a different topic does not remove the issue of duplicates, because duplicates can occur on the new topic you are writing to. In addition, parallelizing the queue writes makes the out-of-order message problem worse.

The problem with duplicates needs to be solved on the client side that reads from the subscription. A simple database query can tell you whether an item has already been processed; if it has, you just discard the message.
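
As a concrete illustration of that idea, here is a small sketch in Java using JDBC. The processed_messages table, its unique message_id column, and the PostgreSQL ON CONFLICT syntax are assumptions made for the example; the point is that the insert acts as an atomic "have I already seen this ID?" check.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DedupChecker {
  private final Connection conn;

  public DedupChecker(String jdbcUrl) throws SQLException {
    this.conn = DriverManager.getConnection(jdbcUrl);
  }

  // Returns true if this messageId has not been seen before and records it;
  // returns false if it was already processed (i.e. the message is a duplicate).
  public boolean markIfNew(String messageId) throws SQLException {
    // Relies on a UNIQUE constraint on processed_messages.message_id;
    // ON CONFLICT DO NOTHING makes the check-and-insert atomic (PostgreSQL syntax).
    String sql =
        "INSERT INTO processed_messages (message_id) VALUES (?) ON CONFLICT (message_id) DO NOTHING";
    try (PreparedStatement stmt = conn.prepareStatement(sql)) {
      stmt.setString(1, messageId);
      return stmt.executeUpdate() == 1; // one row inserted => first time this ID is seen
    }
  }
}

In the subscriber callback you would call markIfNew with the message's ID; if it returns false, the message is a duplicate and can simply be acknowledged and dropped.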

Handling out-of-order messages must also be designed into your application.
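
One common way to do that, sketched below purely as an illustration, is to have the publisher attach a monotonically increasing sequence number to each message and to resequence on the consumer side: buffer anything that arrives early and release messages only once the next expected sequence number has been seen.

import java.util.TreeMap;
import java.util.function.Consumer;

// Buffers messages by a publisher-assigned sequence number and releases them in order.
public class Resequencer<T> {
  private final TreeMap<Long, T> pending = new TreeMap<>();
  private long nextExpected;
  private final Consumer<T> downstream;

  public Resequencer(long firstSequence, Consumer<T> downstream) {
    this.nextExpected = firstSequence;
    this.downstream = downstream;
  }

  // Accepts a message with its sequence number; delivers any now-contiguous run in order.
  public synchronized void accept(long sequence, T message) {
    if (sequence < nextExpected) {
      return; // already delivered (duplicate or late arrival) -> drop
    }
    pending.put(sequence, message);
    // Release the longest contiguous prefix starting at nextExpected.
    while (pending.containsKey(nextExpected)) {
      downstream.accept(pending.remove(nextExpected));
      nextExpected++;
    }
  }
}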

Pub/Sub is designed to be a lightweight, inexpensive message queue system. If you need guaranteed message ordering, no duplicates, FIFO semantics, and so on, you will need to use a different solution, which of course is much more expensive.
