Can I force a step in my dataflow pipeline to be single-threaded (and on a single machine)?

Question

I have a pipeline that takes URLs for files, downloads those files, and generates a BigQuery table row for each line except the header.

To avoid duplicate downloads, I want to check each URL against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table.

For this to work, I need to either store the history in a database that enforces unique values or, perhaps more easily, use BigQuery for this as well. In the BigQuery case, access to the table must be strictly serial.

Can I enforce single-threaded execution (on a single machine) for just this part of my pipeline?

(After this point, each of my hundreds of URLs/files would be suitable for processing on a separate thread. Each file gives rise to 10,000 to 10,000,000 rows, so throttling at that point will almost certainly not cause performance issues.)

Recommended answer

Beam is designed for parallel processing of data, and it explicitly tries to stop you from synchronizing or blocking except through a few built-in primitives, such as Combine.

It sounds like what you want is a filter that emits an element (your URL) only the first time it is seen. You can probably use the built-in Distinct transform for this. This transform uses a per-key Combine to group the elements by key (your URL in this case), then emits each key only once.
