Multithreading in Apache Beam: Reading Files in Separate Threads

Question

We have a requirement to create separate threads for reading multiple files.

  1. Thread 1 reads file 1 and creates a PCollection<String>. Can I execute a ParDo operation in a multithreaded environment and create a PCollection<String, String> from a PCollection<String>?
  2. Thread 2 performs the same operations as Thread 1, but on a different file, file 2.
  3. Once Thread 1 and Thread 2 have finished, the outputs of file 1 and file 2 are joined in the main thread.

Could you please tell me whether this is possible and whether it is a recommended approach?

Recommended answer

It sounds like what you want can be done with Beam. In the Beam model, you do not define how your operations should run, but rather which operations you want to perform; Beam and the underlying runner then take care of managing threads.

That's why you generally shouldn't manage your own threads to read files in Beam. You should use TextIO to read from plain text files, and the TextIO module should read the files in parallel.
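
To make this concrete, here is a minimal sketch of the scenario from the question, written against the Beam Java SDK. It is an illustration under assumptions, not the asker's actual pipeline: the file names file1.txt and file2.txt, the class name TwoFilePipeline, and the comma-splitting rule in toKv are all hypothetical. "Join" is read here as a plain union (Flatten); if the two outputs need to be matched by key, CoGroupByKey would be the usual alternative. No threads are created in user code; the runner decides how the reads and the ParDo are parallelized.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class TwoFilePipeline {

  // Hypothetical parsing rule (an assumption, not from the question):
  // split each line at the first comma into a key and a value.
  static KV<String, String> toKv(String line) {
    int comma = line.indexOf(',');
    if (comma < 0) {
      return KV.of(line, "");
    }
    return KV.of(line.substring(0, comma), line.substring(comma + 1));
  }

  // DoFn applied through ParDo. The runner, not this code, decides how
  // many threads or workers execute it.
  static class ToKvFn extends DoFn<String, KV<String, String>> {
    @ProcessElement
    public void processElement(
        @Element String line, OutputReceiver<KV<String, String>> out) {
      out.output(toKv(line));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Each file gets its own read and ParDo; the two branches are independent,
    // so the runner is free to execute them in parallel.
    PCollection<KV<String, String>> first =
        p.apply("ReadFile1", TextIO.read().from("file1.txt")) // placeholder path
            .apply("ParseFile1", ParDo.of(new ToKvFn()));
    PCollection<KV<String, String>> second =
        p.apply("ReadFile2", TextIO.read().from("file2.txt")) // placeholder path
            .apply("ParseFile2", ParDo.of(new ToKvFn()));

    // Merge both outputs into one PCollection (a union, not a key-based join).
    PCollection<KV<String, String>> merged =
        PCollectionList.of(first).and(second).apply("Merge", Flatten.pCollections());

    p.run().waitUntilFinish();
  }
}
```

Running this requires the Beam Java SDK and a runner (for example the direct runner) on the classpath, and the two placeholder files must exist.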

There are a few cases in which your files cannot be read in parallel:

  1. Your files are compressed. A compressed file has to be decompressed as it is read, so a single file cannot be read from different offsets simultaneously.
  2. You have too many files (1000s). If you have thousands or tens of thousands of files, you may want to use TextIO.readAll instead of the normal TextIO implementation, because keeping track of thousands of files that are being read in parallel can overwhelm the system.
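
Both cases above can be sketched with the Beam Java SDK as follows. This is a hedged example, not a definitive setup: the class name ManyFilesPipeline, the glob patterns, and the path input/archive.txt.gz are hypothetical. TextIO.readAll takes a PCollection of file patterns as its input, and withCompression states the compression explicitly (by default TextIO tries to detect it from the file extension). As far as I know, newer SDK releases mark TextIO.readAll as deprecated in favor of FileIO.match, FileIO.readMatches and TextIO.readFiles, so check the documentation for your version.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class ManyFilesPipeline {

  // Hypothetical glob patterns; in practice they might come from a listing
  // step or from another PCollection.
  static List<String> filePatterns() {
    return Arrays.asList("input/part-a-*.txt", "input/part-b-*.txt");
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Many files: feed the patterns in as data and let readAll expand and
    // read them, instead of creating one TextIO.read() per file.
    PCollection<String> lines =
        p.apply("FilePatterns", Create.of(filePatterns()))
            .apply("ReadAll", TextIO.readAll());

    // Compressed file: the runner reads it as a single unit and does not
    // split it by offset. The path is a placeholder.
    PCollection<String> gzLines =
        p.apply(
            "ReadGzip",
            TextIO.read()
                .from("input/archive.txt.gz")
                .withCompression(Compression.GZIP));

    p.run().waitUntilFinish();
  }
}
```

Separate compressed files can still be processed in parallel with one another; only the reading of each individual file is sequential.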

Let me know if you are using non-plain-text files or some other kind of source.
