Multithreading in Apache Beam: Reading Files in Separate Threads

Question

We have a requirement to create separate threads for reading multiple files:

  1. Thread 1 reads File 1 and creates a PCollection<String>. Can I execute a ParDo operation in a multithreaded environment and create a PCollection<KV<String, String>> from the PCollection<String>?
  2. Thread 2 performs the same operation as Thread 1, but on a different file, File 2.
  3. After Thread 1 and Thread 2 have finished, the outputs for File 1 and File 2 are joined in the main thread.

Could you please tell me whether this is possible, and whether it is a recommended approach?

Recommended answer

It sounds like everything you want can be done with Beam. In the Beam model, you do not define how your operations run, but rather what operations you want to perform; Beam and the underlying runner then take care of managing threads.

That's why you generally shouldn't manage your own threads to read files in Beam. Use TextIO to read from plain text files; TextIO should read the files in parallel.
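As a rough illustration of how the three-step plan from the question might look as a Beam pipeline, the sketch below reads both files with TextIO, applies a ParDo to each, and merges the results with Flatten. The class name `TwoFileJoin`, the file paths, and the comma-separated `key,value` line format are assumptions made for the example, not details from the question. It also assumes the Beam Java SDK and a runner (for instance the direct runner) are on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class TwoFileJoin {

  // Assumed line format "key,value"; a line without a comma gets an empty value.
  static KV<String, String> toKeyValue(String line) {
    String[] parts = line.split(",", 2);
    return KV.of(parts[0], parts.length > 1 ? parts[1] : "");
  }

  // The ParDo step: turns each input line into a key-value pair.
  static class ToKeyValueFn extends DoFn<String, KV<String, String>> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<KV<String, String>> out) {
      out.output(toKeyValue(line));
    }
  }

  // Reads one file and keys its lines. No threads are created here;
  // the runner decides how the work is parallelized.
  static PCollection<KV<String, String>> readKeyed(Pipeline p, String name, String path) {
    return p.apply("Read" + name, TextIO.read().from(path))
        .apply("Key" + name, ParDo.of(new ToKeyValueFn()));
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder paths; replace with real locations.
    PCollection<KV<String, String>> file1 = readKeyed(p, "File1", "/path/to/file1.txt");
    PCollection<KV<String, String>> file2 = readKeyed(p, "File2", "/path/to/file2.txt");

    // Merge both outputs into a single PCollection for downstream steps.
    PCollection<KV<String, String>> merged =
        PCollectionList.of(file1).and(file2).apply("Merge", Flatten.pCollections());

    p.run().waitUntilFinish();
  }
}
```

If "join" means matching records by key rather than concatenating the two outputs, CoGroupByKey would be the transform to look at instead of Flatten.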

There are a few cases in which your files cannot be read in parallel:

  1. Your files are compressed. A compressed file has to be decompressed as it is read, so it cannot be read from several offsets simultaneously. (Different files can still be read in parallel with one another.)
  2. You have too many files (thousands). If you have thousands or tens of thousands of files, you may want to use TextIO.readAll instead of the normal TextIO read, because keeping track of thousands of files that are being read in parallel can overwhelm the system.
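The compressed-file case can be illustrated outside Beam with the JDK's gzip classes. The sketch below is only a demonstration of the underlying limitation, not of how Beam reads files. It compresses some text in memory and then tries to start decompressing from the beginning and from the middle of the stream. The class and method names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipOffsetDemo {

  // Builds n short lines of sample text.
  static String sampleLines(int n) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < n; i++) {
      sb.append("line ").append(i).append('\n');
    }
    return sb.toString();
  }

  // Compresses the text into an in-memory gzip stream.
  static byte[] gzip(String text) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  // Returns true if the stream can be fully decompressed starting at the given byte offset.
  static boolean canDecompressFrom(byte[] gz, int offset) {
    try (GZIPInputStream in =
        new GZIPInputStream(new ByteArrayInputStream(gz, offset, gz.length - offset))) {
      byte[] buf = new byte[4096];
      while (in.read(buf) != -1) {
        // Drain the stream; any corruption surfaces as an IOException.
      }
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    byte[] gz = gzip(sampleLines(1000));
    System.out.println("from start:  " + canDecompressFrom(gz, 0));
    System.out.println("from middle: " + canDecompressFrom(gz, gz.length / 2));
  }
}
```

Decompression from the start succeeds, while starting from the middle is expected to fail, because the gzip header and the earlier part of the stream are missing. This is why a single compressed file is normally processed by one reader from beginning to end.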

Let me know if you are using non-plain-text files or some other kind of source.
