How to read large files from HTTP response in Apache Beam?


Question

Apache Beam's TextIO can be used to read JSON files in some filesystems, but how can I create a PCollection out of a large JSON (InputStream) resulting from an HTTP response in the Java SDK?

Answer

I don't think there's a generic built-in solution in Beam to do this at the moment; see the list of supported IOs.

I can think of multiple approaches to this; which one works for you may depend on your requirements:

  • I would probably first try to build another layer (probably not in Beam) that saves the HTTP output into a GCS bucket (maybe splitting it into multiple files in the process) and then use Beam's TextIO to read from the GCS bucket (see the first sketch below);
  • depending on the properties of the HTTP source you can consider:
    • writing your own ParDo that reads the whole response in a single step, splits it, and outputs the split elements separately; further transforms would then parse the JSON or do other processing (see the second sketch below);
    • implementing your own source, which will be more complicated but will probably work better for very large (unbounded) responses.
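
For the first approach, here is a minimal sketch of the Beam side only, assuming a separate (non-Beam) layer has already written the HTTP response as newline-delimited JSON into a hypothetical bucket path gs://my-bucket/responses/ — the bucket name and file pattern are illustrative:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadStagedResponse {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // TextIO emits one element per line, so this works for newline-delimited JSON
    // that the staging layer wrote into the bucket.
    PCollection<String> jsonLines =
        p.apply("ReadFromGcs", TextIO.read().from("gs://my-bucket/responses/*.json"));

    // ...parse each line with a downstream transform here...

    p.run().waitUntilFinish();
  }
}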
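For the ParDo approach, a minimal sketch is shown below, assuming the response is newline-delimited JSON and small enough for one worker to read in a single step; the class name FetchAndSplitFn and the example URL are illustrative, not part of any Beam API:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class HttpToPCollection {

  // Fetches one URL and emits each line of the response as a separate element.
  static class FetchAndSplitFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String url, OutputReceiver<String> out) throws Exception {
      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          out.output(line); // one element per line, e.g. newline-delimited JSON
        }
      } finally {
        conn.disconnect();
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> jsonLines = p
        .apply("Urls", Create.of("https://example.com/large-response.json"))
        .apply("FetchAndSplit", ParDo.of(new FetchAndSplitFn()));

    // Further transforms would parse each JSON line here.

    p.run().waitUntilFinish();
  }
}

Note that because a single element does all of the fetching, this step does not parallelize and suits bounded responses; for very large or unbounded responses the custom-source route in the last bullet is the better fit.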
