How to read large files from an HTTP response in Apache Beam?

Views: 60 | Posted: 2021/4/7 20:56:56 | Tags: apache-beam, apache-beam-io


Question

Apache Beam's TextIO can be used to read JSON files from some filesystems, but how can I create a PCollection from a large JSON payload (an InputStream) returned by an HTTP response in the Java SDK?

Recommended answer

I don't think there is a generic built-in solution in Beam for this at the moment; see the list of supported IOs.

I can think of several approaches; which one works for you will depend on your requirements:

- I would probably first try to build another layer (probably not in Beam) that saves the HTTP output into a GCS bucket (maybe splitting it into multiple files in the process), and then use Beam's TextIO to read from the GCS bucket;
- depending on the properties of the HTTP source, you can consider:
  - writing your own ParDo that reads the whole response in a single step, splits it, and outputs the split elements separately. Further transforms would then parse the JSON or do other processing;
  - implementing your own source, which is more complicated but will probably work better for very large (unbounded) responses.
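
The splitting step of the ParDo option could look roughly like the sketch below. It is not from the original answer and makes several assumptions: the response is newline-delimited JSON (one record per line), Java 11 or later is available, and the class and method names (`HttpLineSplitter`, `emitLines`, `fetchAndEmit`) are hypothetical. It uses only the JDK so that it stays self-contained; inside a Beam `DoFn`, the `Consumer<String>` would be the place to pass something like the `output` method of the `OutputReceiver<String>` given to `@ProcessElement`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper, not part of Beam: splits a response body into lines.
final class HttpLineSplitter {
    private HttpLineSplitter() {}

    // Reads the stream line by line and hands each non-blank line to `out`.
    // Assumes newline-delimited JSON, so each line is one complete record.
    static void emitLines(InputStream body, Consumer<String> out) {
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(body, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isBlank()) {
                    out.accept(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Issues a GET request and streams the body through emitLines.
    // The body is consumed as an InputStream, so it is never held in memory whole.
    static void fetchAndEmit(String url, Consumer<String> out) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        try {
            HttpResponse<InputStream> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofInputStream());
            emitLines(response.body(), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching " + url, e);
        }
    }
}
```

Note that this streams the body instead of loading the whole response first, which is a deliberate deviation from "reads the whole response in a single step" to keep memory use bounded. It still runs on a single worker per URL, so it does not parallelize the download itself; a single JSON document that is not line-delimited would need a streaming JSON parser instead.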