Apache Beam - Reading JSON and Stream

Question

I am writing Apache Beam code in which I have to read a JSON file that has been placed in the project folder, read its data, and stream it.

This is the sample code to read the JSON. Is this the correct way of doing it?

PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(SparkRunner.class);

Pipeline p = Pipeline.create(options);

PCollection<String> lines = p.apply("ReadMyFile", TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));
System.out.println("lines: " + lines);

Or should I use:

p.apply(FileIO.match().filepattern("/Users/xyz/eclipse-workspace/beam-prototype/test.json"))

I just need to read the JSON file below: read the complete testdata from this file and then stream it.

{
"testdata":{
"siteOwner":"xxx",
"siteInfo":{
"siteID":"id_member",
"siteplatform":"web",
"siteType":"soap",
"siteURL":"www"
}
}
}

The above code does not read the JSON file; it prints something like

lines: ReadMyFile/Read.out [PCollection]

Could you please guide me with a sample reference?

Recommended answer

This is the sample code to read the JSON. Is this the correct way of doing it?

To quickly answer your question: yes. Your sample code is the correct way to read a file containing JSON when each line of the file contains a single JSON element. The TextIO input transform reads a file line by line, so if a single JSON element spans multiple lines, each element of the resulting PCollection holds only a fragment of it and cannot be parsed on its own.
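
To make the line-by-line point concrete, here is a small plain-Java illustration that does not use Beam at all. It only mimics what a line-oriented reader hands downstream, one line per element. The class name LineSplitDemo and both helper methods are made up for this sketch, and looksLikeCompleteObject is a crude brace check (it ignores braces inside string values), not a JSON parser.

```java
import java.util.Arrays;
import java.util.List;

public class LineSplitDemo {

  // Splits text on line breaks, mimicking a reader that emits one element per line.
  public static List<String> splitLines(String text) {
    return Arrays.asList(text.split("\\R"));
  }

  // Crude heuristic, NOT a JSON parser: a fragment can only be a complete
  // JSON object if it starts with '{', ends with '}', and its braces balance.
  public static boolean looksLikeCompleteObject(String fragment) {
    String t = fragment.trim();
    if (!t.startsWith("{") || !t.endsWith("}")) {
      return false;
    }
    int depth = 0;
    for (char c : t.toCharArray()) {
      if (c == '{') {
        depth++;
      } else if (c == '}') {
        depth--;
      }
      if (depth < 0) {
        return false;
      }
    }
    return depth == 0;
  }

  public static void main(String[] args) {
    // The same small document, once spread over several lines and once on one line.
    String multiLine = "{\n\"testdata\":{\n\"siteOwner\":\"xxx\"\n}\n}";
    String singleLine = "{\"testdata\":{\"siteOwner\":\"xxx\"}}";

    // Every line of the multi-line form is only a fragment of the document.
    for (String line : splitLines(multiLine)) {
      System.out.println(looksLikeCompleteObject(line) + " <- " + line);
    }
    // The single-line form arrives as one complete element.
    for (String line : splitLines(singleLine)) {
      System.out.println(looksLikeCompleteObject(line) + " <- " + line);
    }
  }
}
```

With the multi-line form, none of the five lines passes the check. The single-line form arrives as one element that does, which is the layout a line-oriented read expects.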

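The second code sample has the same effect once it is completed: FileIO.match() on its own only yields metadata about the matched files (MatchResult.Metadata), so it must be followed by FileIO.readMatches() and a read step such as TextIO.readFiles() to produce the lines.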
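
If the file holds one JSON document spread over several lines, one option is to build on the FileIO route and read each matched file as a single string rather than line by line. The following is a minimal sketch under stated assumptions, not a definitive implementation: the names ReadWholeJson, ReadWholeFileFn, and readWholeFiles are hypothetical, the runner is expected to be chosen through command-line options, and the Beam Java SDK plus a runner are assumed to be on the classpath.

```java
import java.io.IOException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ReadWholeJson {

  // Hypothetical DoFn: emits the full contents of each matched file as one String.
  public static class ReadWholeFileFn extends DoFn<FileIO.ReadableFile, String> {
    @ProcessElement
    public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<String> out)
        throws IOException {
      // Loads the whole file into memory, so this suits small files only.
      out.output(file.readFullyAsUTF8String());
    }
  }

  // Hypothetical helper: match the files, open them, and read each as one element.
  public static PCollection<String> readWholeFiles(Pipeline p, String filePattern) {
    return p.apply("MatchFiles", FileIO.match().filepattern(filePattern))
        .apply("OpenFiles", FileIO.readMatches())
        .apply("ReadWholeFile", ParDo.of(new ReadWholeFileFn()));
  }

  public static void main(String[] args) {
    // The runner is taken from the arguments (for example --runner=...).
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Each element of this PCollection is the text of one whole file.
    PCollection<String> documents =
        readWholeFiles(p, "/Users/xyz/eclipse-workspace/beam-prototype/test.json");

    // Nothing is read until the pipeline is actually run.
    p.run().waitUntilFinish();
  }
}
```

Each element of documents can then be handed to a JSON library inside a later transform. Which library to use is outside the scope of this sketch.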

The above code does not read the JSON file; it prints something like

The printed result is expected. The variable lines does not actually contain the JSON strings in the file. lines is a PCollection of Strings; it simply represents the state of the pipeline after a transform is applied, and nothing is read until the pipeline is run. Elements in the pipeline are accessed by applying subsequent transforms; the actual JSON strings can be accessed in the implementation of a transform.
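
As a rough sketch of what "applying a subsequent transform" can look like, the example below attaches a ParDo to lines and prints each element from inside the DoFn. The names PrintLines, PrintFn, and describe are hypothetical, and the sketch assumes the Beam Java SDK and a runner are on the classpath, with the runner selected through command-line options rather than hard-coded.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class PrintLines {

  // Hypothetical helper that formats one element for display.
  public static String describe(String line) {
    return "line: " + line;
  }

  // Hypothetical DoFn: the actual strings are visible here, one element per call.
  public static class PrintFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> out) {
      System.out.println(describe(line));
      // Pass the element through unchanged so later transforms can use it.
      out.output(line);
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Building the graph: this only describes the read, it does not perform it.
    PCollection<String> lines =
        p.apply("ReadMyFile",
            TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));

    // The subsequent transform is where the elements become accessible.
    lines.apply("PrintLines", ParDo.of(new PrintFn()));

    // The file is read and PrintFn is invoked only once the pipeline runs.
    p.run().waitUntilFinish();
  }
}
```

On a distributed runner the println output appears in the workers' logs rather than in the console of the launching program, so printing is mainly useful for local experiments.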