Apache Beam - Reading JSON and Stream
Problem description
I am writing Apache Beam code in which I have to read a JSON file that has been placed in the project folder, then read the data and stream it.
This is the sample code to read the JSON. Is this the correct way of doing it?
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(SparkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("ReadMyFile", TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));
System.out.println("lines: " + lines);
Or should I use:
p.apply(FileIO.match().filepattern("/Users/xyz/eclipse-workspace/beam-prototype/test.json"))
I just need to read the JSON file below: read the complete testdata from this file and then stream it.
{
"testdata":{
"siteOwner":"xxx",
"siteInfo":{
"siteID":"id_member",
"siteplatform":"web",
"siteType":"soap",
"siteURL":"www"
}
}
}
The above code is not reading the JSON file; instead it prints:
lines: ReadMyFile/Read.out [PCollection]
Could you please guide me with a sample reference?
Recommended answer
This is the sample code to read the JSON. Is this the correct way of doing it?
To quickly answer your question: yes. Your sample code is the correct way to read a file containing JSON when each line of the file contains a single JSON element. The TextIO input transform reads a file line by line, so if a single JSON element spans multiple lines, as it does in your pretty-printed file, the individual lines will not be parseable on their own.
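Because the read is line-based, one possible workaround is to rewrite the file so the whole document sits on a single line before the pipeline sees it. The following is a minimal sketch in plain Java with no Beam dependency. The class and method names (JsonOneLine, toSingleLine) are hypothetical, and it assumes the input is syntactically valid JSON.

```java
public class JsonOneLine {

  /**
   * Removes whitespace that appears outside string literals, so that a
   * pretty-printed JSON document collapses onto one line.
   * Assumes the input is valid JSON; it does not validate anything.
   */
  public static String toSingleLine(String json) {
    StringBuilder out = new StringBuilder(json.length());
    boolean inString = false; // true while inside a "..." literal
    boolean escaped = false;  // true right after a backslash inside a literal
    for (int i = 0; i < json.length(); i++) {
      char c = json.charAt(i);
      if (inString) {
        // Inside a literal every character is kept, including spaces.
        out.append(c);
        if (escaped) {
          escaped = false;
        } else if (c == '\\') {
          escaped = true;
        } else if (c == '"') {
          inString = false;
        }
      } else if (c == '"') {
        inString = true;
        out.append(c);
      } else if (!Character.isWhitespace(c)) {
        // Outside a literal, whitespace (including newlines) is dropped.
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String pretty = "{\n  \"testdata\": {\n    \"siteOwner\": \"x y\"\n  }\n}\n";
    // prints {"testdata":{"siteOwner":"x y"}}
    System.out.println(toSingleLine(pretty));
  }
}
```

A file rewritten this way holds one JSON element per line, which is the shape the line-based read expects.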
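The second code sample can be made to have the same effect, but FileIO.match() on its own yields only metadata about the matched files; it has to be followed by FileIO.readMatches() and a transform that reads the file contents.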
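As a rough sketch of how the FileIO route could read each matched file as one complete string, which suits a JSON document that spans several lines: the class, method, and DoFn names below (ReadWholeFile, readDocuments, ReadFullyFn) are hypothetical, the file pattern is taken from the command line, and the code assumes the Beam Java SDK plus a runner such as the direct runner are on the classpath.

```java
import java.io.IOException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ReadWholeFile {

  // Hypothetical DoFn: emits the full contents of each matched file as one String.
  static class ReadFullyFn extends DoFn<FileIO.ReadableFile, String> {
    @ProcessElement
    public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<String> out)
        throws IOException {
      out.output(file.readFullyAsUTF8String());
    }
  }

  // Match files, open them, then read each one whole (one element per file).
  static PCollection<String> readDocuments(Pipeline p, String filepattern) {
    return p
        .apply("MatchFiles", FileIO.match().filepattern(filepattern))
        .apply("OpenFiles", FileIO.readMatches())
        .apply("ReadFully", ParDo.of(new ReadFullyFn()));
  }

  public static void main(String[] args) {
    // No runner is set explicitly; the default is assumed to be available.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    readDocuments(p, args[0]);
    p.run().waitUntilFinish();
  }
}
```

Each element of the resulting PCollection is then the text of a whole file, so a downstream transform can hand it to a JSON parser in one piece.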
The above code is not reading the JSON file; instead it prints:
The printed result is expected. The variable lines does not actually contain the JSON strings in the file. lines is a PCollection of Strings; it simply represents the state of the pipeline after a transform is applied. Elements in the pipeline are accessed by applying subsequent transforms, and the actual JSON strings can be accessed inside the implementation of such a transform.
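To illustrate accessing the elements inside a transform, here is a minimal sketch that applies a ParDo to the lines collection and prints each element as it is processed. The class and DoFn names (PrintLines, PrintLineFn) are hypothetical, the file path is read from the command line rather than hard-coded, and no runner is set explicitly, so a runner such as the direct runner is assumed to be on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class PrintLines {

  // Hypothetical DoFn: it is called once per element, i.e. once per line of the file.
  static class PrintLineFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> out) {
      System.out.println("line: " + line); // the actual string is available here
      out.output(line);                    // pass it on unchanged
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);

    // Building the pipeline only describes the work; nothing is read yet.
    PCollection<String> lines = p.apply("ReadMyFile", TextIO.read().from(args[0]));
    lines.apply("PrintLines", ParDo.of(new PrintLineFn()));

    // The file is read and the DoFn is invoked only when the pipeline runs.
    p.run().waitUntilFinish();
  }
}
```

Nothing is printed by the DoFn until p.run() is called, which is also why printing the lines variable directly shows only a description of the collection.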