Apache Beam - Reading JSON and Streaming It
Question
I am writing Apache Beam code in which I have to read a JSON file that is placed in the project folder, then read its data and stream it.
This is the sample code to read the JSON. Is this the correct way of doing it?
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(SparkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("ReadMyFile", TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));
System.out.println("lines: " + lines);
Or should I use:
p.apply(FileIO.match().filepattern("/Users/xyz/eclipse-workspace/beam-prototype/test.json"))
I just need to read the JSON file below: read the complete testdata from this file and then stream it.
{
  "testdata": {
    "siteOwner": "xxx",
    "siteInfo": {
      "siteID": "id_member",
      "siteplatform": "web",
      "siteType": "soap",
      "siteURL": "www"
    }
  }
}
The above code is not reading the JSON file; it prints this instead:
lines: ReadMyFile/Read.out [PCollection]
Could you please guide me with a sample reference?
Recommended answer
"This is the sample code to read the JSON. Is this the correct way of doing it?"
To quickly answer your question, yes. Your sample code is the correct way to read a file containing JSON, provided that each line of the file contains a single JSON element. The TextIO input transform reads a file line by line, so if a single JSON element spans multiple lines, as it does in your test.json, the individual lines will not be parseable.
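The second code sample is not equivalent on its own: FileIO.match() only produces metadata for the matching files, so it has to be followed by FileIO.readMatches() and a transform that reads each file's contents.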
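Because the test.json in the question spans several lines, one option is to read each matched file in full so that the whole document arrives as a single element. The following is a minimal sketch, not a definitive implementation. It assumes the Beam Java SDK and a runner (for example the direct runner) are on the classpath; the class name ReadWholeJsonFile and the helper readWholeFiles are illustrative names that do not appear in the original post.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReadWholeJsonFile {

  /** Hypothetical helper: yields one String per matched file, holding the full file content. */
  static PCollection<String> readWholeFiles(Pipeline p, String filepattern) {
    return p
        // Produces MatchResult.Metadata elements, not file contents.
        .apply("MatchFiles", FileIO.match().filepattern(filepattern))
        // Turns each metadata element into a ReadableFile handle.
        .apply("ReadMatches", FileIO.readMatches())
        // Reads every file completely, so a multi-line JSON document stays in one piece.
        .apply("ReadFully", MapElements
            .into(TypeDescriptors.strings())
            .via((FileIO.ReadableFile file) -> {
              try {
                return file.readFullyAsUTF8String();
              } catch (IOException e) {
                throw new UncheckedIOException(e);
              }
            }));
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    readWholeFiles(p, "/Users/xyz/eclipse-workspace/beam-prototype/test.json")
        // Printing happens inside a transform, when the pipeline actually runs.
        .apply("PrintDocument", MapElements
            .into(TypeDescriptors.strings())
            .via((String json) -> {
              System.out.println(json);
              return json;
            }));

    p.run().waitUntilFinish();
  }
}
```

Each output element can then be handed to a JSON parser of your choice in a later transform. Reading a file in full only suits files small enough to fit in a worker's memory.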
"The above code is not reading the JSON file; it prints this instead"
The printed result is expected. The variable lines does not actually contain the JSON strings in the file. lines is a PCollection of Strings; it simply represents the state of the pipeline after a transform is applied. Elements in the pipeline are accessed by applying subsequent transforms, and the actual JSON string can be accessed inside the implementation of such a transform. Note also that nothing is read until the pipeline is executed with p.run().
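To make this concrete, here is a minimal sketch of accessing the elements inside a transform. It assumes the Beam Java SDK and a runner (for example the direct runner, or the SparkRunner from the question passed via --runner) are on the classpath; the names PrintJsonLines and PrintFn are illustrative and not part of the original post.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class PrintJsonLines {

  /** Hypothetical DoFn: prints each element it receives and passes it on unchanged. */
  static class PrintFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> out) {
      // This is where the actual string is available, one element at a time.
      System.out.println("line: " + line);
      out.output(line);
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // lines is only a handle on a step of the pipeline; it holds no data yet.
    PCollection<String> lines =
        p.apply("ReadMyFile",
            TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));

    lines.apply("PrintLines", ParDo.of(new PrintFn()));

    // The file is read and PrintFn is invoked only once the pipeline runs.
    p.run().waitUntilFinish();
  }
}
```

With TextIO each printed element is one line of the file, so this layout suits files that hold one JSON element per line. The order in which lines are printed is not guaranteed.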