Apache Beam - Reading JSON and Stream

Question

I am writing Apache Beam code in which I have to read a JSON file that is placed in the project folder, then read the data and stream it.

This is the sample code to read the JSON. Is this the correct way of doing it?

PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(SparkRunner.class);

Pipeline p = Pipeline.create(options);

PCollection<String> lines = p.apply("ReadMyFile", TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));
System.out.println("lines: " + lines);

Or should I use

p.apply(FileIO.match().filepattern("/Users/xyz/eclipse-workspace/beam-prototype/test.json"))

I just need to read the JSON file below: read the complete testdata from this file and then stream it.

{
  "testdata": {
    "siteOwner": "xxx",
    "siteInfo": {
      "siteID": "id_member",
      "siteplatform": "web",
      "siteType": "soap",
      "siteURL": "www"
    }
  }
}

The above code is not reading the JSON file; it prints the following:

lines: ReadMyFile/Read.out [PCollection]

Could you please guide me with a sample reference?

Recommended answer

"This is the sample code to read JSON. Is this correct way of doing it?"

To answer your question quickly: yes. Your sample code is the correct way to read a file containing JSON when each line of the file contains a single JSON element. The TextIO input transform reads a file line by line, so if a single JSON element spans multiple lines, as it does in the test.json shown in the question, each line arrives as a separate element and cannot be parsed on its own.

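The second code sample is not equivalent on its own. FileIO.match() only produces metadata about the matching files (MatchResult.Metadata), not their contents. To read the data, it has to be followed by FileIO.readMatches() and a transform that reads each file.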
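For a JSON document that spans several lines, like the one in the question, one option is to read each matched file as a single string instead of line by line. The following is a minimal sketch under a few assumptions: the class name ReadWholeJsonFile, the DoFn name ReadWholeFileFn, the run helper and the step labels are hypothetical, and the default DirectRunner (beam-runners-direct-java) is assumed to be on the classpath instead of the SparkRunner used in the question.

```java
import java.io.IOException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ReadWholeJsonFile {

  // Hypothetical DoFn: emits the full contents of each matched file as one String.
  static class ReadWholeFileFn extends DoFn<FileIO.ReadableFile, String> {
    @ProcessElement
    public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<String> out)
        throws IOException {
      String json = file.readFullyAsUTF8String();
      // The real JSON text is only visible here, while the pipeline is running.
      System.out.println("document: " + json);
      out.output(json);
    }
  }

  // Hypothetical helper: builds and runs the pipeline for one file or file pattern.
  static PipelineResult.State run(String filePattern) {
    // No runner is set, so Beam falls back to the DirectRunner (assumed to be available).
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    PCollection<String> documents =
        p.apply("MatchFile", FileIO.match().filepattern(filePattern)) // file metadata only
            .apply("OpenFile", FileIO.readMatches())                  // readable file handles
            .apply("ReadWholeFile", ParDo.of(new ReadWholeFileFn())); // one String per file

    // Further transforms (for example, JSON parsing) would be applied to documents here.
    return p.run().waitUntilFinish();
  }

  public static void main(String[] args) {
    run("/Users/xyz/eclipse-workspace/beam-prototype/test.json");
  }
}
```

Each element of documents is then the text of one whole file, so a JSON library of your choice can parse it in a later transform. This approach loads the entire file into memory, which suits small files such as the one in the question.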

"The above code is not reading the json file, it is printing like"

The printed result is expected. The variable lines does not actually contain the JSON strings in the file. lines is a PCollection of Strings; it simply represents the state of the pipeline after a transform is applied, and nothing is read until the pipeline is executed with p.run(). Elements in the pipeline are accessed by applying subsequent transforms, and the actual JSON strings can be accessed inside the implementation of a transform.
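To illustrate accessing elements inside a transform, here is a minimal sketch that prints every line read by TextIO. It is not the only way to do this. The class name PrintLines, the DoFn name PrintFn, the run helper and the step labels are hypothetical, and the default DirectRunner (beam-runners-direct-java) is assumed to be on the classpath instead of the SparkRunner used in the question.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class PrintLines {

  // Hypothetical DoFn: called once per element while the pipeline runs.
  static class PrintFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> out) {
      // Here line is an actual String from the file, not a PCollection handle.
      System.out.println("line: " + line);
      out.output(line);
    }
  }

  // Hypothetical helper: builds and runs the pipeline for one file.
  static PipelineResult.State run(String path) {
    // No runner is set, so Beam falls back to the DirectRunner (assumed to be available).
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // Building the pipeline only describes the work; no file is read yet.
    PCollection<String> lines = p.apply("ReadMyFile", TextIO.read().from(path));
    lines.apply("PrintEachLine", ParDo.of(new PrintFn()));

    // The file is read and PrintFn is invoked only when the pipeline is run.
    return p.run().waitUntilFinish();
  }

  public static void main(String[] args) {
    run("/Users/xyz/eclipse-workspace/beam-prototype/test.json");
  }
}
```

Because elements may be processed in parallel, the order of the printed lines is not guaranteed.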
