How to read XML files from the Apache Spark framework?


Question


I did come across a mini tutorial for data preprocessing using Spark here: http://ampcamp.berkeley.edu/big-data-mini-course/featurization.html


However, it only discusses parsing text files. Is there a way to parse XML files with Spark?

Answer


I have not used it myself, but the approach is the same as for Hadoop. For example, you can use StreamXmlRecordReader (https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html) to process the XML. The reason you need a record reader is that you want to control the record boundaries for each element processed; otherwise the default, LineRecordReader, would process one line at a time. It would help to get more familiar with the record-reader concept in Hadoop.
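To make the record-boundary point concrete, here is a minimal plain-Python sketch of what an XML record reader does: instead of splitting the input on newlines, it scans for a begin tag and an end tag and emits everything in between as one record, so each record is a complete XML element. (The function name, tags, and sample data are illustrative, not part of any Hadoop API.)

```python
def split_xml_records(text, begin_tag, end_tag):
    """Return substrings spanning begin_tag..end_tag, inclusive --
    one complete XML element per record, regardless of line breaks."""
    records = []
    pos = 0
    while True:
        start = text.find(begin_tag, pos)
        if start == -1:
            break
        end = text.find(end_tag, start)
        if end == -1:
            break
        end += len(end_tag)
        records.append(text[start:end])
        pos = end
    return records

raw = """<books>
  <book><title>Spark</title></book>
  <book><title>Hadoop</title></book>
</books>"""

records = split_xml_records(raw, "<book>", "</book>")
# Two records, each a self-contained <book> element.
```

A line-oriented reader would have handed Spark fragments like `  <book><title>Spark</title></book>`, one per line, and would break as soon as an element spans multiple lines; boundary-based splitting does not care where the line breaks fall.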


And of course you will have to use SparkContext's hadoopRDD or hadoopFile methods (http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext), passing the InputFormatClass as an option. In case Java is your preferred language, similar alternatives exist.
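Once the record reader hands Spark whole elements, the per-record parsing is ordinary XML work in a map step. A plain-Python sketch of that step (the record strings and helper name are illustrative; in Spark this `map` would be an `rdd.map(parse_title)` on the RDD returned by `hadoopFile`):

```python
import xml.etree.ElementTree as ET

# Hypothetical records, as a record reader would hand them to Spark:
# each one is a complete, well-formed XML element.
records = [
    "<book><title>Spark</title></book>",
    "<book><title>Hadoop</title></book>",
]

def parse_title(record):
    """Parse one record and pull out the <title> text."""
    return ET.fromstring(record).findtext("title")

titles = list(map(parse_title, records))
# titles == ["Spark", "Hadoop"]
```

This is exactly why controlling the record boundary matters: `ET.fromstring` requires a complete element, which a line-based default cannot guarantee.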
