XML processing in Spark
Problem Description
Scenario: My input will be multiple small XMLs and I am supposed to read these XMLs as RDDs, perform a join with another dataset to form an RDD, and send the output as XML.
Is it possible to read XML using Spark and load the data as an RDD? If it is possible, how will the XML be read?
Sample XML:
<root>
  <users>
    <user>
      <account>1234</account>
      <name>name_1</name>
      <number>34233</number>
    </user>
    <user>
      <account>58789</account>
      <name>name_2</name>
      <number>54697</number>
    </user>
  </users>
</root>
How will this be loaded into the RDD?
Recommended Answer
Yes, it is possible, but the details will differ depending on the approach you take.
- If files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads the data as RDD[(String, String)], where the first element is the path and the second is the file content. Then you parse each file individually, just as you would in local mode.
- For larger files you can use Hadoop input formats.
- If the structure is simple, you can split records using textinputformat.record.delimiter. You can find a simple example here. The input is not XML, but it should give you an idea of how to proceed. Otherwise, Mahout provides XmlInputFormat (https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java).
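The record-delimiter idea can be sketched as follows, assuming an existing SparkContext `sc` and a hypothetical input file; splitting on the closing tag yields one chunk per <user> element (the last chunk will contain the trailing document tags and may need trimming):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Tell TextInputFormat to end each record at </user> instead of a newline
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "</user>")

val records = sc
  .newAPIHadoopFile(
    "hdfs:///input/users.xml",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map { case (_, text) => text.toString.trim }  // one <user>…​ chunk per element
```

Each element of `records` can then be completed with the closing tag and parsed with an ordinary XML parser.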
Finally, it is possible to read the file using SparkContext.textFile and adjust later for records spanning between partitions. Conceptually it means something similar to creating a sliding window or partitioning records into groups of fixed size:

- use mapPartitionsWithIndex to identify records broken between partitions and collect the broken records
- use a second mapPartitionsWithIndex to repair the broken records
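The two passes above might be sketched as follows. This is a rough conceptual illustration, not a complete implementation: it assumes `rdd` is an RDD[String] where at most the first element of each partition is the broken tail of the previous partition's last record, and it glues boundary pieces back together by string concatenation:

```scala
// Pass 1: collect the first (possibly partial) element of every partition
val heads = rdd.mapPartitionsWithIndex { (i, iter) =>
  if (iter.hasNext) Iterator((i, iter.next())) else Iterator.empty
}.collectAsMap()

// Pass 2: drop each partition's head (it belongs to the previous
// partition) and append the next partition's head to the last record
val repaired = rdd.mapPartitionsWithIndex { (i, iter) =>
  val rest    = if (i == 0) iter else iter.drop(1)
  val records = rest.toList
  heads.get(i + 1) match {
    case Some(next) if records.nonEmpty =>
      (records.init :+ (records.last + next)).iterator
    case _ => records.iterator
  }
}
```

Collecting the heads to the driver is cheap because there is only one per partition; the second pass then runs entirely in parallel.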
Edit
There is also the relatively new spark-xml package, which allows you to extract specific records by tag:

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "foo")
  .load("bar.xml")
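Applied to the sample XML at the top of the question (with a hypothetical file name), rowTag would be set to "user", so that each <user> element becomes one DataFrame row with account, name, and number columns:

```scala
// Read every <user> element as a row; nested <root><users> wrappers
// above the rowTag are skipped by spark-xml
val users = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "user")
  .load("users.xml")
```

From there a join is a DataFrame operation, e.g. `users.join(other, "account")`, and the result can be written back out with the same data source via `write.format("com.databricks.spark.xml")`.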