How do I process an XML file in Spark?

Question

I am learning Spark and Scala. I have a snippet that processes an XML literal, but when I try to load the XML from a file I cannot make it work. I am probably missing a key concept and would appreciate some help. I am using the Cloudera VM, which has Spark 1.6 and Scala 2.10.5.

Scenario: read the XML, extract id and name, and display them as id@name.

scala> import scala.xml._
scala> val strxml = <employees>
     | <employee><id>1</id><name>chris</name></employee>
     | <employee><id>2</id><name>adam</name></employee>
     | <employee><id>3</id><name>karl</name></employee>
     | </employees>
strxml: scala.xml.Elem = 
<employees>
<employee><id>1</id><name>chris</name></employee>
<employee><id>2</id><name>adam</name></employee>
<employee><id>3</id><name>karl</name></employee>
</employees>

scala> val t = strxml.flatMap(line => line \\ "employee")
t: scala.xml.NodeSeq = NodeSeq(<employee><id>1</id><name>chris</name></employee>, <employee><id>2</id><name>adam</name></employee>, <employee><id>3</id><name>karl</name></employee>)

scala> t.map(l => (l \\ "id").text + "@" + (l \\ "name").text).foreach(println)
1@chris
2@adam
3@karl

Loading it from a file (an exception is thrown; what am I doing wrong here?):

scala> val filexml = sc.wholeTextFiles("file:///home/cloudera/test*")
filexml: org.apache.spark.rdd.RDD[(String, String)] = file:///home/cloudera/test* MapPartitionsRDD[66] at wholeTextFiles at <console>:30

scala> val lines = filexml.map(line => XML.loadString(line._2))
lines: org.apache.spark.rdd.RDD[scala.xml.Elem] = MapPartitionsRDD[89] at map at <console>:32

scala> val ft = lines.map(l => l \\ "employee")
ft: org.apache.spark.rdd.RDD[scala.xml.NodeSeq] = MapPartitionsRDD[99] at map at <console>:34

scala> ft.map(l => (l \\ "id").text + "@" + (l \\ "name").text).foreach(println)

Exception in task 0.0 in stage 63.0 (TID 63)
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog

File contents

test.xml

<employees>
<employee><id>1</id><name>chris</name></employee>
<employee><id>2</id><name>adam</name></employee>
<employee><id>3</id><name>karl</name></employee>
</employees>

test2.xml

<employees>
<employee><id>4</id><name>hive</name></employee>
<employee><id>5</id><name>elixir</name></employee>
<employee><id>6</id><name>spark</name></employee>
</employees>

Recommended answer

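Answering my own question. Two things changed. The glob is narrowed to test*.xml, so only the XML files are read; the SAXParseException ("Content is not allowed in prolog") means the parser met non-XML content at the very start of some input, so the broader test* pattern had probably matched a file that was not XML. And map is replaced by flatMap, so each employee element becomes its own record.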

scala> val filexml = sc.wholeTextFiles("file:///Volumes/BigData/sample_data/test*.xml")
filexml: org.apache.spark.rdd.RDD[(String, String)] = file:///Volumes/BigData/sample_data/test*.xml MapPartitionsRDD[1] at wholeTextFiles at <console>:24

scala> val lines = filexml.flatMap(line => XML.loadString(line._2) \\ "employee")
lines: org.apache.spark.rdd.RDD[scala.xml.Node] = MapPartitionsRDD[3] at flatMap at <console>:29

scala> lines.map(line => (line \\ "id").text + "@" + (line \\ "name").text).foreach(println)
1@chris
2@adam
3@karl
4@hive
5@elixir
6@spark
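
The difference between map and flatMap here, and one way to tolerate a stray non-XML file, can be sketched without a cluster. This is a minimal sketch, not the poster's code: the object name EmployeeXmlSketch, the helpers idAtName, employees and mergedPerFile, and the in-memory files list (a stand-in for the (path, content) pairs that sc.wholeTextFiles returns) are all hypothetical. It assumes scala.xml is on the classpath, which is the case for Scala 2.10; later Scala versions need the separate scala-xml module.

```scala
import scala.util.Try
import scala.xml.{Node, NodeSeq, XML}

object EmployeeXmlSketch {
  // Format a single <employee> node as "id@name".
  def idAtName(employee: Node): String =
    (employee \ "id").text + "@" + (employee \ "name").text

  // Parse one file's content into its <employee> nodes.
  // Content that is not well-formed XML yields an empty list
  // instead of throwing SAXParseException.
  def employees(content: String): List[Node] =
    Try(XML.loadString(content))
      .map(root => (root \\ "employee").toList)
      .getOrElse(Nil)

  // Hypothetical stand-in for the (path, content) pairs of sc.wholeTextFiles.
  val files: Seq[(String, String)] = Seq(
    "test.xml" -> ("<employees>" +
      "<employee><id>1</id><name>chris</name></employee>" +
      "<employee><id>2</id><name>adam</name></employee>" +
      "</employees>"),
    "test.txt" -> "not xml at all"
  )

  // What a plain map per file leads to: the whole NodeSeq is treated as one
  // record, and .text concatenates the text of every matching node.
  def mergedPerFile(content: String): String = {
    val all: NodeSeq = XML.loadString(content) \\ "employee"
    (all \\ "id").text + "@" + (all \\ "name").text
  }

  def main(args: Array[String]): Unit = {
    // One merged line for the whole file: 12@chrisadam
    println(mergedPerFile(files.head._2))

    // flatMap emits one Node per employee; the non-XML entry contributes nothing.
    files
      .flatMap { case (_, content) => employees(content) }
      .map(idAtName)
      .foreach(println) // 1@chris, then 2@adam
  }
}
```

In the Spark shell the same helpers should slot into the answer's pipeline as filexml.flatMap(f => employees(f._2)).map(idAtName). Silently skipping unparsable files is a design choice; logging the offending path from f._1 may be preferable when bad input should be noticed.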
