How to parse XML files in Apache Spark?
Question
How can I parse an XML file that contains a list of identical nodes in Apache Spark?
Example file:
<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.4.0 (25361 thorn-02.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">
<bounds minlat="48.8306100" minlon="2.3310900" maxlat="48.8337900" maxlon="2.3389100"/>
<node id="430785" visible="true" version="8" changeset="24482318" timestamp="2014-08-01T14:24:53Z" user="dhuyp" uid="1779584" lat="48.8340725" lon="2.3309196"/>
<node id="661209" visible="true" version="6" changeset="9914127" timestamp="2011-11-22T21:46:44Z" user="lapinos03" uid="33634" lat="48.8337517" lon="2.3333992"/>
<node id="24912996" visible="true" version="2" changeset="806076" timestamp="2009-03-14T10:38:25Z" user="Goon" uid="24657" lat="48.8302268" lon="2.3338015">
<tag k="crossing" v="uncontrolled"/>
<tag k="highway" v="traffic_signals"/>
</node>
<node id="24912994" visible="true" version="5" changeset="5904801" timestamp="2010-09-28T15:32:01Z" user="maouth-" uid="322872" lat="48.8301333" lon="2.3309869">
<tag k="highway" v="mini_roundabout"/>
</node>
</osm>
Recommended answer
As mentioned in another answer, spark-xml from Databricks is one way to read XML. However, at the time of writing, spark-xml has a bug that prevents it from importing self-closing elements. To work around this, you can load the entire XML document as a single row by using the root element as the row tag, and then explode the repeated node elements, for example:
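// Path to the XML file; replace with your own location.
val pathToYourData = "Z:/test.xml"
// Use the root element <osm> as the row tag so the whole document loads as one row.
val osm = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "osm").load(pathToYourData)
// Turn the array of <node> elements into one row per node.
val nodes = osm.selectExpr("explode(node) as node")
// Expand the node struct into top-level columns and print them.
nodes.select("node.*").show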
/*
+------+----------+--------+----------+---------+--------------------+-------+---------+--------+--------+--------------------+
|#VALUE|@changeset| @id| @lat| @lon| @timestamp| @uid| @user|@version|@visible| tag|
+------+----------+--------+----------+---------+--------------------+-------+---------+--------+--------+--------------------+
| null| 24482318| 430785|48.8340725|2.3309196|2014-08-01T14:24:53Z|1779584| dhuyp| 8| true| null|
| null| 9914127| 661209|48.8337517|2.3333992|2011-11-22T21:46:44Z| 33634|lapinos03| 6| true| null|
| null| 806076|24912996|48.8302268|2.3338015|2009-03-14T10:38:25Z| 24657| Goon| 2| true|[[null,crossing,u...|
| null| 5904801|24912994|48.8301333|2.3309869|2010-09-28T15:32:01Z| 322872| maouth-| 5| true|[[null,highway,mi...|
+------+----------+--------+----------+---------+--------------------+-------+---------+--------+--------+--------------------+
*/
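The tag column in the output above is still a nested array, and it is null for nodes that have no tag children. As a minimal sketch of how it could be flattened into one row per tag, the helper below (flattenTags is a hypothetical name, not part of spark-xml) assumes the nodes DataFrame from the snippet above and the field names visible in its output (@id on the node, @k and @v inside each tag). Those names depend on the spark-xml version and its options, so they may differ in your setup.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper: expects a DataFrame with a single struct column `node`
// that has an `@id` field and a `tag` array of structs with `@k` and `@v`
// fields (assumed from the output shown above).
def flattenTags(nodes: DataFrame): DataFrame = {
  nodes
    // explode emits one row per array element; nodes whose tag array is
    // null or empty produce no rows.
    .select(
      col("node").getField("@id").as("id"),
      explode(col("node").getField("tag")).as("tag"))
    // Pull the key and value attributes out of each tag struct.
    .select(
      col("id"),
      col("tag").getField("@k").as("key"),
      col("tag").getField("@v").as("value"))
}
```

Calling flattenTags(nodes).show in the same spark-shell session should list one row per node id and tag key/value pair. getField is used instead of a SQL expression string so that the @ in the field names does not need backtick quoting.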