How to parse nested XML inside a text file using Spark RDD?
Problem description
I have a text file where each line contains an embedded XML fragment, like this:
1234^12^999^`<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>`^23232
We can parse a normal XML file easily using Scala's XML support or the Databricks XML data source, but how do I parse XML that is embedded inside a line of text?
The XML data alone can be extracted using:
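// String.split takes a regex, so the ^ delimiter must be escaped; fields are strings, so compare with "100".
// In the sample line above, the XML is the fourth field (index 3).
val top5duration = data.map(line => line.split("\\^")).filter(line => line(2) == "100").map(line => line(3))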
But how do I proceed if I want to extract the value for each key?
Recommended answer
I tried parsing the data above at the RDD level, without using explode on a DataFrame. Please suggest any improvements.
- Read the data as a text file and define a schema.
- Split each line using the delimiter ^.
- Filter out bad records that don't conform to the schema.
- Match the data against the schema defined earlier.
Now you will have the data in a tuple like the one below, and what remains is to parse the XML in the middle.
(1234, 12, 999, "<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>", 23232)
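The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the `Record` case class, its field names (`f1`, `f2`, `f4`), and the `ParseLines` object are hypothetical placeholders, and it assumes every line has exactly five ^-separated fields with the XML in the fourth position and no ^ inside the XML itself.

```scala
import scala.util.Try

// Hypothetical schema: field names are placeholders, not taken from the original post.
final case class Record(id: Int, f1: Int, f2: Int, xml: String, f4: Int)

object ParseLines {
  // Split on a literal '^' (escaped, because String.split takes a regex).
  // The -1 limit keeps trailing empty fields so the field count stays honest.
  // Lines with the wrong number of fields, or with non-numeric values in the
  // numeric columns, yield None and can be dropped as bad records.
  def parseLine(line: String): Option[Record] =
    line.split("\\^", -1) match {
      case Array(a, b, c, xml, d) =>
        Try(Record(a.trim.toInt, b.trim.toInt, c.trim.toInt, xml.trim, d.trim.toInt)).toOption
      case _ => None
    }
}
```

On an RDD of lines, something like `sc.textFile(path).flatMap(line => ParseLines.parseLine(line))` would then keep only the records that match the schema, since `flatMap` discards the `None` results.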
xml.attribute("key"), as it will return all the keys.
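One possible way to get a value for each key, rather than all keys at once, is to walk the `ab` child elements and read both attributes from each one. The sketch below assumes the scala-xml module is on the classpath and that each fragment has the `<row><ab key="..." value="..."/></row>` shape shown in the question; the `ExtractAttrs` object and `keyValues` function are hypothetical names used for illustration.

```scala
import scala.xml.XML

object ExtractAttrs {
  // Parse one <row> fragment and return a (key, value) pair for each <ab> child.
  // XML.loadString throws on malformed input, so callers may want to wrap this in Try.
  def keyValues(xmlText: String): List[(String, String)] = {
    val row = XML.loadString(xmlText)
    (row \ "ab").toList.map { ab =>
      ((ab \ "@key").text, (ab \ "@value").text)
    }
  }
}

// Example with the fragment from the question:
// ExtractAttrs.keyValues("""<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>""")
// returns List((someKey,someValue), (someKey1,someValue1))
```

Applied to an RDD of XML strings, `rdd.flatMap(x => ExtractAttrs.keyValues(x))` would produce one element per key/value pair.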