How to parse nested XML inside a text file using Spark RDD?

Views: 104 | Published: 2021/4/8 20:17:41 | apache-spark

I have lines of text with XML embedded in them, like this:

1234^12^999^<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>^23232

We can easily parse a normal XML file using Scala's XML support or the Databricks spark-xml format, but how do I parse XML that is embedded inside text?

XML data alone can be extracted using:

val top5duration = data.map(line => line.split("\\^")).filter(fields => fields(2) == "100").map(fields => fields(3))
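
Note that `String.split` takes a regular expression, in which a bare `^` is a start-of-input anchor rather than a literal caret, and that the split fields are strings, so comparing one with the number 100 is never true. A minimal sketch on the sample line (the object and value names are illustrative, not from the question):

```scala
object SplitOnCaret {
  // Sample line from the question.
  val line: String =
    "1234^12^999^<row><ab key=\"someKey\" value=\"someValue\"/></row>^23232"

  // Escaping the caret makes the regex match a literal '^'.
  val escaped: Array[String] = line.split("\\^")

  // The Char overload from Scala's StringOps avoids regex syntax altogether.
  val byChar: Array[String] = line.split('^')

  // The XML is the fourth field, i.e. index 3.
  val xmlField: String = escaped(3)

  // Fields are Strings, so compare against a String (or convert with toInt).
  val thirdIs999: Boolean = escaped(2) == "999"
}
```

With the escaped delimiter the sample line yields five fields, and index 3 holds the `<row>` element.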

But how do I proceed if I want to extract the value for each key?

I tried parsing the data above at the RDD level, without using explode (DataFrame). Please suggest any improvements.

1. Read the data as a text file and define a schema.
2. Split each line on the ^ delimiter.
3. Filter out bad records that don't conform to the schema.
4. Match the data against the schema defined earlier.

Now you will have data like the tuple below, and what remains is to parse the embedded XML field.

(1234,12,999,"<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>",23232)
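
The steps up to this tuple could be sketched roughly as follows. The `Record` case class and its field names are hypothetical, since the post does not name the columns, and the sketch assumes the XML itself never contains a `^`. The parsing is kept in a plain function so it can be passed to `flatMap` on an `RDD[String]`.

```scala
import scala.util.Try

object ParseRecords {
  // Hypothetical schema: the field names are assumptions, not from the post.
  final case class Record(id: Long, f2: Int, f3: Int, xml: String, f5: Long)

  // Returns None for lines that do not conform to the schema
  // (wrong number of fields or non-numeric values).
  def parseLine(line: String): Option[Record] =
    line.split("\\^", -1) match { // -1 keeps trailing empty fields
      case Array(id, f2, f3, xml, f5) =>
        Try(Record(id.trim.toLong, f2.trim.toInt, f3.trim.toInt, xml, f5.trim.toLong)).toOption
      case _ => None
    }

  // Usage with Spark (assumes an existing SparkContext `sc` and a
  // hypothetical input path):
  //   val records = sc.textFile("input.txt").flatMap(line => parseLine(line))
}
```

Using `flatMap` with an `Option` combines the filter and match steps: malformed lines are dropped and well-formed lines become typed records.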

xml.attribute("key") as it will return all the keys.
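
One possible way to get the value for each key without a DataFrame explode is to parse the XML string with `scala.xml.XML.loadString`, select the `ab` child elements, and read their `key` and `value` attributes. This is a sketch that assumes the scala-xml module is on the classpath; `ExtractPairs` and `keyValues` are illustrative names.

```scala
import scala.xml.XML

object ExtractPairs {
  // Parses one <row> element and returns its (key, value) attribute pairs.
  def keyValues(xml: String): List[(String, String)] = {
    val row = XML.loadString(xml)
    (row \ "ab").toList.map { ab =>
      ((ab \ "@key").text, (ab \ "@value").text)
    }
  }

  // Usage with Spark (assumes a hypothetical RDD[String] named xmlRdd
  // holding the XML field of each record):
  //   val pairs = xmlRdd.flatMap(xml => keyValues(xml))
}
```

For the sample row this gives `List((someKey,someValue), (someKey1,someValue1))`, and `flatMap` then produces one output element per `ab` entry, which is the RDD-level counterpart of explode.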