Apply a logic for a particular column in a dataframe in Spark
Question
I have a DataFrame that was imported from MySQL:
dataframe_mysql.show()
+----+---------+-------------------------------------------------------+
| id|accountid| xmldata|
+----+---------+-------------------------------------------------------+
|1001| 12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1002| 12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1003| 12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1004| 12347|<AccountSetup xmlns:xsi="test"><Customers test="test...|
+----+---------+-------------------------------------------------------+
The xmldata column contains XML tags, and I need to parse it into structured data in a separate DataFrame.
Previously I had the XML alone in a text file, and loaded it into a Spark DataFrame using "com.databricks.spark.xml":
spark-shell --packages com.databricks:spark-xml_2.10:0.4.1,com.databricks:spark-csv_2.10:1.5.0
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.xml")
.option("rowTag","Account").load("mypath/Account.xml")
The final output I got was structured:
df.show()
+----------+--------------------+--------------------+--------------+--------------------+-------+....
| AcctNbr| AddlParties| Addresses|ApplicationInd| Beneficiaries|ClassCd|....
+----------+--------------------+--------------------+--------------+--------------------+-------+....
|AAAAAAAAAA|[[Securities Amer...|[WrappedArray([D,...| T|[WrappedArray([11...| 35|....
+----------+--------------------+--------------------+--------------+--------------------+-------+....
Please advise how to achieve this when the XML content is inside a DataFrame column.
Answer
Since you are trying to pull the XML data column out to a separate DataFrame, you can still use the code from the spark-xml package. You just need to use their reader directly.
case class Data(id: Int, accountid: Int, xmldata: String)
val df = Seq(
Data(1001, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers></AccountSetup>"),
Data(1002, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"b\">e</Customers></AccountSetup>"),
Data(1003, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"c\">f</Customers></AccountSetup>")
).toDF
import com.databricks.spark.xml.XmlReader
val reader = new XmlReader()
// Set options using methods
reader.withRowTag("AccountSetup")
val rdd = df.select("xmldata").map(r => r.getString(0)).rdd
val xmlDF = reader.xmlRdd(spark.sqlContext, rdd)
However, a UDF with custom XML parsing, as philantrovert suggests, would probably be cleaner in the long run. Reference link for the reader class here
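To illustrate that UDF route, here is a minimal sketch using Scala's built-in XML support. The `parseCustomers` helper and the fields it extracts (the `test` attribute and the element text of `Customers`) are assumptions based on the sample rows above, not the real schema; adapt the paths to your actual XML.

```scala
import scala.xml.XML

// Hypothetical helper: pull the pieces we care about out of one xmldata string.
// Element and attribute names are taken from the sample rows; adjust for the real XML.
def parseCustomers(xml: String): (String, String) = {
  val customers = (XML.loadString(xml) \ "Customers").head
  ((customers \ "@test").text, customers.text)
}

// Registered as a Spark UDF and applied to the column
// (assumes a SparkSession is in scope and spark.implicits._ is imported):
//
//   import org.apache.spark.sql.functions.udf
//   val parseUdf = udf(parseCustomers _)
//   val parsed = df.withColumn("customer", parseUdf($"xmldata"))
```

Keeping the parsing logic in a plain function like this makes it testable without a Spark session, and the UDF wrapper stays a one-liner.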