Load XML string from Column in PySpark


Question

I have a JSON file in which one of the columns is an XML string.

I tried extracting this field and writing it to a file in a first step, then reading that file in the next step. But each row carries its own XML header tag, so the resulting file is not a valid XML file.

How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values?

The following does not work:

tr = spark.read.json("my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='book').load(tr.select("trans_xml"))

Thanks, Ram.

Answer

Try Hive XPath UDFs (LanguageManual XPathUDF):

>>> from pyspark.sql.functions import expr
>>> df.select(expr("xpath({0}, '{1}')".format(column_name, xpath_expression)))
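To make the pattern above concrete: the argument to `expr` is just a SQL string invoking Hive's `xpath()` UDF, which returns an array of strings. A minimal sketch of building that string (the column name `trans_xml` and the XPath below are hypothetical example values, not from the original data):

```python
# Build the Hive xpath() SQL expression as a plain string.
# `trans_xml` and the XPath are assumed example values.
column_name = "trans_xml"
xpath_expression = "/book/title/text()"
sql = "xpath({0}, '{1}')".format(column_name, xpath_expression)
print(sql)  # xpath(trans_xml, '/book/title/text()')
```

The resulting string is what gets passed to `expr(...)` inside `df.select(...)`; Hive also offers typed variants such as `xpath_string` when a single scalar value is wanted rather than an array.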

or a Python UDF:

>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> import xml.etree.ElementTree as ET
>>> schema = ... # Define schema
>>> def parse(s):
...     root = ET.fromstring(s)
...     result = ... # Select values
...     return result
>>> df.select(udf(parse, schema)(xml_column))
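To flesh out the sketch, here is what the `parse` function could look like for a hypothetical `<book>` row (the `title` and `author` tags and the sample XML are assumptions for illustration; in Spark you would wrap the function with `udf(parse, schema)` exactly as above):

```python
import xml.etree.ElementTree as ET

def parse(s):
    # Parse one XML string from the column and pull out a few fields.
    root = ET.fromstring(s)
    # findtext returns None when a tag is absent, which fits a nullable schema.
    return (root.findtext("title"), root.findtext("author"))

sample = "<book><title>Spark</title><author>Ram</author></book>"
print(parse(sample))  # ('Spark', 'Ram')
```

A matching `schema` for this example would be a `StructType` of two nullable `StringType` fields, so each returned tuple becomes a struct column that can be expanded with `select("parsed.*")`.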

