Load XML string from Column in PySpark


Problem description

I have a JSON file in which one of the columns is an XML string.

I tried extracting this field and writing it to a file in the first step, then reading that file in the next step. But each row has its own XML header tag, so the resulting file is not a valid XML file.

How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values?

The following doesn't work:

tr = spark.read.json("my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='book').load(tr.select("trans_xml"))

Thanks, Ram.

Solution

Try Hive XPath UDFs (LanguageManual XPathUDF):

>>> from pyspark.sql.functions import expr
>>> df.select(expr("xpath({0}, '{1}')".format(column_name, xpath_expression)))
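
As a rough sketch of the XPath approach, the snippet below builds the expression for one column. The column name `trans_xml` and the `book` root element come from the question; the `title` child element and the helper names `xpath_sql` and `select_titles` are assumptions made only for illustration. `xpath_string` is one of the typed variants that Spark SQL offers alongside `xpath`.

```python
def xpath_sql(func, column_name, xpath_expression):
    """Build a Spark SQL snippet such as xpath_string(col, '/a/b')."""
    # The XPath is inlined as a quoted literal in the SQL text.
    # Assumption: the expression itself contains no single quotes.
    if "'" in xpath_expression:
        raise ValueError("single quotes in the XPath are not handled here")
    return "{0}({1}, '{2}')".format(func, column_name, xpath_expression)


def select_titles(df):
    """Select the <title> values from the XML strings in df.trans_xml."""
    # Imported lazily so the string helper above works without Spark installed.
    from pyspark.sql.functions import expr

    return df.select(
        # xpath(...) yields an array of strings, one per matching node
        expr(xpath_sql("xpath", "trans_xml", "/book/title/text()")).alias("titles"),
        # xpath_string(...) yields the text of the first matching node
        expr(xpath_sql("xpath_string", "trans_xml", "/book/title")).alias("title"),
    )
```

With the DataFrame from the question, this would be called as `select_titles(tr)`. Because the XML is evaluated per row, the XML header repeated in each row does not get in the way.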

or Python UDF:

>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> import xml.etree.ElementTree as ET
>>> schema = ... # Define schema
>>> def parse(s):
...     root = ET.fromstring(s)
...     result = ... # Select values
...     return result
>>> df.select(udf(parse, schema)(xml_column))
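
To make the Python UDF outline above more concrete, here is a minimal sketch with the two placeholders filled in. It assumes each XML string looks like `<book><title>..</title><author>..</author></book>`; the `title` and `author` fields and the names `parse_book` and `make_parse_udf` are hypothetical, so adapt them to the real documents.

```python
import xml.etree.ElementTree as ET


def parse_book(s):
    """Parse one XML document string into a (title, author) tuple."""
    if s is None:
        return None
    try:
        root = ET.fromstring(s)
    except ET.ParseError:
        # Malformed XML becomes a null struct instead of failing the job.
        return None
    # findtext returns None when the child element is missing.
    return (root.findtext("title"), root.findtext("author"))


def make_parse_udf():
    """Wrap parse_book in a Spark UDF that returns a struct column."""
    # Imported lazily so parse_book can be tried out without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType([
        StructField("title", StringType(), True),
        StructField("author", StringType(), True),
    ])
    return udf(parse_book, schema)
```

A possible use with the DataFrame from the question: `tr.withColumn("book", make_parse_udf()("trans_xml")).select("book.title", "book.author")`. Returning `None` for unparseable rows is a design choice of this sketch; raising instead would surface bad input earlier.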
