Extracting tag attributes from XML using spark-xml

Question

I am loading an XML file using com.databricks.spark.xml, and I want to read a tag attribute using the SQL context.

XML:

<Receipt>
<Sale>
<DepartmentID>PR</DepartmentID>
<Tax TaxExempt="false" TaxRate="10.25"/>
</Sale>
</Receipt>

Loading the file:

val df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag","Receipt").load("/home/user/sale.xml");
df.registerTempTable("SPtable");

Printed schema:

root
 |-- Sale: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- DepartmentID: long (nullable = true)
 |    |    |-- Tax: string (nullable = true)

Now I want to extract the tag attribute TaxExempt from Tax. I tried the following code, but it gives me an error.

val tax =sqlContext.sql("select Sale.Tax.TaxExempt from SPtable");

Error:

org.apache.spark.sql.AnalysisException: cannot resolve 'Sale.Tax[TaxExempt]' due to data type mismatch: argument 2 requires integral type, however, 'TaxExempt' is of string type.; line 1 pos 7

Any help is much appreciated.

Recommended answer

First, print the schema of the DataFrame. In my case, with spark-xml version 0.3.3, it is printed as below:

|-- Sale: struct (nullable = true)
|    |-- DepartmentID: string (nullable = true)
|    |-- Tax: struct (nullable = true)
|    |    |-- #VALUE: string (nullable = true)
|    |    |-- @TaxExempt: boolean (nullable = true)
|    |    |-- @TaxRate: double (nullable = true)

Then, after registering the temp table, use the query below to select XML attributes:

sqlContext.sql("select Sale.Tax['@TaxRate'] as TaxRate from temptable").show();

The result is as follows:

+-------+
|TaxRate|
+-------+
|  10.25|
+-------+
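
The same attributes can also be read without SQL, through the DataFrame API. This is only a minimal sketch: it assumes `df` is the DataFrame loaded in the question and that its schema matches the one printed in this answer (Sale as a struct, attributes prefixed with `@`, as with spark-xml 0.3.3). The name `taxAttrs` is made up for illustration.

```scala
import org.apache.spark.sql.functions.col

// Assumption: df was loaded as in the question and has the schema shown in this answer
// (Sale: struct, Tax: struct with @TaxExempt and @TaxRate).
// getField looks up a nested struct field by its exact name, so '@' needs no quoting.
val taxAttrs = df.select(
  col("Sale").getField("Tax").getField("@TaxRate").as("TaxRate"),
  col("Sale").getField("Tax").getField("@TaxExempt").as("TaxExempt")
)

taxAttrs.show()
```

If the schema shows a different prefix (for example `_`), the field names passed to `getField` have to change accordingly.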

Starting from 0.4.1, I think attributes are prefixed with an underscore (_) by default. In that case, just use _ instead of @ when querying attributes.
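
To avoid depending on the default of whichever version is installed, the prefix can be set explicitly when loading the file. A minimal sketch, assuming a spark-xml version that supports the `attributePrefix` reader option and a `sqlContext` that already exists (as in spark-shell and in the question); the names `receipts` and `taxRate` are made up for illustration.

```scala
// Assumption: sqlContext already exists, and the installed spark-xml
// version supports the attributePrefix option.
val receipts = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Receipt")
  .option("attributePrefix", "_") // attributes become _TaxExempt, _TaxRate
  .load("/home/user/sale.xml")

// Check the actual field names before querying.
receipts.printSchema()

receipts.registerTempTable("temptable")

// Same query as above, with _ in place of @.
val taxRate = sqlContext.sql("select Sale.Tax['_TaxRate'] as TaxRate from temptable")
taxRate.show()
```

Printing the schema first is the safest check, since the field names in the query must match whatever prefix was actually applied.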
