Adding part of the parent Schema column to child in nested json in spark data frame


Problem description

I have the below XML that I am trying to load into a Spark data frame:

<?xml version="1.0"?>
<env:ContentEnvelope xsi:schemaLocation="http">
    <env:Header>
        <env:Info>
            <env:Id>urn:uuid:6d2af93bfbfc49da9805aebb6a38996d</env:Id>
            <env:TimeStamp>20171122T07:56:09+00:00</env:TimeStamp>
        </env:Info>
        <fun:OrgId>18227</fun:OrgId>
        <fun:DataPartitionId>1</fun:DataPartitionId>
    </env:Header>
    <env:Body minVers="0.0" majVers="1" contentSet="Fundamental">
        <env:ContentItem action="Overwrite">
            <env:Data xsi:type="sr:FinancialSourceDataItem">
                <sr:Source sourceId="344" organizationId="4295906830">
                    <sr:FilingDateTime>20171111T17:00:00+00:00</sr:FilingDateTime>
                    <sr:SourceTypeCode>10K</sr:SourceTypeCode>
                    <sr:StatementDate>20171030T00:00:00+00:00</sr:StatementDate>
                    <sr:IsFilingDateTimeEstimated>false</sr:IsFilingDateTimeEstimated>
                    <sr:ContainsPreliminaryData>false</sr:ContainsPreliminaryData>
                    <sr:CapitalChangeAdjustmentDate>20171030T00:00:00+00:00</sr:CapitalChangeAdjustmentDate>
                    <sr:CumulativeAdjustmentFactor>1.00000</sr:CumulativeAdjustmentFactor>
                    <sr:ContainsRestatement>false</sr:ContainsRestatement>
                    <sr:FilingDateTimeUTCOffset>300</sr:FilingDateTimeUTCOffset>
                    <sr:ThirdPartySourceCode>SS</sr:ThirdPartySourceCode>
                    <sr:ThirdPartySourcePriority>1</sr:ThirdPartySourcePriority>
                    <sr:Auditors>
                        <sr:Auditor auditorId="3541">
                            <sr:AuditorOpinionCode>UNQ</sr:AuditorOpinionCode>
                            <sr:IsPlayingAuditorRole>true</sr:IsPlayingAuditorRole>
                            <sr:IsPlayingTaxAdvisorRole>false</sr:IsPlayingTaxAdvisorRole>
                            <sr:AuditorEnumerationId>3024068</sr:AuditorEnumerationId>
                            <sr:AuditorOpinionId>3010546</sr:AuditorOpinionId>
                            <sr:IsPlayingCSRAuditorRole>false</sr:IsPlayingCSRAuditorRole>
                        </sr:Auditor>
                        <sr:Auditor auditorId="9574">
                            <sr:AuditorOpinionCode>UWE</sr:AuditorOpinionCode>
                            <sr:IsPlayingAuditorRole>true</sr:IsPlayingAuditorRole>
                            <sr:IsPlayingTaxAdvisorRole>false</sr:IsPlayingTaxAdvisorRole>
                            <sr:AuditorEnumerationId>3030421</sr:AuditorEnumerationId>
                            <sr:AuditorOpinionId>3010547</sr:AuditorOpinionId>
                            <sr:IsPlayingCSRAuditorRole>false</sr:IsPlayingCSRAuditorRole>
                        </sr:Auditor>
                    </sr:Auditors>
                    <sr:SourceTypeId>3011835</sr:SourceTypeId>
                    <sr:ThirdPartySourceCodeId>1000716240</sr:ThirdPartySourceCodeId>
                </sr:Source>
            </env:Data>
        </env:ContentItem>
    </env:Body>
</env:ContentEnvelope>

The main tag is <env:ContentEnvelope>. Then there are two parts: a header (<env:Header>) and a body (<env:Body>).

Header details such as <fun:OrgId> and <fun:DataPartitionId> will be the same for all the rows in <env:Body>.

From this I want to create two data frames: one for <sr:Source> and a second one for <sr:Auditor>.

For both data frames, action="Overwrite" will be the same, as a common column.

Also, because <sr:Auditor> is a child of <sr:Source>, a few columns like sourceId="344" and organizationId="4295906830" will be repeated in the <sr:Auditor> data frame.

This is what I have done so far to achieve this:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml")
      .option("rowTag", "env:ContentEnvelope")
      .load("s3://trfsmallfffile/XML")

    val dfHeader = dfContentEnvelope.withColumn("Header", dfContentEnvelope("env:Header")).select("Header.*")
    val dfDataPartitionId = dfHeader.select("fun:DataPartitionId")
    //dfDataPartitionId.show()

    //val dfBody = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:Body").load("s3://trfsmallfffile/XML")
    val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
    val dfType = dfContentItem.select("env:Data.*")
    //dfType.show()

    val srSource = dfType.withColumn("srSource", dfType("sr:Source")).select("srSource.*").drop("sr:Auditors").filter($"srSource".isNotNull)
    val srSourceAuditor = dfType.withColumn("srSource", explode(dfType("sr:Source.sr:Auditors.sr:Auditor"))).select("srSource.*")
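One note on the action="Overwrite" attribute mentioned above: spark-xml exposes XML attributes as struct fields with an underscore prefix by default (the attributePrefix option), so with the dfContentItem above it could be pulled out roughly like this (a sketch, untested):

```scala
// Sketch: spark-xml's default attributePrefix is "_", so the
// action="Overwrite" attribute on <env:ContentItem> surfaces as "_action".
val dfAction = dfContentItem.select($"_action".as("action"))
dfAction.show(false)
```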

So my question is: how can I get a parent data frame for <sr:Source> and a child data frame for <sr:Auditor>, with organizationId and sourceId carried from the parent to the child data frame?
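The constant header columns (fun:OrgId and fun:DataPartitionId) can be attached to either frame with a cross join, since the header yields exactly one row per file. A rough sketch, assuming Spark 2.1+ for crossJoin and reusing the dfHeader and srSource values from the attempt above:

```scala
// Sketch: the header produces exactly one row, so a cross join
// replicates OrgId/DataPartitionId onto every Source row (1 x N).
val dfHeaderCols = dfHeader.select($"fun:OrgId".as("OrgId"), $"fun:DataPartitionId".as("DataPartitionId"))
val srSourceWithHeader = srSource.crossJoin(dfHeaderCols)
```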

Answer

If you are looking to get two dataframes, one for the Source and one for the Auditors with the organizationId and sourceId of the Source dataframe, then you can use the following logic.

Observing the given data and your attempts, I can suggest that an explode function on the env:Body.env:ContentItem column would give you the parent dataframe:

import sqlContext.implicits._
import org.apache.spark.sql.functions._
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "env:ContentEnvelope")
  .load("s3://trfsmallfffile/XML")

val dfContentItem = dfContentEnvelope
  .withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem")))
  .select("column1.*")
val ParentDF = dfContentItem.select(
  $"env:Data.sr:Source._organizationId".as("organizationId"),
  $"env:Data.sr:Source._sourceId".as("sourceId"),
  $"env:Data.sr:Source".as("Source"))

which should give you:

+--------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|organizationId|sourceId|Source                                                                                                                                                                                                                                                 |
+--------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|4295906830    |344     |[4295906830,344,[WrappedArray([3541,3024068,UNQ,3010546,true,false,false], [9574,3030421,UWE,3010547,true,false,false])],20171030T00:00:00+00:00,false,false,1.0,20171111T17:00:00+00:00,300,false,10K,3011835,20171030T00:00:00+00:00,SS,1000716240,1]|
+--------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

And for the child dataframe you would need to explode sr:Auditor from the parent dataframe above:

val childDF=ParentDF.select($"organizationId", $"sourceId", explode($"Source.sr:Auditors.sr:Auditor").as("Auditors"))

which should give you:

+--------------+--------+-------------------------------------------+
|organizationId|sourceId|Auditors                                   |
+--------------+--------+-------------------------------------------+
|4295906830    |344     |[3541,3024068,UNQ,3010546,true,false,false]|
|4295906830    |344     |[9574,3030421,UWE,3010547,true,false,false]|
+--------------+--------+-------------------------------------------+
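If you then want the individual auditor fields as top-level columns, the struct can be flattened with another select. A sketch, with field names assumed from the struct layout shown above (XML attributes carry spark-xml's default "_" prefix):

```scala
// Sketch: flatten the Auditors struct; field names assumed from the schema above.
val flatAuditors = childDF.select(
  $"organizationId",
  $"sourceId",
  $"Auditors._auditorId".as("auditorId"),                      // attribute -> "_" prefix
  $"Auditors.sr:AuditorOpinionCode".as("auditorOpinionCode"),
  $"Auditors.sr:AuditorOpinionId".as("auditorOpinionId"))
flatAuditors.show(false)
```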

I hope the answer is helpful.
