如何在Hive中使用横向视图爆炸以获取XML数据格式? [英] How to use lateral view explode in Hive for XML data format?

查看:74
本文介绍了如何在Hive中使用横向视图爆炸以获取XML数据格式?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将XML格式的销售数据加载到Hive表中. 以下是数据的一小部分样本.

I am trying to load sales data which is in XML format to the Hive table. Below is a small sample of the data.

我知道,如果我将以下数据分离到多个表中,然后根据需要将它们联接,则可以将其加载到Hive中.但是只是想知道我是否可以将它们加载到单个表中,并且预期的输出应类似于所附的屏幕截图.

I am aware that I can load the data below to Hive if I separate it to several tables and then join them as required. But just wanted to know if I can load them in a single table and the expected output should look like the screenshot attached.

请帮助我使用应该使用的表格结构,以及如何有效地使用侧向视图爆炸选项来实现此目的.

Please help me with the table structure I should use and how can I use the lateral view explode option effectively to achieve this.

样本数据:

  <Store>
    <Version>1.1</Version>
    <StoreId>16695</StoreId>    
    <Bskt>
      <TillNo>4</TillNo>
      <BsktNo>1753</BsktNo>
      <DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
      <OpID>50056</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>29559</GTIN>
        <ItmDsc>CHOCALATE</ItmDsc>
      <ItmProm>
          <PromCD>CM</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>59653</GTIN>
        <ItmDsc>CORN FLAKES</ItmDsc>
      </Itm>
        <Itm>
        <ItmSeq>3</ItmSeq>
        <GTIN>42260</GTIN>
        <ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
        <ItmProm>
          <PromCD>MTSRO</PromCD>
          <OfferID>11766</OfferID>
        </ItmProm>
      </Itm>
    </Bskt>
    <Bskt>
      <TillNo>5</TillNo>
      <BsktNo>1947</BsktNo>
      <DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
      <OpID>50063</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>24064</GTIN>
        <ItmDsc>TOMATOES 2KG</ItmDsc>
        <ItmProm>
          <PromCD>INSTORE</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>81287</GTIN>
        <ItmDsc>ROTHMANS BLUE</ItmDsc>
        <ItmProm>
          <PromCD>TF</PromCD>
        </ItmProm>
      </Itm>
    </Bskt>
  </Store>  

所需的输出

在此处输入图片描述

表结构:

CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
    LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
    TBLPROPERTIES (
    "xmlinput.start"="<Store","xmlinput.end"="</Store>"
);

输出: 在此处输入图片描述

在下面的查询中尝试读取数据,它没有以我想要的方式显示结果.

Tried below query to read the data, It is not showing the results in the way i want.

select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
    LATERAL VIEW explode(DateTime) table1 as basket_dtm 
    LATERAL VIEW explode(BsktNo) table2 as basket_number
    LATERAL VIEW explode(TillNo) table3 as till_number;

结果:

在此处输入图片描述

推荐答案

对数组对象的爆炸类似于交叉联接. 因此,如果您有3列,每列包含2个元素的数组,则对所有列应用explode将为您提供8行.

Explode for array object works like cross join. So if you have 3 columns with each containing array with 2 elements, applying explode on all the columns will give you 8 rows.

您不能将一个对象从数组映射到另一个.

You can't map one object from array to another.

实际上,您可以使用posexplode,它为每个元素提供index.您可以根据条件使用它来加入.但是,当您有多列并且每列的数组大小不同时,这很棘手.

Actually you can by using posexplode which gives you index for each element. which you can use to join based on condition. However, that' tricky when you have multiple columns and the array size is different for each column.

解决方案

  • 如果要爆炸的列较少并且数组大小相同,请使用posexplode.对于你的情况,这是行不通的.所以
  • 将XML存储为复杂数据类型:将整个XML存储为复杂数据类型(不仅仅是数组),我说的是基于您的xml创建struct. 如果您没有太多复杂的xml,则可以实现.但是,在将文件转换为复杂数据类型时,xmlSerde不如JSONserde好.
  • Use posexplode if you have less column to explode and array size is same. for your case this is not going to work. So
  • Store XML as Complex Data Type : Store your whole XML as a complex data type (not just array), I am talking about creating a struct based on your xml. If you don't have much complex xml, you can achieve this. However xmlSerde is not as good as JSONserde when it comes to converting file to complex data type.

因此,您的情况是最佳解决方案.

  • 将XML转换为JSON.您可以使用NiFi或其他某种技术.
  • 使用JSONserde创建Hive表并加载此文件.
  • 根据您的要求创建视图.
  • Convert your XML to JSON. You can use NiFi or some other technology for that.
  • Create Hive table using JSONserde and load this file.
  • Create a view as per your requirement.

您的XML的JSON

{"Version":"1.1","StoreId":"16695","Bskt":[{"TillNo":"4","BsktNo":"1753","DateTime":"2017-10-31T11:19:34.000+11:00","OpID":"50056","Itm":[{"ItmSeq":"1","GTIN":"29559","ItmDsc":"CHOCALATE","ItmProm":{"PromCD":"CM"}},{"ItmSeq":"2","GTIN":"59653","ItmDsc":"CORNFLAKES"},{"ItmSeq":"3","GTIN":"42260","ItmDsc":"MILKCHOCOLATE162GM","ItmProm":{"PromCD":"MTSRO","OfferID":"11766"}}]},{"TillNo":"5","BsktNo":"1947","DateTime":"2017-10-31T16:24:59.000+11:00","OpID":"50063","Itm":[{"ItmSeq":"1","GTIN":"24064","ItmDsc":"TOMATOES2KG","ItmProm":{"PromCD":"INSTORE"}},{"ItmSeq":"2","GTIN":"81287","ItmDsc":"ROTHMANSBLUE","ItmProm":{"PromCD":"TF"}}]}]}

如果文件中包含制表符或其他空格,则

JsonSerde可能会出现错误.因此,始终最好将它们删除.

JsonSerde might give your error if you have tabs or other white spaces in your file. So it's always best to remove them.

配置单元表

create external table temp.test_json
(
Version string,
StoreId string,
Bskt array<struct<
                    BsktNo:string,
                    DateTime:string,
                    OpID:string,
                    TillNo:string,
                    Itm:array<struct<
                                        GTIN:string,
                                        ItmDsc:string,
                                        ItmSeq:string,
                                        ItmProm:struct<
                                                        OfferID:string,
                                                        PromCD:string
                                                        >

                                    >
                            >
                >
            >
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location '/tmp/test_json/table/';

创建视图

SELECT Version,
         StoreId,
         basket.bsktno,
         basket.tillno,
         basket.`datetime`,
         item.itmseq,
         item.itmdsc,
         item.gtin,
         item.itmprom.offerid,
         item.itmprom.promcd
FROM temp.test_json 
lateral view explode(bskt) b AS basket 
lateral view explode(basket.itm) i AS item

这篇关于如何在Hive中使用横向视图爆炸以获取XML数据格式?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆