如何在Hive中使用横向视图爆炸以获取XML数据格式? [英] How to use lateral view explode in Hive for XML data format?
问题描述
我正在尝试将XML格式的销售数据加载到Hive表中. 以下是数据的一小部分样本.
I am trying to load sales data which is in XML format to the Hive table. Below is a small sample of the data.
我知道,如果我将以下数据分离到多个表中,然后根据需要将它们联接,则可以将其加载到Hive中.但是只是想知道我是否可以将它们加载到单个表中,并且预期的输出应类似于所附的屏幕截图.
I am aware that I can load the data below to Hive if I separate it to several tables and then join them as required. But just wanted to know if I can load them in a single table and the expected output should look like the screenshot attached.
请帮助我使用应该使用的表格结构,以及如何有效地使用侧向视图爆炸选项来实现此目的.
Please help me with the table structure I should use and how can I use the lateral view explode option effectively to achieve this.
样本数据:
<Store>
<Version>1.1</Version>
<StoreId>16695</StoreId>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>29559</GTIN>
<ItmDsc>CHOCALATE</ItmDsc>
<ItmProm>
<PromCD>CM</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>59653</GTIN>
<ItmDsc>CORN FLAKES</ItmDsc>
</Itm>
<Itm>
<ItmSeq>3</ItmSeq>
<GTIN>42260</GTIN>
<ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
<ItmProm>
<PromCD>MTSRO</PromCD>
<OfferID>11766</OfferID>
</ItmProm>
</Itm>
</Bskt>
<Bskt>
<TillNo>5</TillNo>
<BsktNo>1947</BsktNo>
<DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
<OpID>50063</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>24064</GTIN>
<ItmDsc>TOMATOES 2KG</ItmDsc>
<ItmProm>
<PromCD>INSTORE</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>81287</GTIN>
<ItmDsc>ROTHMANS BLUE</ItmDsc>
<ItmProm>
<PromCD>TF</PromCD>
</ItmProm>
</Itm>
</Bskt>
</Store>
所需的输出
表结构:
CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
TBLPROPERTIES (
"xmlinput.start"="<Store","xmlinput.end"="</Store>"
);
输出: 在此处输入图片描述
在下面的查询中尝试读取数据,它没有以我想要的方式显示结果.
Tried below query to read the data, It is not showing the results in the way i want.
select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
LATERAL VIEW explode(DateTime) table1 as basket_dtm
LATERAL VIEW explode(BsktNo) table2 as basket_number
LATERAL VIEW explode(TillNo) table3 as till_number;
结果:
推荐答案
对数组对象的爆炸类似于交叉联接. 因此,如果您有3列,每列包含2个元素的数组,则对所有列应用explode将为您提供8行.
Explode for array object works like cross join. So if you have 3 columns with each containing array with 2 elements, applying explode on all the columns will give you 8 rows.
您不能将一个对象从数组映射到另一个.
You can't map one object from array to another.
实际上,您可以使用posexplode
,它为每个元素提供index
.您可以根据条件使用它来加入.但是,当您有多列并且每列的数组大小不同时,这很棘手.
Actually you can by using posexplode
which gives you index
for each element. which you can use to join based on condition. However, that' tricky when you have multiple columns and the array size is different for each column.
解决方案
- 如果要爆炸的列较少并且数组大小相同,请使用
posexplode
.对于你的情况,这是行不通的.所以 - 将XML存储为复杂数据类型:将整个XML存储为复杂数据类型(不仅仅是数组),我说的是基于您的xml创建
struct
. 如果您没有太多复杂的xml,则可以实现.但是,在将文件转换为复杂数据类型时,xmlSerde
不如JSONserde
好.
- Use
posexplode
if you have less column to explode and array size is same. for your case this is not going to work. So - Store XML as Complex Data Type : Store your whole XML as a complex data type (not just array), I am talking about creating a
struct
based on your xml. If you don't have much complex xml, you can achieve this. HoweverxmlSerde
is not as good asJSONserde
when it comes to converting file to complex data type.
因此,您的情况是最佳解决方案.
- 将XML转换为JSON.您可以使用
NiFi
或其他某种技术. - 使用
JSONserde
创建Hive表并加载此文件. - 根据您的要求创建视图.
- Convert your XML to JSON. You can use
NiFi
or some other technology for that. - Create Hive table using
JSONserde
and load this file. - Create a view as per your requirement.
您的XML的JSON
{"Version":"1.1","StoreId":"16695","Bskt":[{"TillNo":"4","BsktNo":"1753","DateTime":"2017-10-31T11:19:34.000+11:00","OpID":"50056","Itm":[{"ItmSeq":"1","GTIN":"29559","ItmDsc":"CHOCALATE","ItmProm":{"PromCD":"CM"}},{"ItmSeq":"2","GTIN":"59653","ItmDsc":"CORNFLAKES"},{"ItmSeq":"3","GTIN":"42260","ItmDsc":"MILKCHOCOLATE162GM","ItmProm":{"PromCD":"MTSRO","OfferID":"11766"}}]},{"TillNo":"5","BsktNo":"1947","DateTime":"2017-10-31T16:24:59.000+11:00","OpID":"50063","Itm":[{"ItmSeq":"1","GTIN":"24064","ItmDsc":"TOMATOES2KG","ItmProm":{"PromCD":"INSTORE"}},{"ItmSeq":"2","GTIN":"81287","ItmDsc":"ROTHMANSBLUE","ItmProm":{"PromCD":"TF"}}]}]}
如果文件中包含制表符或其他空格,则
JsonSerde
可能会出现错误.因此,始终最好将它们删除.
JsonSerde
might give your error if you have tabs or other white spaces in your file. So it's always best to remove them.
配置单元表
create external table temp.test_json
(
Version string,
StoreId string,
Bskt array<struct<
BsktNo:string,
DateTime:string,
OpID:string,
TillNo:string,
Itm:array<struct<
GTIN:string,
ItmDsc:string,
ItmSeq:string,
ItmProm:struct<
OfferID:string,
PromCD:string
>
>
>
>
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location '/tmp/test_json/table/';
创建视图
SELECT Version,
StoreId,
basket.bsktno,
basket.tillno,
basket.`datetime`,
item.itmseq,
item.itmdsc,
item.gtin,
item.itmprom.offerid,
item.itmprom.promcd
FROM temp.test_json
lateral view explode(bskt) b AS basket
lateral view explode(basket.itm) i AS item
这篇关于如何在Hive中使用横向视图爆炸以获取XML数据格式?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!