在Hive中爆炸一行XML数据 [英] Exploding a row of XML data in Hive

查看:96
本文介绍了在Hive中爆炸一行XML数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们已将XML数据作为名为XML的单个字符串列加载到Hadoop中.我们正在尝试将数据归一化或将其归为一类以进行处理(您知道像表一样!).尝试过explode函数,但未完全得到我们想要的.

We have XML data loaded into Hadoop as a single string column named XML. We are trying to retrieve to levels of data an normalize or explode it into single rows for processing (you know, like a table!) Have tried explode function, but not getting exactly what we want.

<Reports>
<Report ID="1">
<Locations>
  <Location ID="20001">
    <LocationName>Irvine Animal Shelter</LocationName>
  </Location>   
  <Location ID="20002">
    <LocationName>Irvine City Hall</LocationName>
  </Location>   
  </Locations>
</Report>
<Report ID="2">
<Locations>
  <Location ID="10001">
    <LocationName>California Fish Grill</LocationName>
  </Location>   
  <Location ID="10002">
    <LocationName>Fukada</LocationName>
  </Location>   
  </Locations>
</Report>
</Reports>

查询1

我们正在查询更高级别的Report.Id,然后查询子级的ID和名称(位置/位置).下面从本质上给了我们所有可能组合的笛卡尔乘积(在此示例中,是8行,而不是我们希望的4行.)

QUERY 1

We are querying the higher level Report.Id and then the id and name from the child (Locations/Location). The following gives us essentially a cartesian product of all possible combinations (in this example, 8 rows instead of the 4 we are hoping for.)

SELECT xpath_int(xml, '/Reports/Report/@ID') AS id, location_id, location_name 
FROM   xmlreports 
LATERAL VIEW explode(xpath(xml, '/Reports/Report/Locations/Location/@ID')) myTable1 AS location_id 
LATERAL VIEW explode(xpath(xml, '/Reports/Report/Locations/Location/LocationName/text()')) myTable2 AS location_name;

查询2

试图组合成一个结构然后爆炸,但这会返回两行和两个数组.

QUERY 2

Tried to group into a structure and then explode, but this returns two rows and two arrays.

SELECT id, loc.col1, loc.col2
FROM (
SELECT xpath_int(xml, '/Reports/Report/@ID') AS id, 
       array(struct(xpath(xml, '/Reports/Report/Locations/Location/@ID'), xpath(xml,     '/Reports/Report/Locations/Location/LocationName/text()'))) As foo
FROM   xmlreports) x
LATERAL VIEW explode(foo) exploded_table as loc;

结果

1   ["20001","20002"]       ["Irvine Animal Shelter","Irvine City Hall"]
2   ["10001","10002"]       ["California Fish Grill","Irvine Spectrum"]

我们想要的是

1   "20001" "Irvine Animal Shelter"
1   "20002" "Irvine City Hall"
2   "10001" "California Fish Grill"
2   "10002" "Irvine Spectrum"

似乎很想做,但却找不到任何示例.任何帮助,我们将不胜感激.

Seems like a common thing to want to do, but can't find any examples. Any help is greatly appreciated.

推荐答案

我看到了两种解决此问题的方法.

I see two ways to solve this problem.

  1. 创建自定义UDF,它将解析一个XML元素并返回所需的数组.在爆炸阵列之后.

  1. Create custom UDF which will parse one XML element and return array that you need. After that explode array.

使用子选择.

我使用子选择实现了解决方案2.即使在使用子选择时,Hive还是足够聪明",为此仅创建一个map-reduce作业,因此我认为您不会遇到性能问题.

I implemented solution 2 using subselects. Even when using subselects Hive is "smart enough" to create only one map-reduce job for this, so I think you will not have performance problems.

SELECT 
 l2.key,
 l2.rid,
 l2.location_id,
 location_name
FROM (
 SELECT 
  l1.key as key,  
  l1.rid as rid, 
  location_id as location_id,
  l1.xml as xml
 FROM (
   SELECT key, xml, rid
   FROM xmlreports
   LATERAL VIEW explode(xpath(xml, '/Reports/Report/@ID')) rids as rid
 ) l1
 LATERAL VIEW explode(xpath(l1.xml, concat('/Reports/Report[@ID = ',l1.rid, ']/Locations/Location/@ID'))) locids as location_id
) l2
LATERAL VIEW explode(xpath(l2.xml, concat('/Reports/Report[@ID = ',l2.rid, ']/Locations/Location[@ID = ', l2.location_id ,' ]/LocationName/text()'))) locnames as location_name;

在XML文件上运行此查询后,您提供了要搜索的结果

After running this query on XML file you provided I got results that you are searching for

1   1   20001   Irvine Animal Shelter
1   1   20002   Irvine City Hall
1   2   10001   California Fish Grill
1   2   10002   Fukada

希望这可以解决您的问题.

Hope this solves your problem.

关于, 恐龙

这篇关于在Hive中爆炸一行XML数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆