How to get first n elements in an array in Hive



Problem description

I use the split function to create an array in Hive. How can I get the first n elements of the array, so that I can go through that sub-array?

Code example:

select col1 from table
where split(col2, ',')[0:5]

The '[0:5]' slice looks like Python style, but it doesn't work here.

Solution

This is a tricky one.
First, get the Brickhouse jar.
Then add it to Hive:

add jar /path/to/jars/brickhouse-0.7.0-SNAPSHOT.jar;

Now create the two functions we will be using:

CREATE TEMPORARY FUNCTION array_index AS 'brickhouse.udf.collect.ArrayIndexUDF';
CREATE TEMPORARY FUNCTION numeric_range AS 'brickhouse.udf.collect.NumericRange';

The query will be:

select a,
       n as array_index,
       array_index(split(a, ','), n) as value_from_Array
from (
  select "abc#1,def#2,hij#3" a from dual
  union all
  select "abc#1,def#2,hij#3,zzz#4" a from dual
) t1
lateral view numeric_range(length(a) - length(regexp_replace(a, ',', '')) + 1) n1 as n

Explained:

select "abc#1,def#2,hij#3" a from dual
union all
select "abc#1,def#2,hij#3,zzz#4" a from dual

This just selects some test data; in your case, replace it with your own table. (dual is assumed to be an existing one-row helper table.)

lateral view numeric_range( length(a)-length(regexp_replace(a,',',''))+1 ) n1 as n

numeric_range is a UDTF that returns a table for a given range. In this case, I asked for a range starting at 0 (the default) and covering the number of elements in the string, calculated as the number of commas + 1.
This way, each row is repeated once per element in the given column, with n running from 0 up to the element count minus 1.
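
The comma-counting arithmetic can be checked on one of the test strings. This is a minimal sketch, assuming a Hive version that accepts SELECT without a FROM clause (0.13 or later); the second column uses the built-in size() on the split array purely as a cross-check and is not part of the original answer.

```sql
-- 'abc#1,def#2,hij#3,zzz#4' is 23 characters long; removing its 3 commas leaves 20.
SELECT length('abc#1,def#2,hij#3,zzz#4')
         - length(regexp_replace('abc#1,def#2,hij#3,zzz#4', ',', ''))
         + 1                                       AS n_by_commas,  -- 23 - 20 + 1 = 4
       size(split('abc#1,def#2,hij#3,zzz#4', ',')) AS n_by_size;    -- 4
```

Both columns give the number of elements, so either expression could serve as the argument to numeric_range.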

array_index(split(a,','),n)

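This is equivalent to split(a,',')[n], but Hive did not accept that form here: n is a column value rather than a constant, and older Hive versions reject non-constant array indexes.
So we get the n-th element for each duplicated row of the initial string, resulting in: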

abc#1,def#2,hij#3,zzz#4 0 abc#1
abc#1,def#2,hij#3,zzz#4 1 def#2
abc#1,def#2,hij#3,zzz#4 2 hij#3
abc#1,def#2,hij#3,zzz#4 3 zzz#4
abc#1,def#2,hij#3 0 abc#1
abc#1,def#2,hij#3 1 def#2
abc#1,def#2,hij#3 2 hij#3
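
As a side note on the bracket syntax mentioned above: indexing with a literal position is ordinary Hive, and the limitation described concerns an index that varies per row. A minimal sketch, again assuming a Hive version that allows SELECT without FROM:

```sql
-- Constant index: picks the element at position 1 (positions are zero-based).
SELECT split('abc#1,def#2,hij#3', ',')[1] AS second_element;
-- second_element: def#2
```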

If you really want a specific number of elements (say 5), then just use:
lateral view numeric_range(5) n1 as n
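
Applied to the table from the question, the fixed-size variant might look like the sketch below. The table name my_table is hypothetical, and col1 and col2 are taken from the question. The first statement relies on the two Brickhouse functions registered earlier; what array_index returns for a position beyond the end of a shorter array is not covered in the answer, so that case should be checked before relying on it. The second statement is a swapped-in alternative, not the answer's method: it uses Hive's built-in posexplode (available from Hive 0.13) and filters on the position.

```sql
-- Variant 1: Brickhouse functions, one output row per index 0..4.
SELECT col1,
       n AS idx,
       array_index(split(col2, ','), n) AS value_from_array
FROM my_table
LATERAL VIEW numeric_range(5) n1 AS n;

-- Variant 2 (alternative, built-ins only): explode with positions,
-- then keep positions 0..4, i.e. the first five elements that exist.
SELECT col1, pos, elem
FROM my_table
LATERAL VIEW posexplode(split(col2, ',')) e AS pos, elem
WHERE pos < 5;
```

Variant 2 only emits rows for elements that actually exist, so a string with three elements yields three rows rather than five.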
