Hadoop/Hive Collect_list without repeating items
Problem description
Following the post "Hive 0.12 - Collect_list", I am trying to find Java code for a UDAF that provides the same or similar functionality, but without sequentially repeated items.
For instance, collect_all() returns the sequence A, A, A, B, B, A, C, C
I would like the sequence A, B, A, C returned instead, with sequentially repeated items removed.
Does anyone know of a function in Hive 0.12 that does this, or has anyone written their own UDAF for it?
As always, thanks for the help.
I ran into a similar problem a while back. I didn't want to write a full-on UDAF, so I combined the Brickhouse collect UDAF with my own UDF. Say you have this data:
id value
1 A
1 A
1 A
1 B
1 B
1 A
1 C
1 C
1 D
2 D
2 D
2 D
2 D
2 F
2 F
2 F
2 A
2 W
2 A
My UDF was:
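package com.something;

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple Hive UDF that collapses runs of equal adjacent elements in an array.
public class RemoveSequentialDuplicates extends UDF {
    public ArrayList<Text> evaluate(ArrayList<Text> arr) {
        // Guard against NULL or empty input; arr.get(0) would throw otherwise.
        if (arr == null || arr.isEmpty()) {
            return arr;
        }
        ArrayList<Text> newList = new ArrayList<Text>();
        newList.add(arr.get(0));
        for (int i = 1; i < arr.size(); i++) {
            String front = arr.get(i).toString();
            String back = arr.get(i - 1).toString();
            // Keep an element only if it differs from the one before it.
            if (!back.equals(front)) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }
}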
and then my query was:
add jar /path/to/jar/brickhouse-0.7.1.jar;
add jar /path/to/other/jar/duplicates.jar;

create temporary function remove_seq_dups as 'com.something.RemoveSequentialDuplicates';
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select id
     , remove_seq_dups(value_array) no_dups
from (
  select id
       , collect(value) value_array
  from db.table
  group by id ) x
Output:
1 ["A","B","A","C","D"]
2 ["D","F","A","W","A"]
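The same collapsing logic can be checked without a Hive cluster. This is only a minimal sketch in plain Java: it assumes ordinary String lists instead of Hadoop Text values, and the class name SequentialDedup and the method removeSequentialDuplicates are hypothetical names for illustration, not part of Hive or Brickhouse.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical helper class for illustration; not part of Hive or Brickhouse.
class SequentialDedup {

    // Returns a new list in which each run of equal adjacent elements
    // is collapsed to a single element. A null input yields an empty list.
    static List<String> removeSequentialDuplicates(List<String> values) {
        List<String> result = new ArrayList<>();
        if (values == null) {
            return result;
        }
        String previous = null;
        boolean first = true;
        for (String current : values) {
            // Always keep the first element; afterwards keep only changes.
            if (first || !Objects.equals(previous, current)) {
                result.add(current);
            }
            previous = current;
            first = false;
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample data from the answer, in the order shown for each id.
        List<String> id1 = Arrays.asList("A", "A", "A", "B", "B", "A", "C", "C", "D");
        List<String> id2 = Arrays.asList("D", "D", "D", "D", "F", "F", "F", "A", "W", "A");
        System.out.println(removeSequentialDuplicates(id1)); // [A, B, A, C, D]
        System.out.println(removeSequentialDuplicates(id2)); // [D, F, A, W, A]
    }
}
```

Unlike the UDF, this version uses Objects.equals, so null elements inside the list are compared safely rather than throwing.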
As an aside, the built-in collect_list will not necessarily keep the elements of the list in the order they were grouped in; Brickhouse collect will. Hope this helps.