Hadoop/Hive Collect_list without repeating items

Problem description

Based on the post, Hive 0.12 - Collect_list, I am trying to locate Java code to implement a UDAF that will accomplish this or similar functionality but without a repeating sequence.

For instance, collect_all() returns a sequence A, A, A, B, B, A, C, C. I would like to have the sequence A, B, A, C returned, with sequentially repeated items removed.

Does anyone know of a function in Hive 0.12 that will accomplish this, or has anyone written their own UDAF?

As always, thanks for the help.

Solution

I ran into a similar problem a while back. I didn't want to write a full-on UDAF, so I combined brickhouse collect with my own UDF. Say you have this data:

id  value
1   A
1   A
1   A
1   B
1   B
1   A
1   C
1   C
1   D
2   D
2   D
2   D
2   D
2   F
2   F
2   F
2   A
2   W
2   A

My UDF was:

package com.something;

import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class RemoveSequentialDuplicates extends UDF {
    // Drops an element whenever it equals the element immediately before it.
    public ArrayList<Text> evaluate(ArrayList<Text> arr) {
        // Guard against null or empty input; arr.get(0) would otherwise throw.
        if (arr == null || arr.isEmpty()) {
            return arr;
        }
        ArrayList<Text> newList = new ArrayList<Text>();
        newList.add(arr.get(0));
        for (int i = 1; i < arr.size(); i++) {
            String front = arr.get(i).toString();
            String back = arr.get(i - 1).toString();

            if (!back.equals(front)) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }
}
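
To check the deduplication logic without a Hive installation, the same idea can be sketched in plain Java. This is only an illustration: the class name SequentialDedupDemo and the helper removeSequentialDuplicates are hypothetical, and it works on String rather than Text so that it has no Hive or Hadoop dependencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class SequentialDedupDemo {
    // Hypothetical helper (not part of Hive or brickhouse): keeps a value
    // only if it differs from the last value already kept.
    static List<String> removeSequentialDuplicates(List<String> values) {
        List<String> result = new ArrayList<>();
        if (values == null) {
            return result;
        }
        for (String v : values) {
            if (result.isEmpty() || !Objects.equals(result.get(result.size() - 1), v)) {
                result.add(v);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The two groups from the sample data, in their original order.
        List<String> id1 = Arrays.asList("A", "A", "A", "B", "B", "A", "C", "C", "D");
        List<String> id2 = Arrays.asList("D", "D", "D", "D", "F", "F", "F", "A", "W", "A");
        System.out.println(removeSequentialDuplicates(id1)); // prints [A, B, A, C, D]
        System.out.println(removeSequentialDuplicates(id2)); // prints [D, F, A, W, A]
    }
}
```

Comparing against the last kept value gives the same result as comparing each element with its predecessor, because a value is skipped only when it repeats the one just before it.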

And then my query was:

add jar /path/to/jar/brickhouse-0.7.1.jar;
add jar /path/to/other/jar/duplicates.jar;

create temporary function remove_seq_dups as 'com.something.RemoveSequentialDuplicates';
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select id
  , remove_seq_dups(value_array) no_dups
from (
  select id
    , collect(value) value_array
  from db.table
  group by id ) x

Output:

1   ["A","B","A","C","D"]
2   ["D","F","A","W","A"]

As an aside, the built-in collect_list will not necessarily keep the elements of the list in the order they were grouped in; brickhouse collect will. Hope this helps.
