Hadoop/Hive Collect_list without repeating items

Views: 336 | Published: 2018/5/31 19:01:12 | Tags: hadoop, hive, hiveql


Problem description

Based on the post, Hive 0.12 - Collect_list, I am trying to locate Java code to implement a UDAF that will accomplish this or similar functionality but without a repeating sequence.

For instance, collect_all() returns the sequence A, A, A, B, B, A, C, C. I would like to have the sequence A, B, A, C returned instead; that is, sequentially repeated items would be removed.

Does anyone know of a function in Hive 0.12 that will accomplish this, or has anyone written their own UDAF for it?

As always, thanks for the help.

Solution

I ran into a similar problem a while back. I didn't want to write a full-on UDAF, so I combined brickhouse collect with my own UDF. Say you have this data:

id  value
1   A
1   A
1   A
1   B
1   B
1   A
1   C
1   C
1   D
2   D
2   D
2   D
2   D
2   F
2   F
2   F
2   A
2   W
2   A

My UDF was:

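package com.something;

import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Collapses runs of consecutive equal elements: [A, A, B, A] -> [A, B, A].
public class RemoveSequentialDuplicates extends UDF {
    public ArrayList<Text> evaluate(ArrayList<Text> arr) {
        // Guard against NULL or empty arrays; arr.get(0) would throw otherwise.
        if (arr == null || arr.isEmpty()) {
            return arr;
        }
        ArrayList<Text> newList = new ArrayList<Text>();
        // The first element never has a predecessor, so it is always kept.
        newList.add(arr.get(0));
        for (int i = 1; i < arr.size(); i++) {
            String front = arr.get(i).toString();
            String back = arr.get(i - 1).toString();

            // Keep an element only when it differs from the one before it.
            if (!back.equals(front)) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }
}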

And then my query was:

add jar /path/to/jar/brickhouse-0.7.1.jar;
add jar /path/to/other/jar/duplicates.jar;

create temporary function remove_seq_dups as 'com.something.RemoveSequentialDuplicates';
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select id
  , remove_seq_dups(value_array) no_dups
from (
  select id
    , collect(value) value_array
  from db.table
  group by id ) x

Output:

1   ["A","B","A","C","D"]
2   ["D","F","A","W","A"]
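The collapsing logic can be checked without a Hive cluster. Below is a minimal sketch in plain Java that applies the same idea to the sample data from this answer; the class name `SequentialDedupDemo` and the generic helper `removeSequentialDuplicates` are made up for illustration and are not part of Hive or brickhouse. It uses `String` instead of `Text` so that it has no Hadoop dependencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical standalone demo; not the UDF itself and not tied to any Hive API.
public class SequentialDedupDemo {

    // Collapses runs of consecutive equal elements, e.g. [A, A, B, A] -> [A, B, A].
    public static <T> List<T> removeSequentialDuplicates(List<T> values) {
        List<T> result = new ArrayList<>();
        if (values == null) {
            return result; // treat a null input as an empty list
        }
        for (T value : values) {
            // Keep a value if nothing has been kept yet, or if it differs
            // from the most recently kept value.
            if (result.isEmpty()
                    || !Objects.equals(result.get(result.size() - 1), value)) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Values for id 1 and id 2, in the order shown in the sample data.
        List<String> id1 = Arrays.asList("A", "A", "A", "B", "B", "A", "C", "C", "D");
        List<String> id2 = Arrays.asList("D", "D", "D", "D", "F", "F", "F", "A", "W", "A");

        System.out.println("1   " + removeSequentialDuplicates(id1)); // 1   [A, B, A, C, D]
        System.out.println("2   " + removeSequentialDuplicates(id2)); // 2   [D, F, A, W, A]
    }
}
```

Comparing against the last kept value gives the same result as comparing against the previous input element, because within a run the previous element always equals the last value that was kept. Unlike the UDF, this sketch also tolerates null and empty inputs.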

As an aside, the built-in collect_list will not necessarily keep the elements of the list in the order they were grouped in; brickhouse collect will. Hope this helps.
