How to find the pathing flow and rank them using Pig or Hive?


Problem Description



Below is the example for my use case.

Solution

You can refer to this question, where an OP was asking something similar. If I am understanding your problem correctly, you want to remove duplicates from the path, but only when they occur next to each other. So 1 -> 1 -> 2 -> 1 would become 1 -> 2 -> 1. If this is correct, then you can't just group and distinct (as I'm sure you have noticed), because that would remove all duplicates. An easy solution is to write a UDF that removes those duplicates while preserving the user's distinct path.

UDF:

package something;

import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class RemoveSequentialDuplicatesUDF extends UDF {
    public ArrayList<Text> evaluate(ArrayList<Text> arr) {
        ArrayList<Text> newList = new ArrayList<Text>();
        // Guard against a null or empty array so arr.get(0) can't throw
        if (arr == null || arr.isEmpty()) {
            return newList;
        }
        newList.add(arr.get(0));
        // Keep an element only when it differs from the one immediately
        // before it, so 1 -> 1 -> 2 -> 1 becomes 1 -> 2 -> 1
        for (int i = 1; i < arr.size(); i++) {

            String front = arr.get(i).toString();
            String back  = arr.get(i-1).toString();

            if (!back.equals(front)) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }
}
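The UDF above depends on Hadoop's Text class, so it only runs on a Hive classpath. As a quick sanity check of the same adjacent-duplicate logic, here is a standalone sketch using plain Strings (the class and method names are illustrative, not part of the original answer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DedupSketch {
    // Same logic as the UDF: keep an element only when it differs
    // from the element immediately before it.
    public static List<String> removeSequentialDuplicates(List<String> arr) {
        List<String> newList = new ArrayList<>();
        if (arr == null || arr.isEmpty()) {
            return newList; // guard against empty sessions
        }
        newList.add(arr.get(0));
        for (int i = 1; i < arr.size(); i++) {
            if (!arr.get(i).equals(arr.get(i - 1))) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }

    public static void main(String[] args) {
        // 1 -> 1 -> 2 -> 1 becomes 1 -> 2 -> 1
        System.out.println(removeSequentialDuplicates(Arrays.asList("1", "1", "2", "1")));
        // prints [1, 2, 1]
    }
}
```

Note that only adjacent repeats are collapsed; the trailing "1" survives because it is not next to the leading "1".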

To build this jar you will need hive-core.jar and hadoop-core.jar, which you can find in the Maven Repository. Make sure you get the versions of Hive and Hadoop that you are using in your environment. Also, if you plan to run this in a production environment, I'd suggest adding some exception handling to the UDF. After the jar is built, import it and run this query:

Query:

add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/jars/hive_common-SNAPSHOT.jar;
create temporary function collect as "brickhouse.udf.collect.CollectUDAF";
create temporary function remove_dups as "something.RemoveSequentialDuplicatesUDF";

select screen_flow, count
  , dense_rank() over (order by count desc) rank
from (
  select screen_flow
    , count(*) count
  from (
    select session_id
      , concat_ws("->", remove_dups(screen_array)) screen_flow
    from (
      select session_id
        , collect(screen_name) screen_array
      from (
        select *
        from database.table
        order by screen_launch_time ) a
      group by session_id ) b
    ) c
  group by screen_flow ) d

Output:

s1->s2->s3      2       1
s1->s2          1       2
s1->s2->s3->s1  1       2
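The ranking step can also be checked outside Hive. This sketch (class and method names are illustrative) tallies the flows and applies the same rule as dense_rank(): flows with equal counts share a rank, and the next distinct count gets the next consecutive rank rather than skipping one:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RankSketch {
    // Count each flow, sort by count descending, then assign dense ranks:
    // ties share a rank and the rank sequence has no gaps.
    public static List<String> rankFlows(List<String> flows) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String f : flows) {
            counts.merge(f, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort((a, b) -> b.getValue() - a.getValue());

        List<String> out = new ArrayList<>();
        int rank = 0;
        Integer prev = null;
        for (Map.Entry<String, Integer> row : rows) {
            if (!row.getValue().equals(prev)) {
                rank++;               // dense_rank: increments only on a new count
                prev = row.getValue();
            }
            out.add(row.getKey() + "\t" + row.getValue() + "\t" + rank);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> flows = List.of(
            "s1->s2->s3", "s1->s2->s3", "s1->s2", "s1->s2->s3->s1");
        rankFlows(flows).forEach(System.out::println);
    }
}
```

With the sample flows above, the two count-1 rows both get rank 2, matching the query output shown earlier.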

Hope this helps.

