How to find the pathing flow and rank them using Pig or Hive?

Problem description

Below is the example for my use case.

Recommended answer

You can reference this question, where the OP was asking something similar. If I understand your problem correctly, you want to remove duplicates from the path, but only when they occur next to each other, so 1 -> 1 -> 2 -> 1 would become 1 -> 2 -> 1. If that is correct, then you can't just group and apply distinct (as I'm sure you have noticed), because that removes all duplicates, not only the adjacent ones. An easy solution is to write a UDF that removes the adjacent duplicates while preserving the distinct path of the user.

UDF:

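package something;

import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hive UDF that collapses runs of identical adjacent elements,
// e.g. [1, 1, 2, 1] becomes [1, 2, 1].
public class RemoveSequentialDuplicatesUDF extends UDF {
    public ArrayList<Text> evaluate(ArrayList<Text> arr) {
        // Pass null or empty input through; arr.get(0) would otherwise throw.
        if (arr == null || arr.isEmpty()) {
            return arr;
        }
        ArrayList<Text> newList = new ArrayList<Text>();
        newList.add(arr.get(0));
        for (int i = 1; i < arr.size(); i++) {
            String front = arr.get(i).toString();
            String back  = arr.get(i-1).toString();

            // Keep the element only if it differs from its predecessor.
            if (!back.equals(front)) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }
}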
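
To sanity-check the logic without a Hive cluster, the same adjacent-duplicate collapse can be sketched on plain strings. This is only an illustration: the class name AdjacentDedup and the method collapse are hypothetical and not part of the original answer, and it uses String instead of Hadoop's Text so that it runs with no extra jars.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical helper for illustration; not part of the original UDF.
class AdjacentDedup {
    // Drops an element only when it equals the element kept just before it.
    static List<String> collapse(List<String> path) {
        List<String> out = new ArrayList<>();
        if (path == null) {
            return out;
        }
        for (String step : path) {
            if (out.isEmpty() || !Objects.equals(out.get(out.size() - 1), step)) {
                out.add(step);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> path = Arrays.asList("1", "1", "2", "1");
        // Joins the collapsed path for display.
        System.out.println(String.join(" -> ", collapse(path)));
    }
}
```

Running main on the path from the example above prints 1 -> 2 -> 1: the trailing 1 survives because it is not next to the first one.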

To build this jar you will need hive-core.jar and hadoop-core.jar on the classpath; both are available in the Maven Repository. (In current releases, the UDF base class ships in the hive-exec artifact and Text in hadoop-common.) Make sure you pick the versions of Hive and Hadoop that match your environment. Also, if you plan to run this in a production environment, I'd suggest adding some exception handling to the UDF. After the jar is built, add it to your Hive session and run this query:

Query:

add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/jars/hive_common-SNAPSHOT.jar;
create temporary function collect as "brickhouse.udf.collect.CollectUDAF";
create temporary function remove_dups as "something.RemoveSequentialDuplicatesUDF";

select screen_flow, count
  , dense_rank() over (order by count desc) rank
from (
  select screen_flow
    , count(*) count
  from (
    select session_id
      , concat_ws("->", remove_dups(screen_array)) screen_flow
    from (
      select session_id
        , collect(screen_name) screen_array
      from (
        select *
        from database.table
        order by screen_launch_time ) a
      group by session_id ) b
    ) c
  group by screen_flow ) d

Output:

s1->s2->s3      2       1
s1->s2          1       2
s1->s2->s3->s1  1       2

Hope this helps.