Calculating percentage in a Pig query

Question

  • I have a table with two columns (col1: string, col2: boolean)
  • Say col1 = "aaa"
  • For col1 = "aaa", there are many True/False values in col2
  • I want to calculate the percentage of True values for col1 (aaa)

Input:

aaa T
aaa F
aaa F
bbb T
bbb T
ccc F
ccc F

Output:

COL1   TOTAL_ROWS_IN_INPUT_TABLE   PERCENTAGE_TRUE_IN_INPUT_TABLE
aaa     3                          33%
bbb     2                          100%
ccc     2                          0%

How would I do this using Pig (Pig Latin)?

Recommended answer

In Pig 0.10, SUM(INPUT.col2) does not work, and casting to boolean is not possible, because Pig treats INPUT.col2 as a bag of booleans and a bag is not a primitive type. In addition, if col2 is declared as boolean in the load schema, a dump of the input shows no values for col2, whereas declaring it as chararray works fine.

Pig is well suited to this kind of task because operators nested inside a FOREACH can work on each group individually. The following solution works:

inpt = load '....' as (col1 : chararray, col2 : chararray);
grp = group inpt by col1; -- creates bags for each value in col1
result = foreach grp {
    total = COUNT(inpt);
    t = filter inpt by col2 == 'T'; --create a bag which contains only T values
    generate flatten(group) as col1,
             total as TOTAL_ROWS_IN_INPUT_TABLE,
             100*(double)COUNT(t)/(double)total as PERCENTAGE_TRUE_IN_INPUT_TABLE;
};

dump result;

Output:

(aaa,3,33.333333333333336)
(bbb,2,100.0)
(ccc,2,0.0)
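
To check the arithmetic outside Pig, the same group / filter / count steps can be mirrored in a few lines of Python. This is only an illustrative sketch, not part of the original answer: the names `percentage_true` and `as_percent` are hypothetical, the rows are the sample input from the question, and the rounding to a whole percent is an assumption based on the "33%" style shown in the requested output.

```python
from collections import defaultdict

# Sample rows copied from the question's input (col1, col2).
rows = [
    ("aaa", "T"), ("aaa", "F"), ("aaa", "F"),
    ("bbb", "T"), ("bbb", "T"),
    ("ccc", "F"), ("ccc", "F"),
]


def percentage_true(rows):
    """Return (col1, total_rows, percentage of 'T' values) per col1 value."""
    groups = defaultdict(list)  # plays the role of: group inpt by col1
    for col1, col2 in rows:
        groups[col1].append(col2)

    result = []
    for col1 in sorted(groups):
        bag = groups[col1]
        total = len(bag)                         # COUNT(inpt)
        trues = sum(1 for v in bag if v == "T")  # COUNT of the filtered bag
        result.append((col1, total, 100 * trues / total))
    return result


def as_percent(value):
    """Hypothetical helper: format a percentage as a whole number, e.g. '33%'."""
    return f"{value:.0f}%"


result = percentage_true(rows)
for col1, total, pct in result:
    print(col1, total, as_percent(pct))
```

Each group is counted once for its total and once for its 'T' values, which is the same pair of counts the nested FOREACH computes with COUNT(inpt) and COUNT(t).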
