Pig FILTER returns empty bag that I can't COUNT

Question

I'm trying to count how many values in a data set match a filter condition, but I'm running into issues when the filter matches no entries.

My data structure has a lot of columns, but only three are relevant to this example: key (the data key for the set, not unique), value (the float value as recorded), and nominal_value (a float representing the nominal value).

Our use case right now is to find the number of values that are 10% or more below the nominal value.

I'm doing something like this:

filtered_data = FILTER data BY value <= (0.9 * nominal_value);
filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE COUNT(filtered_data.value);
DUMP filtered_count;

In most cases, no values fall outside the nominal range, so filtered_data is empty (or null; I'm not sure how to tell which). As a result, filtered_count is also empty/null, which is not what I want.

How can I construct a statement that returns 0 when filtered_data is empty/null? I've tried a couple of options that I found online:

-- Extra parens in COUNT required to avoid syntax error
filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE COUNT((filtered_data.value is null ? {} : filtered_data.value));

which results in:

Two inputs of BinCond must have compatible schemas. left hand side: #1259:bag{} right hand side: #1261:bag{#1260:tuple(cf#1038:float)}

and:

filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE (filtered_data.value is null ? 0 : COUNT(filtered_data.value));

which results in empty/null output.

Recommended answer

The way you have it set up right now, you lose information about any key for which the count of bad values is 0, because FILTER drops those rows before GROUP ever sees them. Instead, I'd recommend preserving all keys, so that you get positive confirmation that the count was 0 rather than inferring it from absence. To do that, just use an indicator and then SUM it:

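-- Flag each record: 1 if value is 10% or more below nominal, else 0.
data2 =
    FOREACH data
    GENERATE
        key,
        ((value <= 0.9*nominal_value) ? 1 : 0) AS bad;
-- Every key survives the GROUP, so keys with no bad values sum to 0.
bad_count = FOREACH (GROUP data2 BY key) GENERATE group, SUM(data2.bad);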
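
If you would rather keep FILTER and COUNT, one alternative is to group first and apply the filter inside a nested FOREACH. This is only a sketch and is not part of the answer quoted here. It assumes `data` is already loaded with the fields key, value and nominal_value. The aliases `grouped`, `bad`, `bad_per_key` and `n_bad` are made up for illustration. Because the filter runs per group, a key with no matching rows still produces an output row, and COUNT over the resulting empty bag should report 0:

```pig
-- Sketch only: assumes `data` has fields key, value, nominal_value.
grouped = GROUP data BY key;

bad_per_key = FOREACH grouped {
    -- Filter inside the group; an empty result is an empty bag, not a missing key.
    bad = FILTER data BY value <= 0.9 * nominal_value;
    GENERATE group AS key, COUNT(bad) AS n_bad;
};

DUMP bad_per_key;
```

Note that COUNT skips tuples whose first field is null, so if key can be null, COUNT_STAR may be the safer choice.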
