Count the number of null values per column with Pentaho

Problem description

I've got a CSV file that contains more than 60 columns and 2,000,000 lines. I'm trying to count the number of null values per variable (per column), and then sum that new row to get the total number of null values in the entire CSV. For example, if we get this file as input:

We expect this other file as output:

I know how to count the number of null values per line, but I haven't figured out how to count the number of null values per column.

Recommended answer

There has to be a better way to do this, but I wrote a really nasty piece of JavaScript that does the job.

It has some problems with different column types, because it doesn't set the column type of the output. (It should set all columns to integer, but I don't know whether that is possible from JavaScript.)

You have to run Identify last row in a stream first and save its result to a column named last (or change the script accordingly).

// The script runs once per row. These declarations have no initializer,
// so the script relies on values set on earlier rows being kept.
var nulls;
var seen;

if (!seen) {
    // First row only: initialize one counter per input column
    seen = 1;
    nulls = [];
    for (var i = 0; i < getInputRowMeta().size(); i++) {
        nulls[i] = 0;
    }
}

for (var i = 0; i < getInputRowMeta().size(); i++) {
    if (row[i] == null) {
        nulls[i] += 1;
    }
    // Hack to find empty strings (type code 2 should be String in Kettle)
    else if (getInputRowMeta().getValueMeta(i).getType() == 2 && row[i].length() == 0) {
        nulls[i] += 1;
    }
}

// Don't pass the input rows on to the output
trans_Status = SKIP_TRANSFORMATION;

// Only emit the per-column null counts at the last row
if (last == true) {
    putRow(nulls);
}
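
The question also asks for the total number of nulls across the whole file, which the script above does not compute. The following is a minimal standalone sketch of the same counting logic in plain JavaScript, outside Pentaho, with the total added. The function names countNullsPerColumn and sumCounts and the sample rows are hypothetical and only for illustration; the sketch assumes each row is an array of cell values and treats empty strings as null, as the script above does.

```javascript
// Standalone sketch, not Pentaho code: rows are plain arrays of cell values.
// countNullsPerColumn and sumCounts are hypothetical helper names.
function countNullsPerColumn(rows, columnCount) {
    // One counter per column, all starting at zero
    var nulls = [];
    for (var i = 0; i < columnCount; i++) {
        nulls[i] = 0;
    }
    for (var r = 0; r < rows.length; r++) {
        for (var c = 0; c < columnCount; c++) {
            var value = rows[r][c];
            // Count null, missing, and empty-string cells
            if (value === null || value === undefined || value === "") {
                nulls[c] += 1;
            }
        }
    }
    return nulls;
}

function sumCounts(counts) {
    // Add up the per-column counts to get the total for the whole file
    var total = 0;
    for (var i = 0; i < counts.length; i++) {
        total += counts[i];
    }
    return total;
}

// Hypothetical sample data: 3 rows, 3 columns
var sampleRows = [
    ["a", null, 3],
    ["", "b", null],
    ["c", null, 5]
];
var perColumn = countNullsPerColumn(sampleRows, 3);
var total = sumCounts(perColumn);
console.log(perColumn, total);
```

Inside Pentaho, the same sum could presumably be taken over the nulls array just before putRow is called on the last row, but that is an untested assumption.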
