Removing outliers in each column (and corresponding row)


Question

My NumPy array contains 10 columns and around 2 million rows.

Now I need to analyze each column separately, find the values that are outliers, and delete the entire corresponding row from the array.

So I'd start by analyzing column 0, find outliers at rows 10, 20 and 100, and remove those rows. Next I'd analyze column 1 in the now-trimmed array and apply the same process.

Of course I can think of a plain manual process for this (iterate through each column, find the indices of the outliers, delete those rows, move on to the next column), but I've always found that NumPy has some quick, nifty tricks for statistical tasks like these.

And if you could elaborate a bit on the runtime cost of the method, even better.

I'm not restricted to the NumPy library here; if SciPy has something helpful, I have no issue using it.

Thanks!

Recommended answer

Two very straightforward approaches, the second with a little more sophistication:

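import numpy as np

# Test data: 2 million rows, 10 columns of standard normal values.
# randn needs integer dimensions, so 2e6 (a float) is written as 2_000_000.
arr = np.random.randn(2_000_000, 10)

def remove_outliers(arr, k):
    # Per-column mean and sample standard deviation over the full array.
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    # Keep a row only if every column is within k standard deviations.
    return arr[np.all(np.abs((arr - mu) / sigma) < k, axis=1)]

def remove_outliers_bis(arr, k):
    # Builtin bool: the np.bool alias is unavailable in some NumPy releases.
    mask = np.ones((arr.shape[0],), dtype=bool)
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    for j in range(arr.shape[1]):
        col = arr[:, j]
        # Test only rows that survived earlier columns, then narrow the mask.
        mask[mask] &= np.abs((col[mask] - mu[j]) / sigma[j]) < k
    return arr[mask]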

Performance depends on how many outliers you have:

In [38]: %timeit remove_outliers(arr, 1)
1 loops, best of 3: 1.13 s per loop

In [39]: %timeit remove_outliers_bis(arr, 1)
1 loops, best of 3: 983 ms per loop

In [40]: %timeit remove_outliers(arr, 2)
1 loops, best of 3: 1.21 s per loop

In [41]: %timeit remove_outliers_bis(arr, 2)
1 loops, best of 3: 1.51 s per loop
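One plausible reading of these timings is that the masked loop only pays off when each column discards a large share of the remaining rows. The sketch below estimates the share of rows that survive all 10 columns, assuming independent standard normal columns as in the test array; `expected_fraction_kept` is a hypothetical helper written for this illustration, not part of the answer's code.

```python
import math

def expected_fraction_kept(k, n_cols=10):
    """Expected share of rows whose n_cols values all lie inside (-k, k),
    assuming independent standard normal columns (an assumption of this sketch)."""
    p_single = math.erf(k / math.sqrt(2))  # P(|z| < k) for a single column
    return p_single ** n_cols

for k in (1, 2):
    print(f"k={k}: about {expected_fraction_kept(k):.1%} of rows kept")
```

Under that assumption only about 2% of rows survive with k = 1, so the loop works on a rapidly shrinking subset, while about 63% survive with k = 2, where the extra indexing seems to cost more than it saves. Either way, both functions touch each element a constant number of times, so the cost grows linearly with rows times columns.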

And of course, both functions return the same rows:

In [42]: np.allclose(remove_outliers(arr, 1), remove_outliers_bis(arr, 1))
Out[42]: True

In [43]: np.allclose(remove_outliers(arr, 2), remove_outliers_bis(arr, 2))
Out[43]: True

I would say that the complication of the second method does not justify its potential speed-up, but YMMV...
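Note that both functions compute the mean and standard deviation once, from the full array. If the process in the question is meant literally, with each column analyzed on the already-trimmed array, the statistics have to be recomputed after every trim. A minimal sketch of that variant follows; the function name `remove_outliers_sequential` and the demo data are made up for illustration.

```python
import numpy as np

def remove_outliers_sequential(arr, k):
    """Trim rows column by column, recomputing mean/std on the rows still kept."""
    kept = arr
    for j in range(arr.shape[1]):
        col = kept[:, j]
        mu, sigma = col.mean(), col.std(ddof=1)
        # Comparing against k * sigma avoids dividing by a zero sigma.
        kept = kept[np.abs(col - mu) < k * sigma]
    return kept

# Small demo: random data with two planted outliers (hypothetical values).
rng = np.random.default_rng(0)
demo = rng.standard_normal((1000, 3))
demo[5, 0] = 50.0
demo[7, 2] = -50.0
trimmed = remove_outliers_sequential(demo, 4)
print(demo.shape, "->", trimmed.shape)
```

Each boolean-indexing step copies the surviving rows, so this variant trades some memory traffic for statistics that are no longer skewed by outliers already removed in earlier columns.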
