Winsorizing data by column in pandas with NaN

Question

I'd like to winsorize several columns of data in a pandas DataFrame. Each column has some NaN values, which affect the winsorization, so they need to be excluded. The only way I know how to do this is to remove them across all of the data (dropping every row that contains a NaN), rather than column by column.

MWE:

import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Create Dataframe
N, M, P = 10**5, 4, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
df.columns = ['one','two','three','four']
# Now scale them differently so you can see the winsorization
df['four'] = df['four']*(10**5)
df['three'] = df['three']*(10**2)
df['two'] = df['two']*(10**-1)
df['one'] = df['one']*(10**-4)
# Create NaN
df.loc[df.index.get_level_values(0).year == 2002,'three'] = np.nan
df.loc[df.index.get_level_values(0).month == 2,'two'] = np.nan
df.loc[df.index.get_level_values(0).month == 1,'one'] = np.nan

Here is the baseline distribution:

df.quantile([0, 0.01, 0.5, 0.99, 1])

Output:

               one           two      three          four
0.00  2.336618e-10  2.294259e-07   0.002437      2.305353
0.01  9.862626e-07  9.742568e-04   0.975807   1003.814520
0.50  4.975859e-05  4.981049e-02  50.290946  50374.548980
0.99  9.897463e-05  9.898590e-02  98.978263  98991.438985
1.00  9.999983e-05  9.999966e-02  99.996793  99999.437779

This is how I am winsorizing:

def using_mstats(s):
    return winsorize(s, limits=[0.01, 0.01])

wins = df.apply(using_mstats, axis=0)
wins.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Which gives this:

Out[356]:
           one       two      three          four
0.00  0.000001  0.001060   1.536882   1003.820149
0.01  0.000001  0.001060   1.536882   1003.820149
0.25  0.000025  0.024975  25.200378  25099.994780
0.50  0.000050  0.049810  50.290946  50374.548980
0.75  0.000075  0.074842  74.794537  75217.343920
0.99  0.000099  0.098986  98.978263  98991.436957
1.00  0.000100  0.100000  99.996793  98991.436957

Column four is correct because it has no NaN, but the other columns are incorrect: after winsorizing, the 99th percentile and the maximum should be the same. The observation counts are identical for df and wins:

In [357]: df.count()
Out[357]:
one       90700
two       91600
three     63500
four     100000
dtype: int64

In [358]: wins.count()
Out[358]:
one       90700
two       91600
three     63500
four     100000
dtype: int64

This is how I can 'solve' it, but at the cost of losing a lot of my data:

wins2 = df.loc[df.notnull().all(axis=1)].apply(using_mstats, axis=0)
wins2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Output:

Out[360]:
               one       two      three          four
0.00  9.686203e-07  0.000928   0.965702   1005.209503
0.01  9.686203e-07  0.000928   0.965702   1005.209503
0.25  2.486052e-05  0.024829  25.204032  25210.837443
0.50  4.980946e-05  0.049894  50.299004  50622.227179
0.75  7.492750e-05  0.075059  74.837900  75299.906415
0.99  9.895563e-05  0.099014  98.972310  99014.311761
1.00  9.895563e-05  0.099014  98.972310  99014.311761

In [361]: wins2.count()
Out[361]:
one      51700
two      51700
three    51700
four     51700
dtype: int64
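
To see how much the row-wise filter throws away, it can help to compare the number of fully observed rows with the per-column counts. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative, not taken from the data above).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): each column is missing in a different row.
toy = pd.DataFrame({
    "one":   [1.0, np.nan, 3.0, 4.0],
    "two":   [np.nan, 2.0, 3.0, 4.0],
    "three": [1.0, 2.0, 3.0, 4.0],
})

complete_rows = toy.notnull().all(axis=1)        # True only where no column is NaN
rows_kept = int(complete_rows.sum())             # rows that survive the row-wise filter
per_column_counts = toy.count()                  # non-NaN observations in each column
lost_per_column = per_column_counts - rows_kept  # valid values discarded per column

print(rows_kept)
print(lost_per_column)
```

Every positive entry in `lost_per_column` is a valid observation that the row-wise filter discards only because some other column is missing in the same row.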

How can I winsorize the non-NaN data column by column while maintaining the shape of the data (i.e. without removing rows)?

Recommended answer

As often happens, simply creating the MWE helped clarify things. I need to use clip() in combination with quantile(), as below:

df2 = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)
df2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Output:

               one       two      three          four
0.00  9.862626e-07  0.000974   0.975807   1003.814520
0.01  9.862666e-07  0.000974   0.975816   1003.820092
0.25  2.485043e-05  0.024975  25.200378  25099.994780
0.50  4.975859e-05  0.049810  50.290946  50374.548980
0.75  7.486737e-05  0.074842  74.794537  75217.343920
0.99  9.897462e-05  0.098986  98.978245  98991.436977
1.00  9.897463e-05  0.098986  98.978263  98991.438985

In [384]: df2.count()
Out[384]:
one       90700
two       91600
three     63500
four     100000
dtype: int64
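
The clip-and-quantile step can be wrapped in a small reusable helper. This is a minimal sketch: the function name `winsorize_columns`, its default cut-offs, and the toy data are assumptions made for illustration. It relies on `DataFrame.quantile` skipping NaN within each column and on `DataFrame.clip` leaving NaN entries untouched.

```python
import numpy as np
import pandas as pd

def winsorize_columns(frame, lower=0.01, upper=0.99):
    """Clip each column to its own [lower, upper] quantiles; NaN stays NaN."""
    lo = frame.quantile(lower)  # one threshold per column, NaN skipped
    hi = frame.quantile(upper)
    # axis=1 aligns the threshold Series with the column labels
    return frame.clip(lower=lo, upper=hi, axis=1)

# Toy data (hypothetical): column "b" is missing its first five values.
toy = pd.DataFrame({
    "a": np.arange(101, dtype=float),
    "b": np.arange(101, dtype=float) * 10,
})
toy.loc[toy.index[:5], "b"] = np.nan

wins_toy = winsorize_columns(toy)
print(wins_toy.quantile([0, 0.01, 0.99, 1]))
print(wins_toy.count())
```

Because the thresholds are interpolated quantiles, the clipped values need not coincide with observed data points. This is one way the approach differs from the rank-based `winsorize` in scipy, and it is likely why the 0.99 quantile and the maximum in the output above agree only approximately.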

The numbers differ from those above because I have kept all of the non-missing (non-NaN) data in each column.
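
If the rank-based behaviour of scipy's `winsorize` is preferred over quantile clipping, one possible alternative is to apply it only to the non-missing entries of each column and write the result back in place. The sketch below is an assumption-laden illustration, not part of the original answer: the helper name `winsorize_non_nan` and the toy series are made up.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

def winsorize_non_nan(s, limits=(0.01, 0.01)):
    """Winsorize only the non-NaN values of a Series, keeping its shape."""
    out = s.copy()
    valid = s.notna()
    # winsorize sees a NaN-free array, so its tail counts use valid data only
    clipped = winsorize(s[valid].to_numpy(), limits=limits)
    out.loc[valid] = np.ma.getdata(clipped)
    return out

# Toy series (hypothetical): ten valid values and two NaN.
s = pd.Series([-50.0, 1, 2, np.nan, 3, 4, 5, 6, np.nan, 7, 8, 100.0])
result = winsorize_non_nan(s, limits=(0.1, 0.1))
print(result.tolist())
```

For a whole frame, something like `df.apply(winsorize_non_nan, axis=0)` should apply the same logic column by column while keeping every row.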
