Remove Outliers in Pandas DataFrame using Percentiles


Question

I have a DataFrame df with 40 columns and many records.

df:

User_id | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 |...| Col39

For each column except the User_id column, I want to check for outliers and remove the whole record if an outlier appears.

For outlier detection in each column, I decided to simply use the 5th and 95th percentiles (I know it's not the best statistical approach):

The code I have so far:

import numpy as np
# P[0] is the 5th percentile of Col1, P[1] the 95th
P = np.percentile(df.Col1, [5, 95])
new_df = df[(df.Col1 > P[0]) & (df.Col1 < P[1])]

Question: How can I apply this approach to all columns (except User_id) without doing it by hand? My goal is to get a DataFrame without the records that had outliers.

Thanks!

Recommended answer

The initial dataset.

print(df.head())

   Col0  Col1  Col2  Col3  Col4  User_id
0    49    31    93    53    39       44
1    69    13    84    58    24       47
2    41    71     2    43    58       64
3    35    56    69    55    36       67
4    64    24    12    18    99       67

First, exclude the User_id column.

filt_df = df.loc[:, df.columns != 'User_id']

Then compute the percentiles.

low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)

       Col0   Col1  Col2   Col3   Col4
0.05   2.00   3.00   6.9   3.95   4.00
0.95  95.05  89.05  93.0  94.00  97.05
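The filtering step that follows looks thresholds up with quant_df.loc[level, column]. That works because quantile called with a list returns a DataFrame whose index holds the quantile levels and whose columns are the original column names. A minimal sketch on a small made-up frame (the name toy and its values are illustrative assumptions, not the dataset above):

```python
import pandas as pd

# Hypothetical toy data, only to show the shape of the quantile result
toy = pd.DataFrame({"a": [0, 1, 2, 3, 4], "b": [10, 20, 30, 40, 50]})

# Rows are the requested quantile levels, columns are the original columns
q = toy.quantile([0.25, 0.75])

# A single threshold is addressed by (level, column name)
low_a = q.loc[0.25, "a"]
high_b = q.loc[0.75, "b"]
print(q)
```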

Next, filter the values based on the computed percentiles. To do that, I use apply column by column, and that's it!

filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                    (x < quant_df.loc[high, x.name])], axis=0)
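It may help to see why this yields NaN values rather than columns of different lengths: each column is filtered down to a shorter Series, and when apply reassembles them into a DataFrame the pieces are aligned on the original row index, so a removed value shows up as NaN. A small sketch with made-up fixed bounds (toy, bounds and kept are hypothetical names used only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 50, 100], "b": [10, 20, 30]})

# Made-up (lower, upper) limits per column, standing in for the quantiles
bounds = {"a": (0, 60), "b": (15, 100)}

# Each column keeps only the values strictly inside its limits;
# the shortened columns are realigned on the row index
kept = toy.apply(
    lambda s: s[(s > bounds[s.name][0]) & (s < bounds[s.name][1])], axis=0
)
print(kept)

# Only rows where every column survived remain after dropna
print(kept.dropna())
```

Here row 0 loses its b value and row 2 loses its a value, so only row 1 remains after dropna.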

Bring User_id back.

filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)

Finally, rows with NaN values can simply be dropped like this.

filt_df.dropna(inplace=True)
print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
1       47    69    13    84    58    24
3       67    35    56    69    55    36
5        9    95    79    44    45    69
6       83    69    41    66    87     6
9       87    50    54    39    53    40

Checking the result (filt_df as it looks before the dropna call)

print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
0       44    49    31   NaN    53    39
1       47    69    13    84    58    24
2       64    41    71   NaN    43    58
3       67    35    56    69    55    36
4       67    64    24    12    18   NaN

print(filt_df.describe())

          User_id       Col0       Col1       Col2       Col3       Col4
count  100.000000  89.000000  88.000000  88.000000  89.000000  89.000000
mean    48.230000  49.573034  45.659091  52.727273  47.460674  57.157303
std     28.372292  25.672274  23.537149  26.509477  25.823728  26.231876
min      0.000000   3.000000   5.000000   7.000000   4.000000   5.000000
25%     23.000000  29.000000  29.000000  29.500000  24.000000  36.000000
50%     47.000000  50.000000  40.500000  52.500000  49.000000  59.000000
75%     74.250000  69.000000  67.000000  75.000000  70.000000  79.000000
max     99.000000  95.000000  89.000000  92.000000  91.000000  97.000000
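The same filter can also be written without the intermediate NaN values, as a boolean mask over whole rows. This is a sketch of an alternative, not the answer's method: the function name drop_percentile_outliers and its parameters are hypothetical, and it assumes every column other than the id column is numeric.

```python
import numpy as np
import pandas as pd

def drop_percentile_outliers(df, id_col="User_id", low=0.05, high=0.95):
    """Keep rows whose every non-id value lies strictly between
    that column's low and high quantiles (hypothetical helper)."""
    values = df.drop(columns=[id_col])
    bounds = values.quantile([low, high])
    # axis=1 matches the bounds to the columns by name
    inside = values.gt(bounds.loc[low], axis=1) & values.lt(bounds.loc[high], axis=1)
    # A row survives only if all of its columns are inside the bounds
    return df[inside.all(axis=1)]

# Illustrative random data shaped like the example in the answer
rng = np.random.default_rng(0)
demo = pd.DataFrame({"User_id": rng.integers(0, 100, size=100)})
for i in range(5):
    demo["Col" + str(i)] = rng.integers(0, 100, size=100)

clean = drop_percentile_outliers(demo)
print(len(demo), len(clean))
```

Because no NaN placeholders are introduced, the surviving columns keep their original integer dtype and the column order of the input is preserved.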

How to generate the test dataset

import numpy as np
from pandas import DataFrame

np.random.seed(0)  # fixed seed so the sample is reproducible
nb_sample = 100
num_sample = (0, 100)  # randint draws integers from [0, 100)

d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)

df = DataFrame.from_dict(d)
