Removing outliers from pandas data frame using percentile


Question

I am following this link to remove outliers, but something is logically wrong here:

Remove Outliers in Pandas DataFrame using Percentiles

I have a dataset whose first column is "id" and whose last column is "label".

Here is my piece of code: I remove the label and id columns, filter the rest, and then concat them back:

def processing_data(train_data, test_data):
    # Compute the percentile bounds.
    low = .05
    high = .95
    filt_df = train_data.loc[:, train_data.columns != 'id']
    filt_df = filt_df.loc[:, filt_df.columns != 'label']
    quant_df = filt_df.quantile([low, high])
    print(quant_df)

    # Filter values based on the computed percentiles, using a column-wise apply.
    print("Before removing outliers", filt_df, filt_df.shape)
    train_data1 = filt_df.apply(lambda x: x[(x >= quant_df.loc[low, x.name]) & (x <= quant_df.loc[high, x.name])], axis=0)
    print("After removing outliers", train_data1, train_data1.shape)
    print(train_data1.isnull().sum())
    train_data1 = pd.concat([train_data.loc[:, 'id'], train_data1], axis=1)
    train_data = pd.concat([train_data.loc[:, 'label'], train_data1], axis=1)
    #train_data.dropna(inplace=True)

    #train_data.fillna(0)
    #test_data.fillna(0)
    #print(train_data)
    #print(np.isnan(train_data).any().sum())
    return train_data, test_data

Output: all the rows contain some NaN values, and when I do train_data.dropna(inplace=True) all the rows are dropped. Strange!
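A toy example (values made up, not my real data) that reproduces the NaN mechanism:

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 100.0],
                   'b': [50.0, 6.0, 7.0, 8.0, 9.0]})
quant = df.quantile([.05, .95])

# Each column keeps only its in-band values; pandas re-aligns the shortened
# columns on the original index, inserting NaN where values were dropped.
out = df.apply(lambda x: x[(x >= quant.loc[.05, x.name]) &
                           (x <= quant.loc[.95, x.name])], axis=0)
print(out)           # out-of-band cells are now NaN
print(out.dropna())  # default how='any' drops every row containing a NaN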

How can I fix this? I suspect something is fishy in the way I concat the id and label columns back after the outlier treatment.

Here is the dataset:

id   feature0      feature1     feature2      feature3      feature4      feature249  label
0    25.20824887   -16.7457484  50.86994402   5.593471686   1.188262678               1
1    -86.93144987  0.428227194  2.87483597    -8.064850183  6.056867093               2
2    42.16093367   7.85701304   151.6127571   9.639675583   5.570138511               0
3    20.66694385   8.680641918  -56.44917913  -9.814779803  -2.382979151              1
4    35.9466789    4.57373573   -28.16021186  -6.91297056   4.879375409               0

Solution

When I ran your code with your example I got a ValueError. I found this issue, which mentions that quantile has erratic behavior with float DataFrame elements, either returning NaNs or raising a ValueError: https://github.com/pandas-dev/pandas/issues/14564. I think in this case it is the feature249 column, which is int while the rest are floats. When I forced all the columns to float with filt_df = pd.DataFrame(filt_df, dtype=float), it ran fine.
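For example, a minimal sketch of that workaround (the sample values here are made up):

import pandas as pd

# Mixed dtypes, as in the question: feature249 is int, the rest are float.
filt_df = pd.DataFrame({'feature0': [25.2, -86.9, 42.2],
                        'feature249': [1, 2, 3]})

# Casting everything to float avoids the erratic quantile behavior.
filt_df = pd.DataFrame(filt_df, dtype=float)
quant_df = filt_df.quantile([.05, .95])
print(quant_df)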

The NaNs in each row are what were put in place when you filtered by low and high. Each row in the example has at least one value outside your .05/.95 boundaries (your data may be much flatter than you think). This means that when you dropna with the default how='any', all rows are removed. You can change the way dropna operates by changing 'any' to 'all', or by using another option such as thresh. It is probably better to adjust your upper/lower bounds to be more in line with your data's spread. Remember that even though your bounds are fairly exclusive, with each added column it becomes more and more likely that at least one value in each row falls outside those bounds.
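Here is a sketch of those two alternatives on a small made-up frame (not your real data):

import pandas as pd

filt_df = pd.DataFrame({'f0': [1.0, 2.0, 3.0, 100.0],
                        'f1': [10.0, 20.0, 200.0, 30.0]})
low, high = .05, .95
quant_df = filt_df.quantile([low, high])

# Option 1: after the NaN-producing filter, only drop rows in which *every*
# value was filtered out, instead of the default how='any'.
filtered = filt_df.apply(
    lambda x: x[(x >= quant_df.loc[low, x.name]) & (x <= quant_df.loc[high, x.name])],
    axis=0)
print(filtered.dropna(how='all'))

# Option 2: skip the NaN round-trip entirely -- build a boolean mask that is
# True where a value lies inside its column's [low, high] band, then keep
# only the rows that are fully in-band.
in_bounds = filt_df.apply(
    lambda x: x.between(quant_df.loc[low, x.name], quant_df.loc[high, x.name]),
    axis=0)
print(filt_df[in_bounds.all(axis=1)])

Note that option 2 keeps a row only when every column is in-band, so with 250 columns it will also drop almost everything unless the bounds are widened.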
