Remove Outliers in Pandas DataFrame using Percentiles
Question
I have a DataFrame df with 40 columns and many records.
df:
User_id | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 |...| Col39
For each column except User_id, I want to check for outliers and remove the whole record if an outlier appears.
To detect outliers, I decided to simply use the 5th and 95th percentiles (I know it's not the best statistical method):
The code I have so far:
P = np.percentile(df.Col1, [5, 95])
new_df = df[(df.Col1 > P[0]) & (df.Col1 < P[1])]
Question: How can I apply this approach to all columns (except User_id) without doing this by hand? My goal is to get a DataFrame without the records that had outliers.
Thanks!
Recommended answer
The initial dataset:
print(df.head())
Col0 Col1 Col2 Col3 Col4 User_id
0 49 31 93 53 39 44
1 69 13 84 58 24 47
2 41 71 2 43 58 64
3 35 56 69 55 36 67
4 64 24 12 18 99 67
First, remove the User_id column.
filt_df = df.loc[:, df.columns != 'User_id']
Then, compute the percentiles.
low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)
Col0 Col1 Col2 Col3 Col4
0.05 2.00 3.00 6.9 3.95 4.00
0.95 95.05 89.05 93.0 94.00 97.05
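The question used np.percentile while this answer uses quantile. As a small sketch (the sample values below are made up for illustration), the two should agree with their default settings, since both interpolate linearly; only the scale of the argument differs (0-100 versus 0-1).

```python
import numpy as np
import pandas as pd

# Illustrative values only; they are not taken from the dataset above.
values = pd.Series([3, 8, 15, 22, 40, 41, 57, 63, 78, 96])

p = np.percentile(values, [5, 95])   # percentages on a 0-100 scale
q = values.quantile([0.05, 0.95])    # fractions on a 0-1 scale

# Both default to linear interpolation, so the bounds should match.
same = np.allclose(p, q.to_numpy())
print(same)
```

Note that quantile returns its result labelled by the requested fractions, which is why the later steps can look bounds up with .loc[low] and .loc[high].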
Next, filter the values based on the computed percentiles. To do that, I use apply column by column, and that's it! Values outside the bounds end up as NaN once the columns are put back together.
filt_df = filt_df.apply(lambda x: x[(x>quant_df.loc[low,x.name]) &
(x < quant_df.loc[high,x.name])], axis=0)
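A possible alternative to the apply call, sketched here on a small made-up frame (the names demo, q, inside and masked are illustrative): comparing a DataFrame with a Series lines the Series labels up with the column names, so the per-column bounds can be applied in one expression, and where then blanks out everything outside them.

```python
import numpy as np
import pandas as pd

# Illustrative data, not the dataset from the answer.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 100, size=(100, 3)),
                    columns=["Col0", "Col1", "Col2"])

low, high = 0.05, 0.95
q = demo.quantile([low, high])

# q.loc[low] and q.loc[high] are Series indexed by column name,
# so each column is compared against its own bounds.
inside = (demo > q.loc[low]) & (demo < q.loc[high])

# Keep values strictly inside the bounds; everything else becomes NaN.
masked = demo.where(inside)
print(masked.head())
```

Unlike the apply version, this keeps every original row label even if a row is an outlier in all columns, which may or may not matter depending on what comes next.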
Bring the User_id column back.
filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)
Last, rows with NaN values can be dropped simply like this.
filt_df.dropna(inplace=True)
print(filt_df.head())
User_id Col0 Col1 Col2 Col3 Col4
1 47 69 13 84 58 24
3 67 35 56 69 55 36
5 9 95 79 44 45 69
6 83 69 41 66 87 6
9 87 50 54 39 53 40
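Checking the result (the output below is from before the dropna call, so the NaN values are still visible):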
print(filt_df.head())
User_id Col0 Col1 Col2 Col3 Col4
0 44 49 31 NaN 53 39
1 47 69 13 84 58 24
2 64 41 71 NaN 43 58
3 67 35 56 69 55 36
4 67 64 24 12 18 NaN
print(filt_df.describe())
User_id Col0 Col1 Col2 Col3 Col4
count 100.000000 89.000000 88.000000 88.000000 89.000000 89.000000
mean 48.230000 49.573034 45.659091 52.727273 47.460674 57.157303
std 28.372292 25.672274 23.537149 26.509477 25.823728 26.231876
min 0.000000 3.000000 5.000000 7.000000 4.000000 5.000000
25% 23.000000 29.000000 29.000000 29.500000 24.000000 36.000000
50% 47.000000 50.000000 40.500000 52.500000 49.000000 59.000000
75% 74.250000 69.000000 67.000000 75.000000 70.000000 79.000000
max 99.000000 95.000000 89.000000 92.000000 91.000000 97.000000
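The steps above could also be folded into a single helper that never creates NaN values, so the integer dtypes are kept. This is only a sketch under assumptions: drop_percentile_outliers and its parameters are hypothetical names, not part of pandas, and the sample data is generated just for the demonstration.

```python
import numpy as np
import pandas as pd

def drop_percentile_outliers(frame, id_col="User_id", low=0.05, high=0.95):
    """Keep rows whose non-id values all lie strictly between the two quantiles."""
    values = frame.drop(columns=id_col)
    bounds = values.quantile([low, high])
    # True where a value is strictly inside its own column's bounds.
    inside = (values > bounds.loc[low]) & (values < bounds.loc[high])
    # A row survives only if every checked column is inside.
    return frame[inside.all(axis=1)]

# Illustrative data with the same layout as in the answer.
rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.integers(0, 100, size=(100, 6)),
                      columns=["User_id"] + [f"Col{i}" for i in range(5)])

cleaned = drop_percentile_outliers(sample)
print(len(sample), len(cleaned))
```

Because the comparisons are strict, values exactly equal to a bound are treated as outliers, the same as in the original snippet from the question.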
How to generate the test dataset
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the data is reproducible
nb_sample = 100
num_sample = (0, 100)  # value range: low inclusive, high exclusive
d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
df = pd.DataFrame.from_dict(d)