How to remove outliers in Python?


Question

I want to remove outliers from my dataset "train", and I've decided to use either the z-score or the IQR for this.

I'm running a Jupyter notebook on Microsoft Python Client for SQL Server.

I tried this for the z-score:

import numpy as np
from scipy import stats
train[(np.abs(stats.zscore(train)) < 3).all(axis=1)]

and this for the IQR:

Q1 = train.quantile(0.02)
Q3 = train.quantile(0.98)
IQR = Q3 - Q1
train = train[~((train < (Q1 - 1.5 * IQR)) | (train > (Q3 + 1.5 * IQR))).any(axis=1)]

These return the following errors.

For the z-score:

TypeError: unsupported operand type(s) for /: 'str' and 'int'

For the IQR:

TypeError: unorderable types: str() < float()

My train dataset looks like this:

# Number of each type of column
print('Training data shape: ', train.shape)
train.dtypes.value_counts()

Training data shape:  (300000, 111)
int32      66
float64    30
object     15
dtype: int64

Any help would be appreciated.

Recommended answer

Your code fails because you're trying to calculate z-scores on categorical (non-numeric) columns.

To avoid this, first split train into its numerical and categorical features:

num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])

and only after that calculate the index of the rows to keep:

# np.abs ensures large negative z-scores are treated as outliers too
idx = np.all(np.abs(stats.zscore(num_train)) < 3, axis=1)

and finally put the two pieces back together:

train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
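The three steps above can be tried end to end on a small frame. The following is only a minimal sketch: the toy data and its column names ("x", "y", "label") are made up for illustration, and the z-score is computed directly with pandas using ddof=0, which is the default that scipy.stats.zscore uses, so that the example needs no SciPy install.

```python
import pandas as pd

# Hypothetical toy data standing in for `train`
train = pd.DataFrame({
    "x": [float(i) for i in range(19)] + [1000.0],  # last value is an obvious outlier
    "y": [float(i % 5) for i in range(20)],         # no outliers
    "label": ["a", "b"] * 10,                       # categorical (object) column
})

# 1. Split into numerical and categorical parts
num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])

# 2. Population z-score per column (ddof=0), then keep rows where every |z| < 3
z = (num_train - num_train.mean()) / num_train.std(ddof=0)
idx = (z.abs() < 3).all(axis=1)

# 3. Recombine the two parts for the surviving rows
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
print(train_cleaned.shape)  # (19, 3)
```

Because idx is aligned with the original index, train.loc[idx] selects the same rows and keeps the original column order, which may be more convenient than concatenating the two parts.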

For the IQR part:

Q1 = num_train.quantile(0.02)
Q3 = num_train.quantile(0.98)
IQR = Q3 - Q1
idx = ~((num_train < (Q1 - 1.5 * IQR)) | (num_train > (Q3 + 1.5 * IQR))).any(axis=1)
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
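The same filter can be wrapped in a small function so the percentiles are easy to vary. This is a sketch under assumptions: iqr_mask and its parameters are hypothetical names, and the toy data is invented. It compares the conventional 25th/75th percentiles with the 2nd/98th percentiles used in the snippet above.

```python
import pandas as pd

def iqr_mask(df, lower_q=0.25, upper_q=0.75, k=1.5):
    """Return a boolean Series: True for rows with no value outside the fences."""
    q_low = df.quantile(lower_q)
    q_high = df.quantile(upper_q)
    spread = q_high - q_low
    outside = (df < (q_low - k * spread)) | (df > (q_high + k * spread))
    return ~outside.any(axis=1)

# Hypothetical toy data standing in for `train`
train = pd.DataFrame({
    "x": [float(i) for i in range(19)] + [1000.0],  # last value is an obvious outlier
    "y": [float(i % 5) for i in range(20)],
    "label": ["a", "b"] * 10,
})
num_train = train.select_dtypes(include=["number"])

keep_quartiles = iqr_mask(num_train)         # classic 25th/75th percentiles
keep_wide = iqr_mask(num_train, 0.02, 0.98)  # percentiles used in the answer

train_cleaned = train.loc[keep_quartiles]
print(len(train_cleaned), int(keep_wide.sum()))  # 19 20
```

On this toy data the quartile-based fences drop the row containing 1000, while the 0.02/0.98 fences are wide enough to keep it. How the two settings behave on real data depends on its distribution, so it is worth checking how many rows each one removes.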

Please let us know if you have any further questions.

PS

You might also consider pandas.DataFrame.clip, which caps outlying values at chosen bounds value by value instead of dropping the whole row.
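A minimal sketch of the clipping approach follows. The 5th and 95th percentile bounds and the toy data are assumptions chosen for illustration, not values from the question. Passing per-column Series as bounds together with axis=1 makes clip apply each column's own limits.

```python
import pandas as pd

# Hypothetical toy data standing in for `train`
train = pd.DataFrame({
    "x": [float(i) for i in range(19)] + [1000.0],  # last value is an obvious outlier
    "y": [float(i % 5) for i in range(20)],
    "label": ["a", "b"] * 10,
})
num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])

# Per-column bounds (assumed percentiles; tune them for your data)
lower = num_train.quantile(0.05)
upper = num_train.quantile(0.95)

# axis=1 aligns the bound Series with the columns of num_train
clipped = num_train.clip(lower=lower, upper=upper, axis=1)

# No rows are dropped; extreme values are capped at the bounds
train_clipped = pd.concat([clipped, cat_train], axis=1)
print(train_clipped.shape)  # (20, 3)
print(train_clipped["x"].max())
```

All 20 rows survive here, and the extreme value in "x" is replaced by that column's upper bound rather than removed.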
