Detect and exclude outliers in Pandas data frame

Problem description

I have a pandas data frame with a few columns.

I know that certain rows are outliers based on the value in one particular column.

For example, the column 'Vol' has all values around 12xx and one value is 4000 (an outlier).

I would like to exclude the rows whose 'Vol' value is an outlier like this.

So, essentially, I need to put a filter on the data frame that selects all rows where the values of a certain column are within, say, 3 standard deviations of the mean.

What is an elegant way to achieve this?

Recommended answer

If you have multiple columns in your dataframe and would like to remove all rows that have an outlier in at least one column, the following expression does that in one shot.

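import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))

# Keep only the rows where every column is within 3 standard deviations of its mean
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]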

Explanation:

  • For each column, it first computes the Z-score of each value in the column, relative to the column mean and standard deviation.
  • Then it takes the absolute value of the Z-score, because the direction does not matter, only whether it is below the threshold.
  • all(axis=1) ensures that, for each row, all columns satisfy the constraint.
  • Finally, the result of this condition is used to index the dataframe.
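
The question asks about filtering on a single column ('Vol'), so the same idea can be sketched for one column using pandas alone. This is a minimal illustration, not part of the original answer: the helper name drop_outliers, the 'Price' column, and the sample data are all hypothetical, and the n_std threshold of 3 is taken from the question.

```python
import numpy as np
import pandas as pd


def drop_outliers(frame, column, n_std=3.0):
    """Return the rows whose `column` value lies within n_std standard
    deviations of that column's mean (hypothetical helper)."""
    values = frame[column]
    # ddof=0 matches the default used by scipy.stats.zscore;
    # pandas' own default for std() is ddof=1.
    z = (values - values.mean()) / values.std(ddof=0)
    return frame[z.abs() < n_std]


# Made-up sample data: 50 values near 1200 plus a single 4000 outlier.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Vol": np.append(rng.normal(1200, 10, size=50), 4000.0),
    "Price": rng.normal(100, 5, size=51),
})

filtered = drop_outliers(df, "Vol")
print(len(df), len(filtered))
```

Note that a single extreme value also inflates the mean and standard deviation it is measured against, so in very small samples a Z-score may never reach 3 at all; the threshold may need adjusting for the data at hand.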
