Remove outliers (+/- 3 std) and replace with np.nan in Python/pandas


Question

I have seen several solutions that come close to solving my problem (link1, link2), but they have not helped me succeed thus far.

I believe that the following solution is what I need, but I continue to get an error (and I don't have the reputation points to comment on it or ask about it): link

I get the following warning, but I don't understand where to call .copy() or add inplace=True when running the following command: df2 = df.groupby('install_site').transform(replace)

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: link

So I have attempted to come up with my own version, but I keep getting stuck. Here goes.

I have a data frame indexed by time, with a column for site (string values for many different sites) and a column of float values.

time_index            site       val

I would like to go through the 'val' column, grouped by site, and replace any outliers (values more than 3 standard deviations from the group mean) with NaN, separately for each group.

When I use the following function, I cannot index the data frame with my vector of True/False values:

def replace_outliers_with_nan(df, stdvs):
    dfnew=pd.DataFrame()
    for i, col in enumerate(df.sites.unique()):
        dftmp = pd.DataFrame(df[df.sites==col])
        idx = [np.abs(dftmp-dftmp.mean())<=(stdvs*dftmp.std())] #boolean vector of T/F's
        dftmp[idx==False]=np.nan  #this is where the problem lies, I believe
        dfnew[col] = dftmp
    return dfnew

In addition, I fear the above function will take a very long time on 7 million+ rows, which is why I was hoping to use the groupby option.

Recommended answer

If I have understood you correctly, there is no need to iterate over the sites. This solution replaces every value that deviates from its group mean by more than three group standard deviations with NaN.

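import numpy as np
import pandas as pd

def replace(group, stds):
    # Return a masked copy rather than assigning into the group in place,
    # which is what tends to trigger SettingWithCopyWarning.
    return group.mask(np.abs(group - group.mean()) > stds * group.std())

# df is your DataFrame; group_column is the name of the column to group by
df.loc[:, df.columns != group_column] = df.groupby(group_column).transform(lambda g: replace(g, 3))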
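
For a frame with millions of rows, it may be worth avoiding a Python-level lambda per group. The following is a minimal sketch of an alternative, not the answer's own method: it broadcasts each group's mean and standard deviation back onto the rows with transform and builds one boolean mask. The function name mask_group_outliers, the column names 'site' and 'val', and the sample data are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd


def mask_group_outliers(df, group_col, value_col, n_std=3.0):
    """Return a copy of df with per-group outliers in value_col set to NaN."""
    grouped = df.groupby(group_col)[value_col]
    # transform() returns a Series aligned with df, holding each row's
    # group statistic (pandas uses the sample std, ddof=1, by default).
    mean = grouped.transform("mean")
    std = grouped.transform("std")
    is_outlier = (df[value_col] - mean).abs() > n_std * std
    out = df.copy()  # leave the caller's frame untouched
    out[value_col] = out[value_col].mask(is_outlier)
    return out


# Hypothetical data: the reading 100.0 is extreme for site A but typical for site B.
df = pd.DataFrame(
    {
        "site": ["A"] * 21 + ["B"] * 21,
        "val": [9.0, 11.0] * 10 + [100.0] + [95.0, 105.0] * 10 + [100.0],
    },
    index=pd.date_range("2020-01-01", periods=42, freq="D"),
)
df.index.name = "time_index"

cleaned = mask_group_outliers(df, "site", "val")
# Count the masked values per site
print(cleaned["val"].isna().groupby(cleaned["site"]).sum())
```

In this sample only the 100.0 reading at site A is masked, because the threshold is computed per site. Working on an explicit copy and assigning a whole column also sidesteps the SettingWithCopyWarning mentioned in the question.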
