Remove outliers (+/- 3 std) and replace with np.nan in Python/pandas

Views: 413 | Posted: 2020/11/21 0:57:15 | Tags: python, grouping, outliers


Problem description

I have seen several solutions that come close to solving my problem (link1, link2), but they have not helped me succeed thus far.

I believe that the following solution is what I need, but I continue to get an error (and I don't have the reputation points to comment or ask about it): link

I get the following warning, but I don't understand where to call .copy() or add "inplace=True" when running the following command: df2 = df.groupby('install_site').transform(replace)

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: link
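One way this warning is commonly avoided is to have the function passed to transform return a new object instead of assigning into the group it receives. The following is only a minimal sketch under assumptions: the `replace` function from the linked solution is not shown in the question, so the version here is hypothetical, the `stds` parameter is an invented name, and the sample data is made up. Only the column names `install_site` and `val` come from the question.

```python
import numpy as np
import pandas as pd


def replace(group, stds=3):
    """Hypothetical stand-in for the `replace` passed to transform.

    Returns a new Series in which values further than `stds` standard
    deviations from the group mean are NaN.
    """
    within = (group - group.mean()).abs() <= stds * group.std()
    # Series.where keeps values where the condition is True and puts NaN
    # elsewhere; it returns a new Series, so no slice is assigned to.
    return group.where(within)


# Invented sample data: site "a" contains one obvious outlier (100.0).
df = pd.DataFrame({
    "install_site": ["a"] * 20 + ["b"] * 3,
    "val": [1.0] * 19 + [100.0] + [5.0, 6.0, 7.0],
})

# Work on an explicit copy so the original frame is left untouched.
df2 = df.copy()
df2["val"] = df.groupby("install_site")["val"].transform(replace)
```

Because `replace` never writes into its argument, there is no need for .copy() inside it or for an inplace flag; the single explicit copy is taken once, on the whole frame.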

So I have attempted to come up with my own version, but I keep getting stuck. Here goes.

I have a data frame indexed by time, with a column for site (string values for many different sites) and a column of float values.

time_index            site       val

I would like to go through the 'val' column, grouped by site, and replace any outliers (values more than 3 standard deviations from the mean of their group) with NaN.

When I use the following function, I cannot index the data frame with my vector of True/False values:

import numpy as np
import pandas as pd

def replace_outliers_with_nan(df, stdvs):
    dfnew = pd.DataFrame()
    for i, col in enumerate(df.sites.unique()):
        dftmp = pd.DataFrame(df[df.sites == col])
        idx = [np.abs(dftmp - dftmp.mean()) <= (stdvs * dftmp.std())]  # boolean vector of T/F's
        dftmp[idx == False] = np.nan  # this is where the problem lies, I believe
        dfnew[col] = dftmp
    return dfnew

In addition, I fear the above function will take a very long time on 7 million+ rows, which is why I was hoping to use the groupby option.
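For a frame of this size, one possible approach is to avoid the Python-level loop entirely and let groupby broadcast each group's mean and standard deviation back onto the rows. This is a sketch under stated assumptions, not the accepted answer to the question: the function name `mask_outliers_by_group` and its parameters are hypothetical, the column names `site` and `val` follow the layout shown above, and the sample data is invented.

```python
import numpy as np
import pandas as pd


def mask_outliers_by_group(df, group_col="site", value_col="val", n_std=3):
    """Return a copy of df with per-group outliers in value_col set to NaN."""
    grouped = df.groupby(group_col)[value_col]
    # transform returns a Series aligned with df's index, holding each
    # row's group mean / group standard deviation.
    mean = grouped.transform("mean")
    std = grouped.transform("std")
    # Groups with a single row have std == NaN; the comparison is then
    # False, so such rows are left unchanged.
    is_outlier = (df[value_col] - mean).abs() > n_std * std
    out = df.copy()
    out[value_col] = out[value_col].mask(is_outlier)
    return out


# Invented sample data indexed by time, as in the question.
times = pd.date_range("2020-01-01", periods=23, freq="D")
df = pd.DataFrame(
    {
        "site": ["north"] * 20 + ["south"] * 3,
        "val": [2.0] * 19 + [200.0] + [5.0, 6.0, 7.0],
    },
    index=times,
)

cleaned = mask_outliers_by_group(df)
```

The work here is done in a handful of vectorized passes over the column rather than one DataFrame construction per site, which is the usual reason to prefer this shape on millions of rows; actual timings would need to be measured on the real data.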
