Replace unwanted strings in pandas dataframe element wise and efficiently

Question

I have a very large dataframe (thousands x thousands); only a 5 x 3 excerpt is shown here. Time is the index.

                                  col1                col2             col3
time                                                                         
05/04/2018 05:14:52 AM             +unend           +unend                  0
05/04/2018 05:14:57 AM                 0                 0                  0
05/04/2018 05:15:02 AM            30.691             0.000              0.121
05/04/2018 05:15:07 AM            30.691             n. def.            0.108
05/04/2018 05:15:12 AM            30.715             0.000              0.105

Because the data comes from another device (df is produced by pd.read_csv(filename)), the dataframe is no longer purely of float type: it now contains unwanted strings such as +unend and n. def.. These are not the classical +infinity or NaN that df.fillna() could take care of. I would like to replace these strings with 0.0. I saw the answers to "Pandas replace type issue" and "replace string in pandas dataframe", which try to do the same thing but work column-wise or row-wise rather than element-wise. However, the comments there contain some good hints for handling the general case as well.

If I try

mask = df.apply(lambda x: x.str.contains(r'+unend|n. def.'))
df[mask] =0.0

I get error: nothing to repeat

If I do

mask = df.apply(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask]=0.0

I get a Series object with True or False for every column rather than an element-wise mask, and therefore the error TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value.

The following

mask = df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask.values]=0.0

does give me the intended result, replacing all the unwanted strings with 0.0. However, it is slow (unpythonic?), and I am not sure whether I can use a regex for the check rather than in, especially since I know there are mixed datatypes. Is there an efficient, fast, robust, yet element-wise general way to do this?

Recommended answer

As EdChum pointed out, if you need to replace all non-numeric values with 0, first use to_numeric with errors='coerce' to create NaNs for values that cannot be parsed, then convert them to 0 with fillna:

df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(0)
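To see the effect on something small, here is a minimal sketch on a hand-built frame that loosely mirrors the excerpt in the question. The sample values and shortened timestamps are made up for illustration. Note that this approach turns every unparseable entry into 0, not only +unend and n. def., and it also zeroes any NaN that was already present.

```python
import pandas as pd

# Hypothetical sample loosely mirroring the question's excerpt (timestamps shortened).
df = pd.DataFrame(
    {
        "col1": ["+unend", "0", "30.691", "30.691", "30.715"],
        "col2": ["+unend", "0", "0.000", "n. def.", "0.000"],
        "col3": ["0", "0", "0.121", "0.108", "0.105"],
    },
    index=pd.Index(
        ["05:14:52", "05:14:57", "05:15:02", "05:15:07", "05:15:12"], name="time"
    ),
)

# Column by column: anything that cannot be parsed as a number becomes NaN,
# and fillna(0) then turns those NaNs into 0.
cleaned = df.apply(pd.to_numeric, errors="coerce").fillna(0)

print(cleaned)
print(cleaned.dtypes)
```

Passing pd.to_numeric directly with errors="coerce" as a keyword is equivalent to the lambda form above, since apply forwards extra keyword arguments to the function.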


If the unwanted values are whole cell values rather than substrings, use DataFrame.isin, or see the very nice answer by Haleemur Ali:

df = df.mask(df.isin(['+unend','n. def.']), 0).astype(float)

For substrings with defined values:

+ and . are special regex characters, so they need to be escaped with \:

df = df.mask(df.astype(str).apply(lambda x: x.str.contains(r'\+unend|n\. def\.')), 0).astype(float)
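Rather than escaping by hand, the pattern can be built from the literal strings with re.escape from the standard library. This is only a sketch: the list name unwanted and the tiny sample frame are assumptions made for illustration, not part of the original answer.

```python
import re

import pandas as pd

# Literal markers from the question; re.escape neutralises '+' and '.'.
unwanted = ["+unend", "n. def."]
pattern = "|".join(re.escape(s) for s in unwanted)

# Hypothetical mixed-type sample: strings, ints and floats in the same columns.
df = pd.DataFrame(
    {
        "col1": ["+unend", 0, 30.691],
        "col2": ["x n. def. y", "0.000", 0.121],
    }
)

# Cast to str so .str.contains works on every cell, then mask the matches with 0.
mask = df.astype(str).apply(lambda col: col.str.contains(pattern))
cleaned = df.mask(mask, 0).astype(float)

print(cleaned)
```

Building the pattern this way also avoids the "nothing to repeat" error from the question, which comes from the unescaped leading + being read as a quantifier.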

Or use applymap for an element-wise check:

df = df.mask(df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) ), 0).astype(float)


print (df)
                          col1  col2   col3
time                                       
05/04/2018 05:14:52 AM   0.000   0.0  0.000
05/04/2018 05:14:57 AM   0.000   0.0  0.000
05/04/2018 05:15:02 AM  30.691   0.0  0.121
05/04/2018 05:15:07 AM  30.691   0.0  0.108
05/04/2018 05:15:12 AM  30.715   0.0  0.105
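Since the question says the frame comes from pd.read_csv(filename), another option worth considering is to declare the markers as missing values at read time through the na_values parameter of read_csv, so the columns are parsed as float from the start. This is not part of the original answer. The sketch below assumes the markers appear as whole cell values, and uses an in-memory CSV with made-up contents in place of the real file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the device's file.
raw = io.StringIO(
    "time,col1,col2,col3\n"
    "05:14:52,+unend,+unend,0\n"
    "05:14:57,0,0,0\n"
    "05:15:07,30.691,n. def.,0.108\n"
)

# Treat the device's markers as NaN while parsing, then replace NaN with 0.0.
df = pd.read_csv(
    raw, index_col="time", na_values=["+unend", "n. def."]
).fillna(0.0)

print(df)
print(df.dtypes)
```

This only matches cells that consist entirely of a marker; cells that merely contain one as a substring would still need the regex approach.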
