Replace unwanted strings in pandas dataframe element wise and efficiently
Question
I have a very large dataframe (thousands x thousands); only 5 x 3 is shown here, and time is the index:
col1 col2 col3
time
05/04/2018 05:14:52 AM +unend +unend 0
05/04/2018 05:14:57 AM 0 0 0
05/04/2018 05:15:02 AM 30.691 0.000 0.121
05/04/2018 05:15:07 AM 30.691 n. def. 0.108
05/04/2018 05:15:12 AM 30.715 0.000 0.105
As these values come from some other device (the df is produced by pd.read_csv(filename)), the dataframe, instead of being entirely of float type, ends up containing unwanted strings like +unend and n. def.. These are not the classical +infinity or NaN that df.fillna() could take care of. I would like to replace the strings with 0.0. I saw the answers Pandas replace type issue and replace string in pandas dataframe which, although they try to do the same thing, work column- or row-wise, not elementwise. However, in the comments there were some good hints for proceeding in the general case as well.
If I try to do
mask = df.apply(lambda x: x.str.contains(r'+unend|n. def.'))
df[mask] =0.0
I get error: nothing to repeat
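The error comes from the regex itself: an unescaped + at the start of an alternation branch is a quantifier with nothing before it to repeat. A minimal sketch (column names and values are illustrative, taken from the sample data) showing that escaping the + and . metacharacters makes this masking approach work:

```python
import pandas as pd

# Illustrative frame mimicking the question's data (all cells are strings)
df = pd.DataFrame({
    "col1": ["+unend", "0", "30.691"],
    "col2": ["+unend", "n. def.", "0.000"],
})

# r'+unend' fails because '+' is a quantifier; escaping it (and '.') fixes the mask
mask = df.apply(lambda x: x.str.contains(r'\+unend|n\. def\.'))
df[mask] = 0.0
print(df)
```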
If I do
mask = df.apply(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask]=0.0
I would get a Series object with True or False for every column rather than an elementwise mask, and therefore the error
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
.
The following
mask = df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask.values]=0.0
does give me the intended result, replacing all the unwanted strings with 0.0. However, it is slow (unpythonic?), and I am also not sure whether I can use a regex for the check rather than in, especially since I know there are mixed datatypes. Is there an efficient, fast, robust, but also elementwise general way to do this?
Answer
As pointed out by EdChum, if you need to replace all non-numeric values with 0, first use to_numeric with errors='coerce' to create NaNs for values that cannot be parsed, and then convert them to 0 with fillna:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(0)
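A runnable sketch of this approach, using the sample values from the question (the time index is omitted for brevity):

```python
import pandas as pd

# Sample frame mimicking the question's data: all cells read in as strings
df = pd.DataFrame({
    "col1": ["+unend", "0", "30.691"],
    "col2": ["+unend", "n. def.", "0.000"],
    "col3": ["0", "0", "0.121"],
})

# to_numeric coerces anything unparseable to NaN; fillna then maps NaN to 0
df = df.apply(pd.to_numeric, errors="coerce").fillna(0)
print(df.dtypes)
```

Note that apply forwards keyword arguments to the function, so `df.apply(pd.to_numeric, errors="coerce")` is equivalent to the lambda form above.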
If the values to replace are not substrings, use DataFrame.isin, or the very nice answer of Haleemur Ali:
df = df.mask(df.isin(['+unend','n. def.']), 0).astype(float)
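As a usage sketch (with illustrative data): isin does exact, whole-cell matching, so this only works when the bad tokens occupy the entire cell, as in the question's data.

```python
import pandas as pd

# Illustrative frame: bad tokens fill whole cells, so isin's exact match suffices
df = pd.DataFrame({
    "col1": ["+unend", "30.691"],
    "col2": ["n. def.", "0.105"],
})

# mask() replaces cells where the condition is True; the frame can then be cast to float
df = df.mask(df.isin(["+unend", "n. def."]), 0).astype(float)
```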
For matching substrings:
+ and . are special regex characters, so they need to be escaped with \:
df = df.mask(df.astype(str).apply(lambda x: x.str.contains(r'\+unend|n\. def\.')), 0).astype(float)
Or use applymap for an elementwise check:
df = df.mask(df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) ), 0).astype(float)
print (df)
col1 col2 col3
time
05/04/2018 05:14:52 AM 0.000 0.0 0.000
05/04/2018 05:14:57 AM 0.000 0.0 0.000
05/04/2018 05:15:02 AM 30.691 0.0 0.121
05/04/2018 05:15:07 AM 30.691 0.0 0.108
05/04/2018 05:15:12 AM 30.715 0.0 0.105