Replacing values with np.nan using regex

Question

I have a DataFrame as below:

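import numpy as np
import pandas as pd

# Column is named "last" so that it matches the output shown below.
data1 = {"first": ["alice", "bob", "carol"],
         "last": ["foo", "bar", "baz"]}
df = pd.DataFrame(data1)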

For example, I want to replace every character 'o' with 'a':

So I run:

df.replace({"o":"a"},regex=True)
Out[668]: 
   first last
0  alice  faa
1    bab  bar
2  caral  baz

It gives back what I need.

However, when I try to replace 'o' with np.nan, the entire string is changed to np.nan. Is there any explanation of this in the pandas documentation? I can find some information through the source code.

More information (the whole string is changed to np.nan):

df.replace({"o":np.nan},regex=True)
Out[669]: 
   first last
0  alice  NaN
1    NaN  bar
2    NaN  baz

Recommended answer

NaN is consistently used as a placeholder for missing data. When you replace part of a string with "missing", it can only mean that the entire entry is compromised. I've heard this called NaN pollution (or similar, will see if I can find some references), in that if NaN touches the data, the data is compromised.
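One way to see why a partial replacement has no obvious result: np.nan is a float, so it cannot be spliced into the middle of a string. The snippet below is only a minimal illustration of that point in plain Python, not a description of what pandas does internally.

```python
import numpy as np

# np.nan is an ordinary Python float, not a string.
nan_is_float = isinstance(np.nan, float)

# Building a string "around" NaN fails, so "f" + NaN + "o" has no meaning.
try:
    "f" + np.nan + "o"
    splice_failed = False
except TypeError:
    splice_failed = True

print(nan_is_float, splice_failed)
```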

That said, that's not always the case:

In [11]: s = pd.Series([1, 2, np.nan, 4])

In [12]: s.sum()
Out[12]: 7.0

In [13]: s.sum(skipna=False)
Out[13]: nan

In some languages you'll see the equivalent of skipna=False as the default behaviour, and some vehemently argue that NaN should always pollute all data. Pandas takes a somewhat more pragmatic approach...
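As a rough comparison, NumPy's plain sum is one example of the propagating default, while np.nansum is the opt-in way to skip missing values. The variable names here are only for illustration.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan, 4]

# pandas skips NaN unless told otherwise.
pandas_total = pd.Series(values).sum()

# NumPy lets NaN propagate through a plain sum.
numpy_total = np.array(values).sum()

# NumPy skips NaN only when asked to explicitly.
numpy_skipping = np.nansum(values)

print(pandas_total, numpy_total, numpy_skipping)
```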

The real question is: what do you expect it to do in the case of NaN?
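Depending on the answer, two explicit alternatives might look like the sketch below. It assumes the example frame from the question with the columns named first and last. The names removed, contains_o and masked are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"first": ["alice", "bob", "carol"],
                   "last": ["foo", "bar", "baz"]})

# Option 1: drop the character but keep the rest of each string.
removed = df.replace({"o": ""}, regex=True)

# Option 2: say explicitly that any entry containing "o" is missing.
contains_o = df.apply(lambda col: col.str.contains("o"))
masked = df.mask(contains_o)

print(removed)
print(masked)
```

The second option gives the same result as replacing with np.nan, but the mask makes the intent ("this whole entry is missing") visible in the code.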
