Replacing values with np.nan using regex
Question
I have a DataFrame as below:
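import numpy as np
import pandas as pd

# Column key is "last" so that it matches the output shown below
data1 = {"first": ["alice", "bob", "carol"],
         "last": ["foo", "bar", "baz"]}
df = pd.DataFrame(data1)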
For example, I want to replace every character 'o' with 'a', so I run:
df.replace({"o":"a"},regex=True)
Out[668]:
   first last
0  alice  faa
1    bab  bar
2  caral  baz
That gives me what I need.
However, when I try to replace 'o' with np.nan, the entire string is changed to np.nan. Is there any explanation of this in the pandas documentation? I can find some information in the source code.
More information (the whole string is changed to np.nan):
df.replace({"o":np.nan},regex=True)
Out[669]:
   first last
0  alice  NaN
1    NaN  bar
2    NaN  baz
Recommended answer
NaN is consistently used as a placeholder for missing data, so replacing part of a string with "missing" can only mean that the entire entry is compromised. I've heard this called NaN pollution (or something similar; I will see if I can find some references): once NaN touches a value, the data is compromised.
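To make the contrast concrete, here is a minimal sketch based on the DataFrame from the question. The variable names `partial`, `whole`, and `stripped` are mine, and the empty-string variant is only an assumption about what the asker may have wanted (deleting the character rather than marking the value as missing).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"first": ["alice", "bob", "carol"],
                   "last": ["foo", "bar", "baz"]})

# String replacement value: only the matched substring is substituted.
partial = df.replace({"o": "a"}, regex=True)

# NaN replacement value: every cell whose string matches the pattern
# becomes NaN as a whole, as shown in the question.
whole = df.replace({"o": np.nan}, regex=True)

# Assumption: if the goal is to delete the character instead,
# an empty string keeps the rest of each value intact.
stripped = df.replace({"o": ""}, regex=True)

print(partial)
print(whole)
print(stripped)
```

Whether the empty-string form is appropriate depends on what a "removed" character should mean for the data; it is shown only as a contrast to the NaN case.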
That said, that's not always the case:
In [11]: s = pd.Series([1, 2, np.nan, 4])
In [12]: s.sum()
Out[12]: 7.0
In [13]: s.sum(skipna=False)
Out[13]: nan
In some languages you'll see skipna=False as the default behaviour, and some people argue vehemently that NaN should always pollute all data. Pandas takes a somewhat more pragmatic approach...
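As one illustration of the two defaults, plain NumPy can be compared with pandas. NumPy is my choice of example here, not something the answer names: `np.sum` lets NaN propagate, while `np.nansum` and the pandas default skip it.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan, 4]

# NumPy's plain sum propagates NaN: one missing value poisons the total.
np_total = np.sum(np.array(values))

# NumPy's nansum skips NaN explicitly.
np_skipped = np.nansum(np.array(values))

# pandas skips NaN by default (skipna=True) ...
pd_default = pd.Series(values).sum()

# ... and propagates it only when asked to.
pd_strict = pd.Series(values).sum(skipna=False)

print(np_total, np_skipped, pd_default, pd_strict)
```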
The real question is: what do you expect it to do in the case of NaN?