Pandas inconsistency with regex "." dot metacharacter?
Problem description
Consider
df
Cost
Store 1 22.5
Store 1 .........
Store 2 ...
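To reproduce this, a frame like the one above could be built roughly as follows. The constructor call is an assumption on my part (the original post only shows the printed frame); it mixes one float with two dot-only strings, which matches the values array shown further down.

```python
import pandas as pd

# Assumed construction: one float and two dot-only strings in a single column.
# Mixing types like this makes pandas store the column with object dtype.
df = pd.DataFrame(
    {"Cost": [22.5, ".........", "..."]},
    index=["Store 1", "Store 1", "Store 2"],
)

print(df)
print(df["Cost"].map(type))  # Python type of each individual cell
```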
To convert these dots to NaN, I can use:
df.replace(r'^\.+$', np.nan, regex=True)
Cost
Store 1 22.5
Store 1 NaN
Store 2 NaN
What I don't understand is why the following pattern also works:
df.replace('^.+$', np.nan, regex=True)
Cost
Store 1 22.5
Store 1 NaN
Store 2 NaN
Note that, in this case, I haven't escaped the ".", so it should be treated as a match-all character, which would convert every single row to NaN. But it doesn't: only the dot-only rows are matched, even though I used the match-all character.
Contrast this with:
import re
re.sub('^.+$', '', '22.5')
''
which returns an empty string.
So what is going on?
Recommended answer
Halfway through writing this question, I realised what the problem was:
df.Cost.dtype
dtype('O')
df.Cost.values
array([22.5, '.........', '...'], dtype=object)
So the 22.5 happens to be a numeric value, and a regex replace simply skips over non-string values. Converting with astype(str) first makes this obvious:
df.astype(str).replace('.+', np.nan, regex=True)
Cost
Store 1 NaN
Store 1 NaN
Store 2 NaN
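The skipping behaviour can also be checked cell by cell. Below is a minimal sketch, assuming the frame was built as shown in the comments (that construction is my guess, not something the original post states). It also shows that the standard-library re module rejects a float outright instead of skipping it, which is why the comparison with re.sub on the string '22.5' was misleading.

```python
import re

import numpy as np
import pandas as pd

# Assumed construction of the frame from the question.
df = pd.DataFrame(
    {"Cost": [22.5, ".........", "..."]},
    index=["Store 1", "Store 1", "Store 2"],
)

# Which cells are actually strings? Only those are candidates for a regex replace.
is_str = df["Cost"].map(lambda v: isinstance(v, str))
print(is_str)

# The unescaped pattern matches any non-empty string, but the float cell
# is never tested against it, so it is left as it was.
skipped = df.replace(r"^.+$", np.nan, regex=True)
print(skipped)

# re.sub, by contrast, refuses a non-string argument instead of skipping it.
try:
    re.sub(r"^.+$", "", 22.5)
    raised = False
except TypeError:
    raised = True
print(raised)
```

If the goal is only to blank out the dot-only cells, the escaped pattern from the question is still the safer choice, since it does not depend on the column's mixed types.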
Problem solved. Leaving this up in case anyone else is confused by this.