Pandas inconsistency with regex "." dot metacharacter?


Question

Consider

df

              Cost
Store 1       22.5
Store 1  .........
Store 2        ...

To convert these dots to NaN, I can use:

df.replace(r'^\.+$', np.nan, regex=True)

         Cost
Store 1  22.5
Store 1   NaN
Store 2   NaN

What I don't understand is why the following pattern also works:

df.replace('^.+$', np.nan, regex=True)

         Cost
Store 1  22.5
Store 1   NaN
Store 2   NaN

Note that in this case I haven't escaped the dot (.), so it should be treated as the match-all character, and every single row should be converted to NaN. But that is not what happens: only the rows made up of dots are matched, even though I used the match-all character.

Contrast this with:

import re
re.sub('^.+$', '', '22.5') 
''

Which returns an empty string.

So what's going on?

Recommended answer

Halfway through writing this question, I realised what the problem was:

df.Cost.dtype
dtype('O')

df.Cost.values
array([22.5, '.........', '...'], dtype=object)

So 22.5 happens to be a numeric value stored in an object column, and the regex replacement simply skips non-string values. Doing an astype conversion makes it obvious:

df.astype(str).replace('.+', np.nan, regex=True)

         Cost
Store 1   NaN
Store 1   NaN
Store 2   NaN
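
The skipping behaviour can be imitated in plain Python. The sketch below is only an illustration of the idea, not pandas' actual implementation; the helper name replace_like_pandas is hypothetical. It tries the pattern on str elements only and passes everything else through unchanged.

```python
import re

def replace_like_pandas(values, pattern, replacement):
    """Hypothetical helper: try the regex on str elements only."""
    compiled = re.compile(pattern)
    result = []
    for value in values:
        # Non-string values (such as the float 22.5) are never tested
        # against the pattern, so they pass through untouched.
        if isinstance(value, str) and compiled.search(value):
            result.append(replacement)
        else:
            result.append(value)
    return result

cost = [22.5, ".........", "..."]

# Mixed types: the float is skipped, the dot strings are replaced.
skipped = replace_like_pandas(cost, r"^.+$", None)

# Everything cast to str first: "22.5" now matches as well.
converted = replace_like_pandas([str(v) for v in cost], r"^.+$", None)

print(skipped)
print(converted)
```

Casting to str first is what turns the unescaped dot into a true match-all here, because every element is then eligible for matching.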

Problem solved. Leaving this up in case anyone else is confused by this.
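
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It assumes the frame is built from a plain Python list that mixes a float with strings, which is a guess at how the original df was created. The variable names mixed and as_text are illustrative only.

```python
import numpy as np
import pandas as pd

# Object column holding one float and two strings, as in the question.
df = pd.DataFrame(
    {"Cost": [22.5, ".........", "..."]},
    index=["Store 1", "Store 1", "Store 2"],
)

# Regex replacement is only attempted on string elements,
# so the float 22.5 is left alone even with an unescaped dot.
mixed = df.replace(r"^.+$", np.nan, regex=True)

# After casting to str, "22.5" is a string and matches too.
as_text = df.astype(str).replace(r"^.+$", np.nan, regex=True)

print(mixed)
print(as_text)
```

If the goal is only to turn dot placeholders into NaN, the escaped pattern from the question is the safer choice, since it does not depend on the numeric values happening to be non-strings.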
