Replacing blank values (white space) with NaN in pandas


Question

I want to find all values in a pandas DataFrame that consist only of whitespace (any arbitrary amount) and replace those values with NaN.

Any ideas how this can be improved?

Basically I want to turn this:

                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz     
2000-01-05 -0.222552         4
2000-01-06 -1.176781  qux     

Into this:

                   A     B     C
2000-01-01 -0.532681   foo     0
2000-01-02  1.490752   bar     1
2000-01-03 -1.387326   foo     2
2000-01-04  0.814772   baz   NaN
2000-01-05 -0.222552   NaN     4
2000-01-06 -1.176781   qux   NaN

I've managed to do it with the code below, but man is it ugly. It's not Pythonic, and I'm sure it's not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.

for i in df.columns:
    df[i][df[i].apply(lambda i: True if re.search(r'^\s*$', str(i)) else False)]=None

It could be optimized a bit by only iterating through fields that could contain empty strings:

if df[i].dtype == np.dtype('object')

But that's not much of an improvement.

And finally, this code sets the target strings to None, which works with pandas functions like fillna(), but it would be nice for completeness if I could insert a NaN directly instead of None.

Recommended answer

I think df.replace() does the job, since pandas 0.13:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', '  '],
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

# Replace fields that are entirely whitespace (or empty) with NaN
print(df.replace(r'^\s*$', np.nan, regex=True))

Produces:

                   A    B   C
2000-01-01 -0.532681  foo   0
2000-01-02  1.490752  bar   1
2000-01-03 -1.387326  foo   2
2000-01-04  0.814772  baz NaN
2000-01-05 -0.222552  NaN   4
2000-01-06 -1.176781  qux NaN
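
On the question's two side points (only touching columns that can hold strings, and getting a real NaN rather than None), here is a minimal sketch. The small frame and the names obj_cols and cleaned are made up for illustration, and the string columns are created with dtype=object explicitly so that select_dtypes picks them up regardless of how a given pandas version infers string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; B and C are forced to object dtype on purpose.
df = pd.DataFrame(
    {
        "A": [-0.532681, 1.490752, -1.387326],
        "B": pd.Series(["foo", "   ", "qux"], dtype=object),
        "C": pd.Series([0, " ", ""], dtype=object),
    }
)

# Restrict the regex replacement to object columns, since numeric
# columns cannot contain whitespace-only strings.
obj_cols = df.select_dtypes(include="object").columns
cleaned = df.copy()
cleaned[obj_cols] = cleaned[obj_cols].replace(r"^\s*$", np.nan, regex=True)

# The replaced cells are ordinary missing values, so isna() and
# dropna() treat them like any other NaN.
print(cleaned.isna().sum())
print(cleaned.dropna())
```

pandas treats both None and np.nan as missing in object columns, so either works with fillna() and isna(); passing np.nan to replace() simply stores NaN directly.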

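As Temak pointed out, use df.replace(r'^\s+$', np.nan, regex=True) if your valid data contains whitespace: the ^ and $ anchors limit the match to values made up entirely of whitespace, so strings such as 'foo bar' are left untouched. Unlike ^\s*$, this pattern does not match empty strings.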
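
To make the difference between the patterns concrete, here is a small sketch comparing the two anchored patterns with an unanchored one. The one-column frame and the variable names are invented for this example; the column is built with dtype=object to mirror the mixed columns in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: a string with an inner space, a whitespace-only
# string, an empty string, and a plain word.
df = pd.DataFrame({"text": ["foo bar", "   ", "", "baz"]}, dtype=object)

# Anchored, zero or more whitespace characters: blank AND empty strings.
blank_or_empty = df.replace(r"^\s*$", np.nan, regex=True)

# Anchored, one or more whitespace characters: blank strings only.
blank_only = df.replace(r"^\s+$", np.nan, regex=True)

# Unanchored: any value containing whitespace anywhere is replaced.
unanchored = df.replace(r"\s+", np.nan, regex=True)

for name, result in [
    ("^\\s*$", blank_or_empty),
    ("^\\s+$", blank_only),
    ("\\s+", unanchored),
]:
    print(name, result["text"].isna().tolist())
```

With a non-string replacement such as np.nan, a value that matches the pattern is replaced as a whole, which is why the unanchored pattern also wipes out 'foo bar'.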
