将 pandas.Series 从 dtype 对象转换为 float,并将错误转换为 nans [英] Convert pandas.Series from dtype object to float, and errors to nans

查看:190
本文介绍了将 pandas.Series 从 dtype 对象转换为 float,并将错误转换为 nans的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

考虑以下情况:

In [2]: a = pd.Series([1,2,3,4,'.'])

In [3]: a
Out[3]: 
0    1
1    2
2    3
3    4
4    .
dtype: object

In [8]: a.astype('float64', raise_on_error = False)
Out[8]: 
0    1
1    2
2    3
3    4
4    .
dtype: object

我希望有一个选项可以在将错误值(例如 .)转换为 NaN 的同时进行转换.有没有办法做到这一点?

I would have expected an option that allows conversion while turning erroneous values (such as that .) to NaNs. Is there a way to achieve this?

推荐答案

使用 pd.to_numeric 带有 errors='coerce'

# Setup
s = pd.Series(['1', '2', '3', '4', '.'])
s

0    1
1    2
2    3
3    4
4    .
dtype: object

pd.to_numeric(s, errors='coerce')

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
dtype: float64

如果您需要填写NaN,请使用Series.fillna.

If you need the NaNs filled in, use Series.fillna.

pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')

0    1
1    2
2    3
3    4
4    0
dtype: float64

注意,downcast='infer' 会在可能的情况下尝试将浮点数向下转换为整数.如果您不想要,请删除该参数.

Note, downcast='infer' will attempt to downcast floats to integers where possible. Remove the argument if you don't want that.

从 v0.24+ 开始,pandas 引入了可空整数 类型,允许整数与 NaN 共存.如果您的列中有整数,你可以使用

From v0.24+, pandas introduces a Nullable Integer type, which allows integers to coexist with NaNs. If you have integers in your column, you can use

pd.__version__
# '0.24.1'

pd.to_numeric(s, errors='coerce').astype('Int32')

0      1
1      2
2      3
3      4
4    NaN
dtype: Int32

还有其他选项可供选择,请阅读文档了解更多信息.

There are other options to choose from as well, read the docs for more.

<小时>

DataFrames

的扩展

如果您需要将其扩展到 DataFrames,则需要将其应用到每一行.您可以使用 DataFrame.apply.


Extension for DataFrames

If you need to extend this to DataFrames, you will need to apply it to each row. You can do this using DataFrame.apply.

# Setup.
np.random.seed(0)
df = pd.DataFrame({
    'A' : np.random.choice(10, 5), 
    'C' : np.random.choice(10, 5), 
    'B' : ['1', '###', '...', 50, '234'], 
    'D' : ['23', '1', '...', '268', '$$']}
)[list('ABCD')]
df

   A    B  C    D
0  5    1  9   23
1  0  ###  3    1
2  3  ...  5  ...
3  3   50  2  268
4  7  234  4   $$

df.dtypes

A     int64
B    object
C     int64
D    object
dtype: object

df2 = df.apply(pd.to_numeric, errors='coerce')
df2

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

df2.dtypes

A      int64
B    float64
C      int64
D    float64
dtype: object

您也可以使用 DataFrame.transform;虽然我的测试表明这稍微慢了一点:

You can also do this with DataFrame.transform; although my tests indicate this is marginally slower:

df.transform(pd.to_numeric, errors='coerce')

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

如果您有很多列(数字;非数字),您可以通过仅在非数字列上应用 pd.to_numeric 来提高性能.

If you have many columns (numeric; non-numeric), you can make this a little more performant by applying pd.to_numeric on the non-numeric columns only.

df.dtypes.eq(object)

A    False
B     True
C    False
D     True
dtype: bool

cols = df.columns[df.dtypes.eq(object)]
# Actually, `cols` can be any list of columns you need to convert.
cols
# Index(['B', 'D'], dtype='object')

df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
# Alternatively,
# for c in cols:
#     df[c] = pd.to_numeric(df[c], errors='coerce')

df

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

沿列应用 pd.to_numeric(即 axis=0,默认值)对于长数据帧应该稍微快一点.

Applying pd.to_numeric along the columns (i.e., axis=0, the default) should be slightly faster for long DataFrames.

这篇关于将 pandas.Series 从 dtype 对象转换为 float,并将错误转换为 nans的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆