Convert pandas.Series from dtype object to float, and errors to NaNs

Question

Consider the following situation:

In [2]: a = pd.Series([1,2,3,4,'.'])

In [3]: a
Out[3]: 
0    1
1    2
2    3
3    4
4    .
dtype: object

In [8]: a.astype('float64', raise_on_error = False)
Out[8]: 
0    1
1    2
2    3
3    4
4    .
dtype: object

I would have expected an option that performs the conversion while turning erroneous values (such as the . above) into NaNs. Is there a way to achieve this?

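Recommended answer

Use pd.to_numeric with errors='coerce' (s below is the Series from the question, which is named a there):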

pd.to_numeric(s, errors='coerce')

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
dtype: float64
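
For a self-contained version of the call above, here is a minimal sketch. The setup line is an assumption (it rebuilds the Series from the question under the name s), and the variable name converted is only illustrative.

```python
import numpy as np
import pandas as pd

# Assumed setup: the same values as the Series in the question.
s = pd.Series([1, 2, 3, 4, '.'])

# errors='coerce' replaces anything that cannot be parsed as a number with NaN.
converted = pd.to_numeric(s, errors='coerce')

print(converted.dtype)         # dtype of the converted Series
print(converted.isna().sum())  # how many values could not be parsed
```

Counting the NaNs afterwards is a cheap way to see how many values were silently coerced.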

If you need the NaNs filled in, use Series.fillna:

pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')

0    1
1    2
2    3
3    4
4    0
dtype: int64

Note that downcast='infer' attempts to downcast floats to integers where possible. Remove the argument if you don't want that. Newer pandas releases (2.x) deprecate the downcast argument of fillna, so an explicit astype may be the safer choice there.
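
If you would rather not depend on downcast, a minimal sketch of the same result with an explicit cast follows. The names filled and as_int are illustrative, and the setup line is again assumed from the question.

```python
import pandas as pd

# Assumed setup: the same values as the Series in the question.
s = pd.Series([1, 2, 3, 4, '.'])

# Without downcast, fillna keeps the float64 dtype produced by to_numeric.
filled = pd.to_numeric(s, errors='coerce').fillna(0)

# Convert explicitly once no NaNs remain.
as_int = filled.astype('int64')

print(filled.dtype)
print(as_int.tolist())
```

The explicit cast only works because every remaining value is a whole number and the NaNs are already filled; casting while NaNs are still present would raise.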

Since v0.24, pandas has a nullable integer type, which allows integers to coexist with NaNs. If your column holds integers, you can use

pd.__version__
# '0.24.1'

pd.to_numeric(s, errors='coerce').astype('Int32')

0      1
1      2
2      3
3      4
4    NaN
dtype: Int32

There are other options to choose from as well; read the docs for more.
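
As one example of those options, here is a hedged sketch using the 64-bit nullable type 'Int64' instead of 'Int32'. The setup and the name nullable are assumptions for illustration. Newer pandas versions display the missing entry as <NA> rather than NaN.

```python
import pandas as pd

# Assumed setup: the same values as the Series in the question.
s = pd.Series([1, 2, 3, 4, '.'])

# Coerce first (giving float64 with NaN), then cast to the nullable integer dtype.
nullable = pd.to_numeric(s, errors='coerce').astype('Int64')

print(nullable.dtype)
print(nullable.isna().sum())  # the coerced value stays missing
```

This cast assumes the parsed values are whole numbers; a value such as 1.5 cannot be cast safely to an integer dtype.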


Extension for DataFrames

If you need to extend this to DataFrames, you will need to apply it to each column. You can do this using DataFrame.apply.

# Setup.
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'A' : np.random.choice(10, 5), 
    'C' : np.random.choice(10, 5), 
    'B' : ['1', '###', '...', 50, '234'], 
    'D' : ['23', '1', '...', '268', '$$']}
)[list('ABCD')]
df

   A    B  C    D
0  5    1  9   23
1  0  ###  3    1
2  3  ...  5  ...
3  3   50  2  268
4  7  234  4   $$

df.dtypes

A     int64
B    object
C     int64
D    object
dtype: object

df2 = df.apply(pd.to_numeric, errors='coerce')
df2

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

df2.dtypes

A      int64
B    float64
C      int64
D    float64
dtype: object

You can also do this with DataFrame.transform, although my tests indicate this is marginally slower:

df.transform(pd.to_numeric, errors='coerce')

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

If you have many columns (some numeric, some not), you can make this a little more performant by applying pd.to_numeric to the non-numeric columns only.

df.dtypes.eq(object)

A    False
B     True
C    False
D     True
dtype: bool

cols = df.columns[df.dtypes.eq(object)]
# Actually, `cols` can be any list of columns you need to convert.
cols
# Index(['B', 'D'], dtype='object')

df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
# Alternatively,
# for c in cols:
#     df[c] = pd.to_numeric(df[c], errors='coerce')

df

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN
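
Another way to pick the columns is DataFrame.select_dtypes. The sketch below uses a small made-up frame (not the one above) whose B and D columns mix strings and integers so that they are stored as object.

```python
import pandas as pd

# Illustrative data: B and D mix strings and integers, so they are object columns.
df = pd.DataFrame({
    'A': [5, 0, 3],
    'B': ['1', '###', 50],
    'D': ['23', 1, '...'],
})

# Select only the object columns and convert those.
cols = df.select_dtypes(include='object').columns
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

print(df.dtypes)
```

The already-numeric column A is left untouched, which is the point of restricting the conversion.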

Applying pd.to_numeric along the columns (i.e., axis=0, the default) should be slightly faster for long DataFrames.
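
To make the axis choice concrete, the following sketch runs both directions on a small made-up frame and checks that the values agree. With axis=0 there is one call per column, while axis=1 makes one call per row, which is the likely reason the column-wise form scales better for long frames. The names by_column, by_row and same_values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data with one object column.
df = pd.DataFrame({
    'A': [5, 0, 3],
    'B': ['1', '###', 50],
})

by_column = df.apply(pd.to_numeric, errors='coerce')       # axis=0: one call per column
by_row = df.apply(pd.to_numeric, errors='coerce', axis=1)  # axis=1: one call per row

# Compare the values as floats, treating NaN as equal to NaN.
same_values = np.allclose(
    by_column.to_numpy(dtype=float),
    by_row.to_numpy(dtype=float),
    equal_nan=True,
)
print(same_values)
```

The resulting dtypes can differ between the two forms, so the comparison is made on the values only.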
