Convert pandas.Series from dtype object to float, and errors to nans
Problem description
Consider the following situation:
In [2]: a = pd.Series([1,2,3,4,'.'])
In [3]: a
Out[3]:
0 1
1 2
2 3
3 4
4 .
dtype: object
In [8]: a.astype('float64', raise_on_error = False)
Out[8]:
0 1
1 2
2 3
3 4
4 .
dtype: object
I would have expected an option that allows the conversion while turning erroneous values (such as that .) into NaNs. Is there a way to achieve this?
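Recommended answer
Use pd.to_numeric with errors='coerce' (s below is the Series from the question, named a there):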
pd.to_numeric(s, errors='coerce')
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
dtype: float64
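To see which entries were actually coerced, one option is to compare the missing-value masks before and after conversion. This is a minimal sketch, not part of the original answer; the names converted, failed and bad_values are illustrative.

```python
import pandas as pd

# Same data as in the question; the name `s` matches the answer's examples.
s = pd.Series([1, 2, 3, 4, '.'])

# Unparseable entries become NaN instead of raising.
converted = pd.to_numeric(s, errors='coerce')

# Entries that were present originally but are NaN afterwards failed to parse.
failed = converted.isna() & s.notna()
bad_values = s[failed].tolist()

print(converted.dtype)
print(bad_values)
```

Inspecting bad_values before filling or dropping the NaNs can help confirm that only genuine placeholders (such as '.') were discarded.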
If you need the NaNs filled in, use Series.fillna.
pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')
0 1
1 2
2 3
3 4
4 0
dtype: int64
Note that downcast='infer' will attempt to downcast floats to integers where possible. Remove the argument if you don't want that.
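If you would rather not rely on the downcast argument (to my knowledge it is deprecated in recent pandas releases, which is worth checking against your installed version), an explicit cast after filling gives the same integers. This is a sketch under that assumption; the name filled is illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, '.'])

# Coerce bad values to NaN, replace them with 0, then cast explicitly.
# The cast is safe here because no NaN remains after fillna.
filled = pd.to_numeric(s, errors='coerce').fillna(0).astype('int64')

print(filled)
```

The explicit astype makes the target dtype visible in the code instead of leaving it to inference.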
From v0.24 onward, pandas provides a nullable integer type, which allows integers to coexist with NaNs. If you have integers in your column, you can use
pd.__version__
# '0.24.1'
pd.to_numeric(s, errors='coerce').astype('Int32')
0 1
1 2
2 3
3 4
4 NaN
dtype: Int32
There are other options to choose from as well; read the docs for more.
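As a small illustration of working with the nullable type, the sketch below uses 'Int64' (one of the other available widths) and shows that the missing entry is still detected by isna and skipped by sum. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, '.'])

# Coerce to float first, then cast to the nullable integer dtype.
nullable = pd.to_numeric(s, errors='coerce').astype('Int64')

# The coerced entry is reported as missing and ignored by reductions.
missing_count = int(nullable.isna().sum())
total = int(nullable.sum())

print(nullable.dtype, missing_count, total)
```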
Extension for DataFrames
If you need to extend this to DataFrames, you will need to apply it to each column. You can do this using DataFrame.apply.
# Setup.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({
'A' : np.random.choice(10, 5),
'C' : np.random.choice(10, 5),
'B' : ['1', '###', '...', 50, '234'],
'D' : ['23', '1', '...', '268', '$$']}
)[list('ABCD')]
df
   A    B  C    D
0  5    1  9   23
1  0  ###  3    1
2  3  ...  5  ...
3  3   50  2  268
4  7  234  4   $$
df.dtypes
A int64
B object
C int64
D object
dtype: object
df2 = df.apply(pd.to_numeric, errors='coerce')
df2
   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN
df2.dtypes
A int64
B float64
C int64
D float64
dtype: object
You can also do this with DataFrame.transform, although my tests indicate this is marginally slower:
df.transform(pd.to_numeric, errors='coerce')
   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN
If you have many columns (some numeric, some not), you can make this a little more performant by applying pd.to_numeric to the non-numeric columns only.
df.dtypes.eq(object)
A False
B True
C False
D True
dtype: bool
cols = df.columns[df.dtypes.eq(object)]
# Actually, `cols` can be any list of columns you need to convert.
cols
# Index(['B', 'D'], dtype='object')
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
# Alternatively,
# for c in cols:
# df[c] = pd.to_numeric(df[c], errors='coerce')
df
   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN
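The in-place assignment above overwrites the original columns. If you want to keep the raw data around, one possible variant wraps the loop in a helper that works on a copy. coerce_columns is a hypothetical name introduced here for illustration, not a pandas function, and the small frame below is a shortened stand-in for the one in the answer.

```python
import pandas as pd

def coerce_columns(frame, cols):
    """Return a copy of `frame` with `cols` converted via pd.to_numeric."""
    out = frame.copy()
    for c in cols:
        # Unparseable entries in each selected column become NaN.
        out[c] = pd.to_numeric(out[c], errors='coerce')
    return out

df = pd.DataFrame({
    'A': [5, 0, 3],
    'B': ['1', '###', 50],
    'D': ['23', '...', '$$'],
})

result = coerce_columns(df, ['B', 'D'])
print(result)
```

Passing the column list explicitly keeps the choice of what to convert in the caller's hands, whether it comes from a dtype check or from a hand-written list.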
Applying pd.to_numeric along the columns (i.e., axis=0, the default) should be slightly faster for long DataFrames.