Python pandas: how to obtain the datatypes of objects in a mixed-datatype column?
Question
Given a pandas.DataFrame with a column holding mixed datatypes, e.g.
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})
How can I obtain the datatypes of the individual objects in the column (Series)? Suppose I want to modify all entries in the Series that are of a certain type, for example multiply all integers by some factor.
I could iteratively derive a mask and use it in loc, like
m = np.array([isinstance(v, int) for v in df['mixed']])
df.loc[m, 'mixed'] *= 10
# df
#                  mixed
# 0  2020-10-04 00:00:00
# 1                 9990
# 2             a string
That does the trick, but I was wondering if there is a more pandastic way of doing this?
Recommended answer
One idea is to test for numeric values with to_numeric and errors='coerce', then keep the entries that are not missing afterwards:
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)
                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string
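One caveat worth checking: the to_numeric mask selects everything that parses as a number, not only ints. The following is a minimal sketch with made-up sample values (the float and the numeric string are assumptions, not part of the original question) comparing it with an isinstance-based mask:

```python
import pandas as pd

# Hypothetical sample data (not from the original post): an int, a float,
# a numeric string and a non-numeric string in one object column.
s = pd.Series([999, 3.5, '12', 'a string'], dtype=object)

# to_numeric-based mask: True for anything that parses as a number,
# including floats and numeric strings.
numeric_mask = pd.to_numeric(s, errors='coerce').notna()

# isinstance-based mask: True only for Python ints.
int_mask = s.map(lambda v: isinstance(v, int))

print(numeric_mask.tolist())
print(int_mask.tolist())
```

If the column can contain floats or numeric strings, the two masks differ, so the choice depends on whether "numeric" or "int" is what you actually want to select.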
Unfortunately it is slow. Here are some other ideas, with timings:
N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
In [29]: %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.26 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [30]: %timeit np.array([isinstance(v, int) for v in df['mixed']])
1.12 s ± 77.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [31]: %timeit pd.to_numeric(df['mixed'], errors='coerce').notna()
3.07 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]: %timeit ([isinstance(v, int) for v in df['mixed']])
909 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [35]: %timeit df.mixed.map(lambda x : type(x))=='int'
877 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [36]: %timeit df.mixed.map(lambda x : type(x) =='int')
842 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [37]: %timeit df.mixed.map(lambda x : isinstance(x, int))
807 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
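Because the column holds mixed values (object dtype), pandas cannot use vectorization effectively here, so an elementwise approach is necessary. Note that the variants in In [35] and In [36] compare a type object with the string 'int', which is always False; for a correct mask, compare against the type int itself or use isinstance as in In [37].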
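To answer the title question directly, the type of each element can be collected elementwise with map(type). The sketch below reuses the DataFrame from the question and adds one extra boolean entry (an assumption, for illustration only) to show that isinstance(v, int) also matches bool, whereas an exact type comparison does not:

```python
import pandas as pd

# DataFrame from the question, plus a hypothetical bool entry
# added here only to illustrate the bool/int distinction.
df = pd.DataFrame(
    {'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', True]}
)

# Type of every element in the column (elementwise, not vectorized).
types = df['mixed'].map(type)
print(types.value_counts())

# Exact-type mask: only plain ints, bools are excluded.
exact_int = types == int

# isinstance mask: bool is a subclass of int, so True is matched as well.
loose_int = df['mixed'].map(lambda v: isinstance(v, int))

# Multiply only the plain ints by a factor.
df.loc[exact_int, 'mixed'] *= 10
print(df)
```

Which mask is appropriate depends on the data; if the column never contains booleans, both give the same result.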