Python pandas: how to obtain the datatypes of objects in a mixed-datatype column?


Question

Given a pandas.DataFrame with a column holding mixed datatypes, e.g.

import numpy as np
import pandas as pd

df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})

How can I obtain the datatypes of the individual objects in the column (Series)? Suppose I want to modify all entries in the Series that are of a certain type, for example multiply all integers by some factor.

I could iteratively derive a mask and use it in loc, like this:

m = np.array([isinstance(v, int) for v in df['mixed']])

df.loc[m, 'mixed'] *= 10

# df
#                  mixed
# 0  2020-10-04 00:00:00
# 1                 9990
# 2             a string

That does the trick, but is there a more pandastic way of doing this?

Recommended answer

One idea is to test whether each value is numeric with to_numeric and errors='coerce', then keep the non-missing values:

m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)
                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string
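
Note that this mask marks everything that can be parsed as a number, not only int objects. The following is a minimal sketch of that difference; the float and the numeric string are made-up values added for illustration and are not part of the original data:

```python
import pandas as pd

# Hypothetical column: an int, a plain string, a float and a numeric string
s = pd.Series([999, 'a string', 1.5, '42'], dtype=object)

# Everything that to_numeric can parse becomes non-missing
numeric_mask = pd.to_numeric(s, errors='coerce').notna()

# Only real int objects pass the isinstance check
int_mask = s.map(lambda v: isinstance(v, int))

print(numeric_mask.tolist())  # [True, False, True, True]
print(int_mask.tolist())      # [True, False, False, False]
```

If only genuine integers should be modified, an element-wise type check is therefore the safer choice.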

Unfortunately, it is slow. Some other ideas, with timings:

N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})


In [29]: %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.26 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [30]: %timeit np.array([isinstance(v, int) for v in df['mixed']])
1.12 s ± 77.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [31]: %timeit pd.to_numeric(df['mixed'], errors='coerce').notna()
3.07 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [34]: %timeit ([isinstance(v, int) for v in df['mixed']])
909 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [35]: %timeit df.mixed.map(lambda x : type(x))=='int'
877 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [36]: %timeit df.mixed.map(lambda x : type(x) =='int')
842 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [37]: %timeit df.mixed.map(lambda x : isinstance(x, int))
807 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
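
One caveat about the variants timed above: In [35] and In [36] compare against the string 'int' rather than the type int, so the masks they produce are always False. The sketch below, which reuses the three sample values from the question, shows the difference; the variable names are chosen only for illustration:

```python
import pandas as pd

s = pd.Series([pd.Timestamp('2020-10-04'), 999, 'a string'], dtype=object)

# A type object never equals a string, so this mask is all False
wrong = s.map(lambda x: type(x) == 'int')

# Comparing the type's name against a string works
by_name = s.map(lambda x: type(x).__name__) == 'int'

# Comparing against the type object itself works as well
by_type = s.map(lambda x: type(x) is int)

print(wrong.tolist())    # [False, False, False]
print(by_name.tolist())  # [False, True, False]
print(by_type.tolist())  # [False, True, False]
```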

Because the column holds mixed values, it has object dtype, so pandas cannot use vectorization effectively here; an element-wise approach is necessary.
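
To answer the original question directly, the type of each element can be collected with an element-wise map. This is only a sketch: the extra True value is an assumption added to show that isinstance(v, int) also matches bool, since bool is a subclass of int in Python.

```python
import pandas as pd

# Column from the question, plus a bool (added here for illustration only)
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', True]})

# Type name of every element in the column
type_names = df['mixed'].map(lambda v: type(v).__name__)
print(type_names.tolist())  # ['Timestamp', 'int', 'str', 'bool']

# isinstance also matches bool, because bool is a subclass of int
loose = df['mixed'].map(lambda v: isinstance(v, int))

# An exact type check matches plain int only
strict = df['mixed'].map(lambda v: type(v) is int)

# Multiply only the plain integers by 10
df.loc[strict, 'mixed'] *= 10
print(df)
```

Calling type_names.value_counts() gives a quick overview of which types occur in the column and how often.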
