Weird Behaviour of Enumerate while using pandas dataframe


Question

I have a DataFrame (df):

import pandas as pd

df = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5],'f':[6]},index=[0])

I use enumerate on the rows:

res = [tuple(x) for x in enumerate(df.values)]
print(res)
>>> [(1, 1, 6, 4, 2, 3, 5)]  ### the elements are int type

Now, when I change the data type of one column of my DataFrame df:

df2 = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5.5],'f':[6]},index=[0])

and use enumerate again, I get:

res2 = [tuple(x) for x in enumerate(df2.values)]
print(res2)
>>> [(1, 1.0, 6.0, 4.0, 2.0, 3.0, 5.5)]  ### the elements data type has changed 

I don't understand why this happens.

I am also looking for a solution that keeps each value in its own data type. For example:

res = [(1, 1, 6, 4, 2, 3, 5.5)]

How can I achieve this?

Answer

This has nothing to do with enumerate; that's a red herring. The issue is that you are looking for mixed-type output, whereas pandas prefers storing homogeneous data. When the columns have different numeric types, df.values has to return a single NumPy array, so the integers are upcast to float.
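The upcasting can be seen without enumerate at all. The following is a minimal sketch that reuses the two frames from the question; the variable names `col_dtypes`, `all_int_kind` and `mixed_kind` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'l': [2], 'm': [3], 'k': [4], 's': [5], 'f': [6]}, index=[0])
df2 = pd.DataFrame({'a': [1], 'l': [2], 'm': [3], 'k': [4], 's': [5.5], 'f': [6]}, index=[0])

# Inside the DataFrame, every column keeps its own dtype.
col_dtypes = df2.dtypes.astype(str).to_dict()
print(col_dtypes)

# .values returns one NumPy array, which can only have one dtype.
all_int_kind = df.values.dtype.kind   # 'i': every column is an integer
mixed_kind = df2.values.dtype.kind    # 'f': the integers were upcast to float
print(all_int_kind, mixed_kind)
```

So the change in element type comes from `df2.values`, before enumerate ever sees the data.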

What you are looking for is not recommended with pandas. Your data should be int or float, not a combination. Mixing types has performance repercussions, since the only straightforward alternative is an object dtype series, which only permits operations at Python speed rather than at vectorized NumPy speed. Converting to object dtype is itself inefficient.

That said, here is how you can do it:

res2 = df2.astype(object).values.tolist()[0]

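print(res2)
# [1, 6, 4, 2, 3, 5.5]
# (columns sorted by name, as in pandas 0.19.2; pandas 0.23+ on Python 3.6+
#  keeps insertion order and prints [1, 2, 3, 4, 5.5, 6])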

One way to avoid the object conversion:

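from itertools import chain
from operator import itemgetter, methodcaller

# df2.items() yields (column name, Series) pairs; keep only the Series
iter_series = map(itemgetter(1), df2.items())
# Series.tolist() returns native Python scalars, so each column keeps its own type
res2 = list(chain.from_iterable(map(methodcaller('tolist'), iter_series)))

print(res2)
# [1, 6, 4, 2, 3, 5.5]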
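The desired output in the question also carries a leading number for each row. One further option, which is not part of the original answer and is offered only as a sketch, is `DataFrame.itertuples`. It iterates column by column, so in recent pandas versions each value should come back as a native Python scalar of its own column's type. The name `rows` is illustrative, and the leading element here is the index label rather than a counter.

```python
import pandas as pd

df2 = pd.DataFrame({'a': [1], 'l': [2], 'm': [3], 'k': [4], 's': [5.5], 'f': [6]}, index=[0])

# name=None yields plain tuples instead of namedtuples;
# index=True puts the index label first in every tuple.
rows = list(df2.itertuples(index=True, name=None))
print(rows)

# Check the Python type of each data value (the index label is skipped).
print([type(value).__name__ for value in rows[0][1:]])
```

If a counter starting at 1 is wanted instead of the index label, `index=False` combined with `enumerate(..., 1)` would be one way to build it.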

Performance benchmarking

If you want a list of tuples as output, one tuple per row, then the series-based solution performs better:

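# Python 3.6.0, Pandas 0.19.2

import pandas as pd
from operator import itemgetter, methodcaller

df2 = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5.5],'f':[6]},index=[0])

# Repeat the single row 10**5 times to get a measurable workload
n = 10**5
df2 = pd.concat([df2]*n)

def jpp_series(df2):
    # Convert column by column, then zip the columns back into row tuples
    iter_series = map(itemgetter(1), df2.items())
    return list(zip(*map(methodcaller('tolist'), iter_series)))

def jpp_object1(df2):
    # Returns a list of lists, not a list of tuples
    return df2.astype(object).values.tolist()

def jpp_object2(df2):
    # Same as jpp_object1, with each row converted to a tuple
    return list(map(tuple, df2.astype(object).values.tolist()))

# The two tuple-producing variants must give identical results
assert jpp_series(df2) == jpp_object2(df2)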

%timeit jpp_series(df2)   # 39.7 ms per loop
%timeit jpp_object1(df2)  # 43.7 ms per loop
%timeit jpp_object2(df2)  # 68.2 ms per loop
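`%timeit` is an IPython magic and will not run in a plain Python script. As a rough sketch, the same comparison could be reproduced with the standard `timeit` module. The frame name `big`, the smaller row count and the `timings` dictionary are assumptions made here to keep the run short, and the numbers will differ by machine and pandas version.

```python
import timeit
from operator import itemgetter, methodcaller

import pandas as pd

df2 = pd.DataFrame({'a': [1], 'l': [2], 'm': [3], 'k': [4], 's': [5.5], 'f': [6]}, index=[0])
big = pd.concat([df2] * 10**4)  # smaller than the original 10**5 rows

def jpp_series(frame):
    # Convert column by column, then zip the columns back into row tuples
    iter_series = map(itemgetter(1), frame.items())
    return list(zip(*map(methodcaller('tolist'), iter_series)))

def jpp_object2(frame):
    # Cast the whole frame to object dtype, then turn each row into a tuple
    return list(map(tuple, frame.astype(object).values.tolist()))

# Best of 3 repeats, 5 calls each, reported as seconds per call
timings = {
    name: min(timeit.repeat(lambda f=func: f(big), number=5, repeat=3)) / 5
    for name, func in [('jpp_series', jpp_series), ('jpp_object2', jpp_object2)]
}
for name, seconds in timings.items():
    print(f'{name}: {seconds * 1000:.1f} ms per loop')
```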
