Weird Behaviour of Enumerate while using pandas dataframe
Question
I have a dataframe (df):
df = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5],'f':[6]},index=[0])
I use enumerate on the rows:
res = [tuple(x) for x in enumerate(df.values)]
print(res)
>>> [(1, 1, 6, 4, 2, 3, 5)] ### the elements are int type
Now, when I change the data type of one column of my dataframe df:
df2 = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5.5],'f':[6]},index=[0])
and use enumerate again, I get:
res2 = [tuple(x) for x in enumerate(df2.values)]
print(res2)
>>> [(1, 1.0, 6.0, 4.0, 2.0, 3.0, 5.5)] ### the elements data type has changed
I don't understand why.
I am also looking for a solution that returns each value in its own data type. For example:
res = [(1, 1, 6, 4, 2, 3, 5.5)]
How can I achieve this?
Recommended answer
This has nothing to do with enumerate; that's a red herring. The issue is that you are looking for mixed-type output, whereas Pandas prefers storing homogeneous data. df.values returns a single NumPy array, and an array has only one dtype, so the integer columns are upcast to float as soon as one column contains a float.
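The upcasting can be checked directly by comparing the per-column dtypes with the dtype of the array that .values returns. The following is a minimal sketch and not part of the original answer; the column order is fixed explicitly (an assumption) so that the values line up with the outputs shown in the question.

```python
import numpy as np
import pandas as pd

# Assumption: fix the column order so results match the outputs in the question.
cols = ['a', 'f', 'k', 'l', 'm', 's']
df = pd.DataFrame({'a': [1], 'l': [2], 'm': [3], 'k': [4], 's': [5], 'f': [6]},
                  columns=cols)
df2 = pd.DataFrame({'a': [1], 'l': [2], 'm': [3], 'k': [4], 's': [5.5], 'f': [6]},
                   columns=cols)

# Inside the DataFrame every column keeps its own dtype.
print(df2.dtypes)

# .values builds ONE 2-D NumPy array, so a single common dtype is chosen.
int_dtype = df.values.dtype      # an integer dtype: all columns are ints
mixed_dtype = df2.values.dtype   # float64: the ints are upcast to match 's'

# Every element of the upcast row is therefore a float.
upcast_row = df2.values.tolist()[0]

# Casting to object first stores the original scalars instead of upcasting.
object_row = df2.astype(object).values.tolist()[0]
print(int_dtype, mixed_dtype, upcast_row, object_row)
```

Wrapping the array in enumerate or tuple does not change any of this, because the conversion has already happened by the time .values returns.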
What you are looking for is not recommended with Pandas. Your data type should be int or float, not a combination. This has performance repercussions, since the only straightforward alternative is to use object dtype series, which only permit operations at Python speed rather than vectorised ones. Converting to object dtype is inefficient.
This is how you can do it:
res2 = df2.astype(object).values.tolist()[0]
print(res2)
[1, 6, 4, 2, 3, 5.5]
One way to avoid the object conversion:
from itertools import chain
from operator import itemgetter, methodcaller
iter_series = map(itemgetter(1), df2.items())
res2 = list(chain.from_iterable(map(methodcaller('tolist'), iter_series)))
[1, 6, 4, 2, 3, 5.5]
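The same column-wise idea can be written without itertools or operator, and extended to return one tuple per row as the question asks. This is a sketch under assumptions rather than the answer's own code: the variable names are illustrative, and the column order is fixed explicitly so the result matches the output shown above.

```python
import pandas as pd

# Assumption: fix the column order so the result matches the output above.
cols = ['a', 'f', 'k', 'l', 'm', 's']
df2 = pd.DataFrame({'a': [1], 'l': [2], 'm': [3], 'k': [4], 's': [5.5], 'f': [6]},
                   columns=cols)

# Series.tolist() converts one column at a time, using that column's own
# dtype, so integer columns yield Python ints and 's' yields Python floats.
columns_as_lists = [df2[col].tolist() for col in df2.columns]

# zip(*...) transposes the per-column lists into one tuple per row.
rows = list(zip(*columns_as_lists))
print(rows)  # [(1, 6, 4, 2, 3, 5.5)]
```

Because the values never pass through a single 2-D array, no common dtype is imposed and no object conversion is needed.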
Performance benchmarking
If you want a list of tuples as output, one tuple per row, then the series-based solution performs better:
# Python 3.6.0, Pandas 0.19.2
df2 = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5.5],'f':[6]},index=[0])
from itertools import chain
from operator import itemgetter, methodcaller
n = 10**5
df2 = pd.concat([df2]*n)
def jpp_series(df2):
    iter_series = map(itemgetter(1), df2.items())
    return list(zip(*map(methodcaller('tolist'), iter_series)))

def jpp_object1(df2):
    return df2.astype(object).values.tolist()

def jpp_object2(df2):
    return list(map(tuple, df2.astype(object).values.tolist()))

assert jpp_series(df2) == jpp_object2(df2)
%timeit jpp_series(df2) # 39.7 ms per loop
%timeit jpp_object1(df2) # 43.7 ms per loop
%timeit jpp_object2(df2) # 68.2 ms per loop