Fastest way to split a pandas dataframe into a list of subdataframes
Question
I have a large dataframe df, for which I have a full list indices of the unique elements in df.index. I now want to create a list of all the subdataframes indexed by the elements of indices; specifically:
list_df = [df.loc[x] for x in indices]
Running this command takes ages, though (df has about 3e6 rows and 3e3 unique indices). Is this a reasonable way to perform this operation? I would be very happy to receive any comments or suggestions that could improve the performance of this and related problems.
Thanks in advance!
Recommended answer
You can use a list comprehension over a groupby object that groups by the index (level=0). Passing sort=False turns off the default sorting of the group keys, which makes the solution faster:
L = [x for i, x in df.groupby(level=0, sort=False)]
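To make the result concrete, here is a minimal sketch on a small, made-up frame (the column values and index labels are assumptions chosen for illustration, not data from the question). It also shows a dict variant, in case you want to look a piece up by its index value. Note one difference from the original loop: df.loc[x] returns a Series when x matches exactly one row, whereas groupby always yields DataFrames.

```python
import pandas as pd

# Hypothetical toy frame with a repeated, unsorted index
df = pd.DataFrame(
    {"A": list("vwxyz"), "B": [1, 2, 3, 4, 5]},
    index=[20, 10, 20, 30, 10],
)

# One sub-DataFrame per unique index value; with sort=False the
# groups come out in order of first appearance in the index
parts = [group for _, group in df.groupby(level=0, sort=False)]

# Same split, but keyed by index value for direct lookup
by_key = {key: group for key, group in df.groupby(level=0, sort=False)}

print(len(parts))
print(by_key[20])
```

The dict form keeps the speed of the single groupby pass while avoiding a second search through the list when you need a specific piece.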
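import numpy as np
import pandas as pd
# Sample data for the timings: 1000 rows, index labels drawn from 0..99
np.random.seed(123)
N = 1000
L = list('abcdefghijklmno')
df = pd.DataFrame({'A': np.random.choice(L, N),
                   'B': np.random.randint(10, size=N)},
                  index=np.random.randint(100, size=N))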
In [273]: %timeit [x for i, x in df.groupby(level=0, sort=False)]
100 loops, best of 3: 9.91 ms per loop
In [274]: %timeit [df.loc[x] for x in df.index]
1 loop, best of 3: 417 ms per loop
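Note that the second timing loops over every entry of df.index, duplicates included, so it does more lookups than the loop in the question, which visits each unique label once. If you want to compare the two approaches on equal terms on your own machine, something like the following sketch could be used. The function names split_groupby and split_loc are made up here, and split_loc uses df.loc[[x]] (list indexer) so that every piece is a DataFrame. No timings are quoted because they depend on the pandas version and hardware.

```python
import timeit

import numpy as np
import pandas as pd

# Sample data of the same shape as in the answer (assumed sizes)
rng = np.random.default_rng(123)
N = 1000
df = pd.DataFrame(
    {"A": rng.choice(list("abcdefghijklmno"), N),
     "B": rng.integers(10, size=N)},
    index=rng.integers(100, size=N),
)

def split_groupby(frame):
    # Single pass over the frame, groups in order of first appearance
    return [g for _, g in frame.groupby(level=0, sort=False)]

def split_loc(frame):
    # One label lookup per unique index value, as in the question
    return [frame.loc[[x]] for x in frame.index.unique()]

t_groupby = timeit.timeit(lambda: split_groupby(df), number=20)
t_loc = timeit.timeit(lambda: split_loc(df), number=20)
print(f"groupby: {t_groupby:.4f} s, loc per unique label: {t_loc:.4f} s")
```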