Efficient chain merge in pandas


Question

I found that straightforward chain merging with the pandas library is quite inefficient when you merge a lot of datasets with a large number of columns on the same column.

The root of the problem is the same as when we join a lot of strings the dumb way:

joined = reduce(lambda a, b: a + b, str_list)

instead of:

joined = ''.join(str_list)
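A minimal runnable sketch of that analogy (str_list here is just made-up data for illustration):

from functools import reduce

str_list = ['a', 'b', 'c'] * 1000

# quadratic: every step copies the accumulated prefix into a new string
joined_slow = reduce(lambda a, b: a + b, str_list)

# linear: a single pass and a single allocation
joined_fast = ''.join(str_list)

assert joined_slow == joined_fast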

Doing a chain merge, we copy the dataset many times (in my case almost 100 times) instead of just filling the columns from the several datasets at once or in order.

Is there some efficient way (i.e. with complexity linear in the number of datasets) to chain merge a lot of datasets on the same column?

Answer

If you have a list of dataframes dfs:

dfs = [df1, df2, df3, ... , dfn]

you can join them using pandas' concat function, which as far as I can tell is faster than chaining merge. concat only joins dataframes based on their index (not on a column), but with a little pre-processing you can simulate a merge operation.
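For concreteness, the steps below can be run against a small hypothetical dfs such as this one (three frames that all share a key column "A"; the column names and sizes are made up for illustration):

import pandas as pd
import numpy as np

dfs = [
    pd.DataFrame({'A': range(5), 'col_%d' % i: np.random.randn(5)})
    for i in range(3)
]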

First replace the index of each of your dataframes in dfs with the column you want to merge on. Let's say you want to merge on column "A":

dfs = [df.set_index("A", drop=True) for df in dfs]

Note that this will overwrite the previous indices (merge would do this anyway), so you might want to save these indices somewhere (if you are going to need them later for some reason).
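If you do need the old indices later, one possibility (not part of the original answer) is to stash them before running the set_index line above:

old_indices = [df.index for df in dfs]   # grab each frame's original index first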

Now we can use concat, which will essentially merge on the index (which is actually your column!!):

merged = pd.concat(dfs, axis=1, keys=range(len(dfs)), join='outer', copy=False)

The join= argument can be either 'inner' or 'outer' (the default). The copy= argument keeps concat from making unnecessary copies of your dataframes.
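For instance, if you only wanted the rows whose "A" value appears in every dataframe, you could pass join='inner' instead (merged_inner is just an illustrative name):

merged_inner = pd.concat(dfs, axis=1, keys=range(len(dfs)), join='inner', copy=False)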

You can then either leave "A" as the index, or you can turn it back into a column by doing:

merged.reset_index(drop=False, inplace=True)

The keys= argument is optional and assigns a key value to each dataframe (in this case I gave it a range of integers, but you could give them other labels if you want). This allows you to access columns from the original dataframes. So if you wanted to get the columns that correspond to the dataframe stored under key 20 in dfs, you can call:

merged[20]
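Because keys= turns the merged columns into a MultiIndex of (key, original column name), you can also drill down to a single column; for example, assuming the original frames have a column "B":

merged[20]['B']       # column 'B' of the dataframe stored under key 20
merged[(20, 'B')]     # the same lookup using the full MultiIndex key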

Without the keys= argument it can get confusing which columns came from which dataframe, especially if the dataframes have the same column names.

I'm still not entirely sure whether concat runs in linear time, but it is definitely faster than chaining merge:

Using ipython's %timeit on lists of randomly generated dataframes (lists of 10, 100 and 1000 dataframes):

import pandas as pd
from functools import reduce  # reduce is used for the chained-merge baseline below

def merge_with_concat(dfs, col):
    dfs = [df.set_index(col, drop=True) for df in dfs]
    merged = pd.concat(dfs, axis=1, keys=range(len(dfs)), join='outer', copy=False)
    return merged

dfs10 = [pd.util.testing.makeDataFrame() for i in range(10)] 
dfs100 = [pd.util.testing.makeDataFrame() for i in range(100)] 
dfs1000 = [pd.util.testing.makeDataFrame() for i in range(1000)] 

%timeit reduce(lambda df1, df2: df1.merge(df2, on="A", how='outer'), dfs10)
10 loops, best of 3: 45.8 ms per loop
%timeit merge_with_concat(dfs10,"A")
100 loops, best of 3: 11.7 ms per loop

%timeit merge_with_concat(dfs100,"A")
10 loops, best of 3: 139 ms per loop
%timeit reduce(lambda df1, df2: df1.merge(df2, on="A", how='outer'), dfs100)
1 loop, best of 3: 1.55 s per loop

%timeit merge_with_concat(dfs1000,"A")
1 loop, best of 3: 9.67 s per loop
%timeit reduce(lambda df1, df2: df1.merge(df2, on="A", how='outer'), dfs1000)
# I killed it after about 5 minutes so the other one is definitely faster

