Optimizing the Cartesian product between two Pandas DataFrames

Problem Description

I have two dataframes with the same columns:

DataFrame 1:

          attr_1  attr_77 ... attr_8
userID                              
John      1.2501  2.4196  ... 1.7610
Charles   0.0000  1.0618  ... 1.4813
Genarito  2.7037  4.6707  ... 5.3583
Mark      9.2775  6.7638  ... 6.0071

DataFrame 2:

          attr_1  attr_77 ... attr_8
petID                              
Firulais  1.2501  2.4196  ... 1.7610
Connie    0.0000  1.0618  ... 1.4813
PopCorn   2.7037  4.6707  ... 5.3583

I want to generate a correlation and p-value dataframe of all possible combinations; this would be the result:

   userID   petID      Correlation    p-value
0  John     Firulais   0.091447       1.222927e-02
1  John     Connie     0.101687       5.313359e-03
2  John     PopCorn    0.178965       8.103919e-07
3  Charles  Firulais   -0.078460      3.167896e-02
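
Each row of the result is simply the Pearson correlation (and its p-value) between one user's attribute vector and one pet's attribute vector. For reference, a row like John/Firulais corresponds to a call along the lines of the sketch below (the numbers in the table above are illustrative; df1 and df2 are the dataframes defined further down):

from scipy.stats import pearsonr

# Correlation between John's attribute vector and Firulais' attribute vector
corr, p_value = pearsonr(df1.loc['John'].values, df2.loc['Firulais'].values)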

The problem is that the Cartesian product generates more than 3 million tuples, which takes minutes to finish. This is my code; I've written two alternatives:

First, the initial DataFrames:

import pandas as pd

df1 = pd.DataFrame({
    'userID': ['John', 'Charles', 'Genarito', 'Mark'],
    'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
    'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
    'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]
}).set_index('userID')

df2 = pd.DataFrame({
    'petID': ['Firulais', 'Connie', 'PopCorn'],
    'attr_1': [1.2501, 0.0, 2.7037],
    'attr_77': [2.4196, 1.0618, 4.6707],
    'attr_8': [1.7610, 1.4813, 5.3583]
}).set_index('petID')

Option 1:

import numpy as np
from scipy.stats import pearsonr


# Wrapped in a function here (illustrative name) so that the return below is valid
def compute_all_pairs(df1, df2):
    # Pre-allocate space
    df1_keys = df1.index
    res_row_count = len(df1_keys) * df2.values.shape[0]
    users = np.empty(res_row_count, dtype='object')
    pets = np.empty(res_row_count, dtype='object')
    coff = np.empty(res_row_count)
    p_value = np.empty(res_row_count)

    i = 0
    for df1_key in df1_keys:
        df1_values = df1.loc[df1_key, :].values
        for df2_key in df2.index:
            df2_values = df2.loc[df2_key, :]
            pearson_res = pearsonr(df1_values, df2_values)

            users[i] = df1_key
            pets[i] = df2_key
            coff[i] = pearson_res[0]
            p_value[i] = pearson_res[1]
            i += 1

    # After the loop, create the resulting DataFrame
    return pd.DataFrame(data={
        'userID': users,
        'petID': pets,
        'Correlation': coff,
        'p-value': p_value
    })
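
A minimal usage sketch (again, the compute_all_pairs wrapper name is only illustrative):

result = compute_all_pairs(df1, df2)
print(result.head())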

Option 2 (slower) was taken from another answer.
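
Assuming Option 2 follows the merge-with-apply approach mentioned in the answer below, a rough sketch (with illustrative cross/row_pearson names, not the original code) might look like:

import pandas as pd
from scipy.stats import pearsonr

# Cross join via a temporary key column (pandas >= 1.2 can use how='cross' instead)
cross = (
    df1.reset_index().assign(key=1)
       .merge(df2.reset_index().assign(key=1), on='key', suffixes=('_user', '_pet'))
       .drop(columns='key')
)

attr_cols = df1.columns  # attr_1, attr_77, ..., attr_8

def row_pearson(row):
    # Compare the user's attribute columns against the pet's attribute columns
    user_vals = row[[c + '_user' for c in attr_cols]].values.astype(float)
    pet_vals = row[[c + '_pet' for c in attr_cols]].values.astype(float)
    corr, p = pearsonr(user_vals, pet_vals)
    return pd.Series({'Correlation': corr, 'p-value': p})

result = pd.concat([cross[['userID', 'petID']], cross.apply(row_pearson, axis=1)], axis=1)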

Is there a faster way to solve such a problem with Pandas? Or will I have no option other than parallelizing the iterations?

As the size of the dataframes increases, the second option gives a better runtime, but it still takes seconds to finish.

Thanks in advance.

Recommended Answer

Of all the alternatives tested, the one that gave me the best results was the following:

  1. An iteration product was made with itertools.product().

  2. All the iterations on both iterrows were performed on a Pool of parallel processes (using a map function).

To give it a little more performance, the compute_row_cython function was compiled with Cython, as advised in this section of the Pandas documentation:

In the cython_modules.pyx file:

from scipy.stats import pearsonr
import numpy as np

def compute_row_cython(row):
    # row is a pair of (index, Series) tuples produced by itertools.product over iterrows()
    (df1_key, df1_values), (df2_key, df2_values) = row
    # C tuple declaration; pearsonr returns (correlation, p-value)
    cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
    return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]

Then I set up the setup.py:

from distutils.core import setup
from Cython.Build import cythonize

setup(name='Compiled Pearson',
      ext_modules=cythonize("cython_modules.pyx"))

Finally, I compiled it with: python setup.py build_ext --inplace
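
As an optional sanity check (not in the original answer), you can verify that the compiled extension imports correctly:

# Should succeed without an ImportError if the build worked
from cython_modules import compute_row_cython
print(compute_row_cython)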

This leaves the final code:

import itertools
import multiprocessing
from cython_modules import compute_row_cython

NUM_CORES = multiprocessing.cpu_count() - 1

pool = multiprocessing.Pool(NUM_CORES)
# Calls to Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows())))
pool.close()
end_values = list(res)
pool.join()
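
After this, end_values holds four sequences (the user IDs, pet IDs, correlations and p-values transposed by zip(*...)). As a follow-up sketch, not part of the original answer, they can be assembled into the result DataFrame like this:

import pandas as pd

# end_values == [userIDs, petIDs, correlations, p-values]
users, pets, corrs, p_values = end_values
result = pd.DataFrame({
    'userID': list(users),
    'petID': list(pets),
    'Correlation': list(corrs),
    'p-value': list(p_values)
})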

Neither Dask nor the merge function with apply gave me better results, not even when optimizing the apply with Cython. In fact, those two alternatives gave me memory errors, and when implementing the solution with Dask I had to generate several partitions, which degraded performance because it had to perform many I/O operations.

The solution with Dask can be found in my other question.
