Pandas - convert dataframe to square matrix with pairwise combinations from index


Question

I am converting a data frame to a square matrix. The data frame has an index and only one column with floats. What I need to do is to calculate all pairs of indices, and for each pair take the mean of two associated column values. So, the usual pivot function is only part of the solution.

Currently, the function has an estimated complexity of O(n^2), which is a problem because I have to process larger inputs: data frames with several hundred rows at a time. Is there a faster approach I could take?

Example input (with integers here for simplicity):

df = pd.DataFrame([3, 4, 5])

Update: conversion logic

For an input data frame in the example:

   0
0  3
1  4
2  5

I do the following (not claiming it is the best way though):

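  • Get all pairs of indices: (0,1), (1,2), (0,2)
  • For each pair, compute the mean of the two associated values: (0,1):3.5, (1,2):4.5, (0,2):4.0
  • Build a square symmetric matrix using the indices in each pair as column and row identifiers, with zeros on the diagonal (as shown in the desired output).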

The code is in turn_table_into_square_matrix().

Desired output:

     0    1    2
0  0.0  3.5  4.0
1  3.5  0.0  4.5
2  4.0  4.5  0.0

Current implementation:

import pandas as pd
from itertools import combinations 
import time
import string
import random


def turn_table_into_square_matrix(original_dataframe):

    # get all pairs of indices 
    index_pairs = list(combinations(list(original_dataframe.index),2))

    rows_for_final_dataframe = []

    # collect new data frame row by row - the time consuming part
    for pair in index_pairs:
        subset_original_dataframe = original_dataframe[original_dataframe.index.isin(list(pair))]
        rows_for_final_dataframe.append([pair[0], pair[1], subset_original_dataframe[0].mean()])
        rows_for_final_dataframe.append([pair[1], pair[0], subset_original_dataframe[0].mean()])

    final_dataframe = pd.DataFrame(rows_for_final_dataframe)

    final_dataframe.columns = ["from", "to", "weight"]
    final_dataframe_pivot = final_dataframe.pivot(index="from", columns="to", values="weight")
    final_dataframe_pivot = final_dataframe_pivot.fillna(0)

    return final_dataframe_pivot

Code to time the performance:

for size in range(50, 600, 100):

    index = range(size)
    values = random.sample(range(0, 1000), size)
    example = pd.DataFrame(values, index)

    print ("dataframe size", example.shape)

    start_time = time.time()
    turn_table_into_square_matrix(example)
    print ("conversion time:", time.time()-start_time)

Timing results:

dataframe size (50, 1)
conversion time: 0.5455281734466553

dataframe size (150, 1)
conversion time: 5.001590013504028

dataframe size (250, 1)
conversion time: 14.562285900115967

dataframe size (350, 1)
conversion time: 31.168692111968994

dataframe size (450, 1)
conversion time: 49.07127499580383

dataframe size (550, 1)
conversion time: 78.73740792274475

Thus, a data frame with 50 rows takes only half a second to convert, whereas one with 550 rows (11 times longer) takes 79 seconds (over 11^2 times longer). Is there a faster solution to this problem?
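
As a rough check of that scaling claim, the arithmetic can be redone from the two timings reported above. This is only a sketch: the helper name pair_count is made up for illustration, and the timings are copied from the results, not re-measured.

```python
# Timings copied from the results above (number of rows -> seconds).
timings = {50: 0.5455281734466553, 550: 78.73740792274475}


def pair_count(n):
    # Number of unordered index pairs produced by combinations(range(n), 2).
    return n * (n - 1) // 2


size_ratio = 550 / 50
pair_ratio = pair_count(550) / pair_count(50)
time_ratio = timings[550] / timings[50]

print(pair_count(50), pair_count(550))  # 1225 150975
print(round(size_ratio ** 2), round(pair_ratio), round(time_ratio))  # 121 123 144
```

The measured slowdown (about 144x) is a little above the growth in the number of pairs (about 123x), which is consistent with the loop over pairs dominating the run time.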

Recommended answer

I don't think it is possible to do better than O(n^2) for that computation. As @piiipmatz suggested, you should try doing everything with numpy and then put the result in a pd.DataFrame. Your problem sounds like a good use case for numpy.add.at.

Here is a simple example:

import numpy as np
import itertools

# your original array
x = np.array([1, 4, 8, 99, 77, 23, 4, 45])
n = len(x)
# all pairs of indices in x
a, b = zip(*list(itertools.product(range(n), range(n))))
a, b = np.array(a), np.array(b)
# resulting matrix
result = np.zeros(shape=(n, n))

# accumulate the pairwise means; the index arrays are passed as a tuple (rows, columns)
np.add.at(result, (a, b), (x[a] + x[b]) / 2.0)

print(result)
# [[  1.    2.5   4.5  50.   39.   12.    2.5  23. ]
# [  2.5   4.    6.   51.5  40.5  13.5   4.   24.5]
# [  4.5   6.    8.   53.5  42.5  15.5   6.   26.5]
# [ 50.   51.5  53.5  99.   88.   61.   51.5  72. ]
# [ 39.   40.5  42.5  88.   77.   50.   40.5  61. ]
# [ 12.   13.5  15.5  61.   50.   23.   13.5  34. ]
# [  2.5   4.    6.   51.5  40.5  13.5   4.   24.5]
# [ 23.   24.5  26.5  72.   61.   34.   24.5  45. ]]
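
Note that the example above keeps the original values on the diagonal, whereas the question asks for zeros there and for a DataFrame labelled by the original index. Since each (row, column) pair occurs exactly once, plain NumPy broadcasting should give the same pairwise means without building the index arrays. The following is a minimal sketch under those assumptions, not part of the original answer; the function name pairwise_mean_matrix and its column parameter are made up for illustration.

```python
import numpy as np
import pandas as pd


def pairwise_mean_matrix(df, column=0):
    """Return a square DataFrame of pairwise means with zeros on the diagonal."""
    values = df[column].to_numpy(dtype=float)
    # A column vector plus a row vector broadcasts to every pairwise sum.
    means = (values[:, None] + values[None, :]) / 2.0
    # The question wants zeros on the diagonal instead of the values themselves.
    np.fill_diagonal(means, 0.0)
    # Reuse the original index for both the row and the column labels.
    return pd.DataFrame(means, index=df.index, columns=df.index)


df = pd.DataFrame([3, 4, 5])
result_df = pairwise_mean_matrix(df)
print(result_df)
```

For the three-row example this should reproduce the desired output from the question. The work is still O(n^2), because the result has n^2 entries, but it is done in vectorized NumPy code instead of a Python loop over pairs.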
