比较两个不同大小的矩阵以组成一个大矩阵-速度提高了吗? [英] Compare two different size matrices to make one large matrix - Speed Improvements?

查看:100
本文介绍了比较两个不同大小的矩阵以组成一个大矩阵-速度提高了吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个矩阵,我需要使用它们来创建更大的矩阵.每个矩阵只是读取的制表符分隔的文本文件.每个矩阵有48个cols,每个矩阵具有相同的标识符,但行数不同.第一个矩阵是108887x48,第二个矩阵是55482x48.对于每个矩阵,每个位置的条目可以为0或1,因此为二进制.最终输出应将第一个矩阵行ID作为行,将第二个矩阵行ID作为cols,因此最终矩阵为55482x10887.

I have two matrices, that I need to use to create a larger matrix. Each matrix is simply a tab-delimited text file that is read. Each matrix has 48 cols with identical identifiers per matrix, with different numbers of rows. The first matrix is 108887x48, and the second is 55482x48. The entries at each position, for each matrix, can be a 0 or 1, so binary. The final output should have the first matrix row ids as the rows, and the second matrix row ids as the cols, so the final matrix is 55482x10887.

这里需要发生的是,对于第一个矩阵的每一行中的每个pos,对于第二个矩阵的每一行,如果每个矩阵的pos(col)为1,则最终矩阵计数将增加1 .最终矩阵中任何pos的最大值可以是48,并且剩余的值可能为0.

What needs to happen here is that for each pos in each row in the first matrix, for every row in the second matrix, if the pos (col) for each matrix is 1, then the final matrix count will go up 1. The highest value any pos in the final matrix can be is 48, and there is expected to be 0's left over.

示例:

mat1
     A B C D
1id1 0 1 0 1
1id2 1 1 0 0
1id3 1 1 1 1
1id4 0 0 1 0

mat2
     A B C D
2id1 1 1 0 0
2id2 0 1 1 0 
2id3 1 1 1 1 
2id4 1 0 1 0

final
     2id1 2id2 2id3 2id4
1id1   1    1    2    0
1id2   2    1    2    1
1id3   2    2    4    2
1id4   0    1    1    1

我有执行此操作的代码,但是它非常缓慢,这是我主要寻求帮助的地方.我试图尽可能快地提高算法速度.它已经运行了24小时,仅完成了大约25%.我已经让它运行完了,最终的输出文件是20GB.我对数据库没有经验,如果在下面的代码片段中有人可以帮助我实现这一点,我可以在这里实现.

I have code to do this, however it is painfully slow, which is where I'm mostly asking for help. I've tried to speed up the algorithm as best as possible. It's been running for 24hrs, and is only about 25% of the way through. I have let it run through before and the final output file is 20GB. I'm not experienced with databases, and could implement it here, if osomeone could assist me on how to do so given a snippet of the code below.

#!/usr/bin/env python

import sys

mat1in = sys.argv[1]
mat2in = sys.argv[2]

print '\n######################################################################################'
print 'Generating matrix by counts from smaller matrices.'
print '########################################################################################\n'

with open(mat1in, 'r') as f:
        cols = [''] + next(f).strip().split('\t')               # First line of matrix is composed of 48 cols
        mat1 = [line.strip().split('\t') for line in f]         # Each line in matrix = 'ID': 0 or 1 per col id

with open(mat2in, 'r') as f:
        next(f)                                                 # Skip first row, col IDs are taken from mat1
        mat2 = [line.strip().split('\t') for line in f]         # Each line in matrix = 'ID': 0 or 1 per col id

out = open('final_matrix.txt', 'w')                             # Output file

#matrix = []
header = []                                                     # Final matrix header
header.append('')                                               # Add blank as first char in large matrix header
for i in mat2:
        header.append(i[0])                                     # Composed of all mat2 row ids
#matrix.append(header)

print >> out, '\t'.join(header)                                 # First print header to output file

print '\nTotal mat1 rows: ' + str(len(mat1))                    # Get total mat1 rows
print 'Total mat2 rows: ' + str(len(mat2)), '\n'                # Get total mat2 rows
print 'Progress: '                                              # Progress updated as each mat1 id is read

length = len(header)                                            # Length of header, i.e. total number of mat2 ids
totmat1 = len(mat1)                                             # Length of rows (-header), i.e. total number of mat1 ids

total = 0                                                       # Running total - for progress meter
for h in mat1:                                                  # Loop through all mat1 ids - each row in the HC matrix
        row = []                                                # Empty list for new row for large matrix
        row.append(h[0])                                        # Append mat1 id, as first item in each row
        for i in xrange(length-1):                              # For length of large matrix header (add 0 to each row) - header -1 for first '\t'
                row.extend('0')
        for n in xrange(1,49):                                  # Loop through each col id
                for k in mat2:                                  # For every row in mat2
                        if int(h[n]) == 1 and int(k[n]) == 1:   # If the pos (count for that particular col id) is 1 from mat1 and mat2 matrix;
                                pos = header.index(k[0])        # Get the position of the mat2 id
                                row[pos] = str(int(row[pos]) + 1)       # Add 1 to current position in row - [i][j] = [mat1_id][mat2_id]
        print >> out, '\t'.join(row)                            # When row is completed (All columns are compared from both mat1 and mat2 matrices; print final row to large matrix
        total += 1                                              # Update running total
        sys.stdout.write('\r\t' + str(total) + '/' + str(tvh))  # Print progress to screen
        sys.stdout.flush()

print '\n######################################################################################'
print 'Matrix complete.'
print '########################################################################################\n'

以下是对mat1中id的前30个迭代进行概要分析的内容:

Here's what profiling the first 30 iterations for the ids in mat1:

######################################################################################
Generating matrix by counts from smaller matrices.
########################################################################################


Total mat1 rows: 108887
Total mat2 rows: 55482

Progress:
        30/108887^C         2140074 function calls in 101.234 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   70.176   70.176  101.234  101.234 build_matrix.py:3(<module>)
        4    0.000    0.000    0.000    0.000 {len}
    55514    0.006    0.000    0.006    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1719942    1.056    0.000    1.056    0.000 {method 'extend' of 'list' objects}
       30    0.000    0.000    0.000    0.000 {method 'flush' of 'file' objects}
    35776   29.332    0.001   29.332    0.001 {method 'index' of 'list' objects}
       31    0.037    0.001    0.037    0.001 {method 'join' of 'str' objects}
   164370    0.589    0.000    0.589    0.000 {method 'split' of 'str' objects}
   164370    0.033    0.000    0.033    0.000 {method 'strip' of 'str' objects}
       30    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
        2    0.000    0.000    0.000    0.000 {next}
        3    0.004    0.001    0.004    0.001 {open}

我还为每个迭代计时,每个mat1 id大约需要2.5-3s,如果我正确的话,大约需要90个小时才能完成整个过程.这是整个过程中整个脚本运行所需的时间.

I've also timed each iteration, which takes about 2.5-3s per mat1 id, and if I'm correct would take about 90hrs to complete the whole thing. This is about what it took to run the entire script all the way through.

我已经编辑了一些主要的位,以通过将'0'乘以标题的长度来一步一步地删除通过追加和xrange组成列表的行.我还制作了一个mat2 ids的字典,其中索引作为值,以为dict查找比索引快.

I've edited some of the main bits, to remove making the rows by append and xrange to making the list in one step by multipling '0' by the lengthof the headers. Also I made a dictionary of the mat2 ids with the index as values to, thinking dict lookup would be quicker than index..

headdict = {}
for k,v in enumerate(header):
        headdict[v] = k

total = 0                                                       # Running total - for progress meter
for h in mat1:                                                  # Loop through all mat1 ids - each row in the HC matrix
        timestart = time.clock()
        row = [h[0]] + ['0']*(length-1)                 # Empty list for new row for large matrix
        #row.append(h[0])                                       # Append mat1 id, as first item in each row
        #for i in xrange(length-1):                             # For length of large matrix header (add 0 to each row) - header -1 for first '\t'
        #       row.append('0')
        for n in xrange(1,49):                                  # Loop through each col id
                for k in mat2:                                  # For every row in mat2
                        if int(h[n]) == 1 and int(k[n]) == 1:   # If the pos (count for that particular col id) is 1 from mat1 and mat2 matrix;
                                pos = headdict[k[0]] #header.index(k[0])        # Get the position of the mat2 id
                                row[pos] = str(int(row[pos]) + 1)       # Add 1 to current position in row - [i][j] = [mat1_id][mat2_id]
        print >> out, '\t'.join(row)                            # When row is completed (All columns are compared from both mat1 and mat2 matrices; print final row to large matrix
        total += 1                                              # Update running total
        sys.stdout.write('\r\t' + str(total) + '/' + str(totmat1))  # Print progress to screen
        #sys.stdout.flush()
        timeend = time.clock()
        print timestart - timeend

推荐答案

这只是矩阵乘法.您想将第一个矩阵乘以第二个矩阵的转置.为了有效地进行矩阵运算,请获取 NumPy .

This is just a matrix multiplication. You want to multiply the first matrix by the transpose of the second. For efficient matrix operations, get NumPy.

如果您将两个输入矩阵读入dtype numpy.int8的NumPy数组中,则计算很简单

If you read your two input matrices into NumPy arrays of dtype numpy.int8, then the computation is simply

m1.dot(m2.T)

这需要几分钟,顶部.

这篇关于比较两个不同大小的矩阵以组成一个大矩阵-速度提高了吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆