Why is my multi-threading program slow?
Problem Description
I'm trying to make my program run faster using threads, but it takes too much time. The code must compute two kinds of matrices (word_level, where I compare every pair of words between the query and a document, and sequence_level, where I compare the query to different sequences of the document). Here are the principal functions:
import threading
from threading import Thread
from os.path import join

import numpy

def sim_QxD_word(query, document, model, alpha, outOfVocab, lock):  # word_level
    sim_w = {}
    for q in set(query.split()):
        sim_w[q] = {}
        qE = []
        if q in model.vocab:
            qE = model[q]
        elif q in outOfVocab:
            qE = outOfVocab[q]
        else:
            qE = numpy.random.rand(model.layer1_size)  # random vector
            lock.acquire()
            outOfVocab[q] = qE
            lock.release()
        for d in set(document.split()):
            dE = []
            if d in model.vocab:
                dE = model[d]
            elif d in outOfVocab:
                dE = outOfVocab[d]
            else:
                dE = numpy.random.rand(model.layer1_size)  # random vector
                lock.acquire()
                outOfVocab[d] = dE
                lock.release()
            sim_w[q][d] = sim(qE, dE, alpha)
    return (sim_w, outOfVocab)

def sim_QxD_sequences(query, document, model, outOfVocab, alpha, lock):  # sequence_level
    # 1. extract document sequences
    document_sequences = []
    for i in range(len(document.split()) - len(query.split())):
        document_sequences.append(" ".join(document.split()[i:i + len(query.split())]))
    # 2. compute similarities with a query sentence
    lock.acquire()
    query_vec, outOfVocab = avg_sequenceToVec(query, model, outOfVocab, lock)
    lock.release()
    sim_QxD = {}
    for s in document_sequences:
        lock.acquire()
        s_vec, outOfVocab = avg_sequenceToVec(s, model, outOfVocab, lock)
        lock.release()
        sim_QxD[s] = sim(query_vec, s_vec, alpha)
    return (sim_QxD, outOfVocab)

def word_level(q_clean, d_text, model, alpha, outOfVocab, out_w, q, ext_id, lock):
    #print("in word_level")
    sim_w, outOfVocab = sim_QxD_word(q_clean, d_text, model, alpha, outOfVocab, lock)
    numpy.save(join(out_w, str(q) + ext_id + "word_interactions.npy"), sim_w)

def sequence_level(q_clean, d_text, model, outOfVocab, alpha, out_s, q, ext_id, lock):
    #print("in sequence_level")
    sim_s, outOfVocab = sim_QxD_sequences(q_clean, d_text, model, outOfVocab, alpha, lock)
    numpy.save(join(out_s, str(q) + ext_id + "sequence_interactions.npy"), sim_s)

def extract_AllFeatures_parall(q_clean, d_text, model, alpha, outOfVocab, out_w, q, ext_id, out_s, lock):
    #print("in extract_AllFeatures")
    thW = Thread(target=word_level, args=(q_clean, d_text, model, alpha, outOfVocab, out_w, q, ext_id, lock))
    thW.start()
    thS = Thread(target=sequence_level, args=(q_clean, d_text, model, outOfVocab, alpha, out_s, q, ext_id, lock))
    thS.start()
    thW.join()
    thS.join()

def process_documents(documents, index, model, alpha, outOfVocab, out_w, out_s, queries, stemming, stoplist, q):
    #print("in process_documents")
    q_clean = clean(queries[q], stemming, stoplist)
    lock = threading.Lock()
    for d in documents:
        ext_id, d_text = reaDoc(d, index)
        extract_AllFeatures_parall(q_clean, d_text, model, alpha, outOfVocab, out_w, q, ext_id, out_s, lock)

outOfVocab = {}  # shared variable over all threads
queries = {"1": "first query", ...}  # can contain 200 elements
....
threadsList = []
for q in queries.keys():
    thread = Thread(target=process_documents, args=(documents, index, model, alpha, outOfVocab, out_w, out_s, queries, stemming, stoplist, q))
    thread.start()
    threadsList.append(thread)
for th in threadsList:
    th.join()
How can I optimize the different functions to make the program run faster? Thanks in advance for responding.
Answer
In this answer I'm just going to focus on these lines of code:
thread = Thread(target = process_documents(documents, index, model, alpha, outOfVocab, out_w, out_s, queries, stemming, stoplist, q))
thread.start()
From the documentation, https://docs.python.org/2/library/threading.html:

target is the callable object to be invoked by the run() method. Defaults to None, meaning nothing is called.
Target should be a callable. In your code you are passing in the result of a call to process_documents. What you want is target=process_documents (i.e. pass in the function itself, which is a callable) and also pass in the args/kwargs as needed.
At the moment your code runs sequentially: every call to process_documents happens in the same thread. You need to give the thread the job you want it to do, not the result of the job.
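To make the difference concrete, here is a minimal sketch (using a hypothetical `work` function, not from the code above) contrasting the two forms:

```python
from threading import Thread

results = []

def work(name):
    # The job each thread should execute.
    results.append("done: " + name)

# Wrong: work("a") is called immediately in the main thread, and its
# return value (None) becomes the target, so the thread does nothing:
#   t = Thread(target=work("a"))

# Right: pass the callable itself, and its arguments separately.
threads = [Thread(target=work, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the joins, `results` contains one "done: ..." entry per thread, produced inside the worker threads rather than during Thread construction.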