Gensim LDA Multicore Python script runs much too slow


Problem description


I'm running the following Python script on a large dataset (around 100 000 items). Currently the execution is unacceptably slow; it would probably take at least a month to finish (no exaggeration). Obviously I would like it to run faster.


I've added a comment below to highlight where I think the bottleneck is. I have written my own database functions, which are imported.

Thanks for your help!

# -*- coding: utf-8 -*-
import database
from gensim import corpora, models, similarities, matutils
from gensim.models.ldamulticore import LdaMulticore
import pandas as pd
from sklearn import preprocessing



def getTopFiveSimilarAuthors(author, authors, ldamodel, dictionary):
    vec_bow = dictionary.doc2bow([researcher['full_proposal_text']])
    vec_lda = ldamodel[vec_bow]

    # normalization
    try:
        vec_lda = preprocessing.normalize(vec_lda)
    except:
        pass

    similar_authors = []

    for index, other_author in authors.iterrows():
        if(other_author['id'] != author['id']):
            other_vec_bow = dictionary.doc2bow([other_author['full_proposal_text']])

            other_vec_lda = ldamodel[other_vec_bow]
            # normalization
            try:
                other_vec_lda = preprocessing.normalize(vec_lda)
            except:
                pass

            sim = matutils.cossim(vec_lda, other_vec_lda)
            similar_authors.append({'id': other_author['id'], 'cosim': sim})
    similar_authors = sorted(similar_authors, key=lambda k: k['cosim'], reverse=True)
    return similar_authors[:5]


def get_top_five_similar(author, authors, ldamodel, dictionary):
    top_five_similar_authors = getTopFiveSimilarAuthors(author, authors, ldamodel, dictionary)
    database.insert_top_five_similar_authors(author['id'], top_five_similar_authors, cursor)

connection = database.connect()
authors = []
authors = pd.read_sql("SELECT id, full_text FROM author WHERE full_text IS NOT NULL;", connection)

# create the dictionary
dictionary = corpora.Dictionary([authors["full_text"].tolist()])

# create the corpus/ldamodel
author_text = []

for text in author_text['full_text'].tolist():
    word_list = []
    for word in text:
        word_list.append(word)
        author_text.append(word_list)

corpus = [dictionary.doc2bow(text) for text in author_text]
ldamodel = LdaMulticore(corpus, num_topics=50, id2word = dictionary, workers=30)

#BOTTLENECK: the script hangs after this point. 
authors.apply(lambda x: get_top_five_similar(x, authors, ldamodel, dictionary), axis=1)

Answer


I noticed these problems in your code, but I'm not sure they are the reason for the slow execution. This loop here is useless; it will never run:

 for text in author_text['full_text'].tolist():
      word_list = []
      for word in text:
         word_list.append(word)
         author_text.append(word_list)


Also, there is no need to loop over the words of each text; calling the split function on it is enough, and it yields a list of words, while looping over the authors cursor.

Try writing it like this. First:

all_authors_text = []
for author in authors:
    all_authors_text.append(author['full_text'].split())

Then create the dictionary:

dictionary = corpora.Dictionary(all_authors_text)
