Efficient algorithm for finding co-occurrence matrix of phrases

Problem description

I have a list L of around 40,000 phrases and a document of around 10 million words. What I want to check is which pairs of these phrases co-occur within a window of 4 words. For example, consider L=["brown fox","lazy dog"]. The document contains the words "a quick brown fox jumps over the lazy dog". I want to see how many times "brown fox" and "lazy dog" appear within a window of four words and store that in a file. I have the following code for doing this:

import re

# L is the list of ~40,000 phrases, defined elsewhere; the output file name is assumed here.
f = open("out.txt", "w")
content = open("d.txt", "r").read().replace("\n", " ")
for i in range(len(L)):
    for j in range(i + 1, len(L)):
        # phrase i, then 1-4 intervening words, then phrase j (and the reverse order)
        wr = L[i] + r"\W+(?:\w+\W+){1,4}" + L[j]
        wrev = L[j] + r"\W+(?:\w+\W+){1,4}" + L[i]
        phrasecoccur = len(re.findall(wr, content)) + len(re.findall(wrev, content))
        if phrasecoccur > 0:
            f.write(L[i] + ", " + L[j] + ", " + str(phrasecoccur) + "\n")

Essentially, for each pair of phrases in the list L, I am checking how many times the two phrases appear in the document content within a window of 4 words. However, this method is computationally inefficient when the list L is pretty large, like 40K elements. Is there a better way of doing this?

Recommended answer

You could use something similar to the Aho-Corasick string matching algorithm. Build the state machine from your list of phrases. Then start feeding words into the state machine. Whenever a match occurs, the state machine will tell you which phrase matched and at what word number. So your output would be something like:

"brown fox", 3
"lazy dog", 8
etc.
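
To make the single-pass idea concrete, here is a minimal word-level sketch in Python. It is not a real Aho-Corasick automaton (a package such as pyahocorasick would give you that, and would be faster); it just groups the phrases by their first word so one scan of the document yields the (word number, phrase) matches shown above. The sample phrase list, the d.txt file name and the lower-casing are assumptions carried over from the question:

import re
from collections import defaultdict

L = ["brown fox", "lazy dog"]                  # example phrase list from the question
content = open("d.txt", "r").read()

words = re.findall(r"\w+", content.lower())    # the document as a token stream
phrases = [tuple(p.lower().split()) for p in L]

# Index phrases by their first word, so each position in the document is
# only compared against the few phrases that could start there.
by_first = defaultdict(list)
for p in phrases:
    by_first[p[0]].append(p)

# One pass over the document: record (start word number, phrase) for every match.
matches = []
for i, w in enumerate(words):
    for p in by_first.get(w, ()):
        if tuple(words[i:i + len(p)]) == p:
            matches.append((i, p))

Because the scan goes left to right, matches comes out already sorted by word position.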

You can either capture all of the output and post-process it, or you can process the matches as they're found.
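
For example, continuing from the sketch above (this is just one way to do the post-processing), the sorted matches list can be turned into pair counts in one extra pass. The window rule below, 1 to 4 words between the end of one phrase and the start of the next, mirrors the {1,4} in the question's regex, and the output file name is made up:

from collections import Counter

WINDOW = 4   # at most 4 words between the end of one phrase and the start of the next

pair_counts = Counter()
for a in range(len(matches)):
    start_a, phrase_a = matches[a]
    end_a = start_a + len(phrase_a)            # index of the first word after phrase_a
    for b in range(a + 1, len(matches)):
        start_b, phrase_b = matches[b]
        gap = start_b - end_a                  # words between the two phrases
        if gap > WINDOW:
            break                              # later matches only start further away
        if gap >= 1 and phrase_a != phrase_b:
            pair_counts[frozenset((phrase_a, phrase_b))] += 1

with open("cooccur.txt", "w") as out:          # hypothetical output file name
    for pair, n in pair_counts.items():
        a_txt, b_txt = sorted(" ".join(p) for p in pair)
        out.write(a_txt + ", " + b_txt + ", " + str(n) + "\n")

The inner loop stops as soon as the gap exceeds the window, so the pairing step stays close to linear in the number of matches.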

It takes a little time to build the state machine (a few seconds for 40,000 phrases), but after that it runs in time linear in the number of input tokens, phrases, and matches.

I used something similar to match 50 million YouTube video titles against the several million song titles and artist names in the MusicBrainz database. Worked great. And very fast.
