Python and tfidf algorithm, make it faster?
Problem description
I am implementing the tf-idf algorithm in a web application using Python, however it runs extremely slow. What I basically do is:
1) Create 2 dictionaries:
- First dictionary: key (document id), value (list of all found words (incl. repeated) in doc)
- Second dictionary; key (document id), value (set containing unique words of the doc)
Now, a user requests the tf-idf results for a document d. What I do is:
2) Loop over the unique words of the second dictionary for the document d, and for each unique word w get:
2.1) tf score (how many times w appears in d: loop over the list of words in the first dictionary for the document)
2.2) df score (how many docs contain w: loop over each document's word set (second dictionary) and check whether it contains w). I am using a set since checking whether a set contains a word seems faster than checking a list.
Step 2.2 is terribly slow. For example, having 1000 documents, and for a document with 2313 unique words, it takes around 5 minutes to output the results.
Is there any other way to make step 2.2 faster? Are dictionaries that slow for iterating?
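A minimal sketch of what I described above (toy data and names are hypothetical, just to make the steps concrete):

```python
import math

# Toy corpus (hypothetical data): doc_id -> raw text
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat",
    "d3": "cats and dogs",
}

# 1) The two dictionaries described above
words = {doc_id: text.split() for doc_id, text in docs.items()}   # all words, incl. repeats
unique_words = {doc_id: set(ws) for doc_id, ws in words.items()}  # unique words per doc

def tfidf_slow(d):
    """tf-idf for every unique word of document d, computed the slow way."""
    scores = {}
    for w in unique_words[d]:
        tf = words[d].count(w)                                # 2.1) scan the full word list
        df = sum(1 for s in unique_words.values() if w in s)  # 2.2) scan every document's set
        scores[w] = tf * math.log(len(docs) / df)             # one common tf-idf weighting
    return scores
```

Step 2.2 makes this O(number of documents) for every single word, which is where the time goes.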
Well, you have to re-think and re-design the way you hold your data, or in other words, implement an "orthodox" version of an "inverted index".
Your bottleneck is the "on-the-fly" calculation of the document frequency (DF) for the terms. It would be much smarter to make this incremental: every time you update your corpus (collection of documents), do some processing and update the DF of every term in the new document (and of course, save the results in a persistent way, i.e. a database etc.).
The only structure you need is a nested dictionary like this,

{ "term1" : { "DF" : x, "some_doc_id" : tf, "some_other_doc_id" : tf, ... },
  "term2" : { ... },
  ...
}
properly updated every time you "feed" your corpus.
And of course, keep your corpus cardinality (total number of documents) somewhere...
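The structure above can be sketched like this (a minimal illustration under the stated design, not production code; `feed` and `tfidf` are hypothetical names):

```python
import math
from collections import Counter

# term -> {"DF": document frequency, doc_id: term frequency in that doc}
index = {}
corpus_size = 0  # corpus cardinality, kept alongside the index

def feed(doc_id, text):
    """Add one document to the corpus, updating DF and per-doc tf in one pass."""
    global corpus_size
    corpus_size += 1
    for term, tf in Counter(text.split()).items():
        postings = index.setdefault(term, {"DF": 0})
        postings["DF"] += 1   # the term now appears in one more document
        postings[doc_id] = tf # store this document's term frequency

def tfidf(doc_id):
    """tf-idf for every term of doc_id: one dictionary lookup per term, no corpus scan."""
    return {
        term: postings[doc_id] * math.log(corpus_size / postings["DF"])
        for term, postings in index.items()
        if doc_id in postings
    }

feed("d1", "the cat sat on the mat")
feed("d2", "the dog sat")
```

With this layout the DF of a term is read directly from the index instead of being recomputed by scanning every document, so the per-query cost no longer grows with the corpus size.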
As a hobby and as part of my work, I am implementing a small Python/Redis-backed search engine. You might get some other ideas from it as well. Take a look here.