Most efficient histogram code in Python

Views: 337 Published: 2020/11/23 6:38:34 Tags: python performance histogram tf-idf

Problem description

I've seen a number of questions about building histograms in clean one-liners, but I haven't yet found anyone trying to build them as efficiently as possible. I'm currently creating a lot of tf-idf vectors for a search algorithm, which involves creating many histograms, and my current code, while very short and readable, is not as fast as I would like. Sadly, the other methods I've tried turned out far slower. Can you do it faster? cleanStringVector is a list of strings (all lowercase, no punctuation), and masterWordList is also a list of words, which should contain every word in cleanStringVector.

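from collections import Counter

def tfidfVector(cleanStringVector, masterWordList):
    # Count how often each word occurs in the input document.
    frequencyHistogram = Counter(cleanStringVector)
    # Read the counts back in master-list order; absent words yield 0.
    featureVector = [frequencyHistogram[word] for word in masterWordList]
    return featureVector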

It's worth noting that the Counter object returns zero for non-existent keys instead of raising a KeyError. That is a serious plus, and most of the histogram methods in other questions fail this test.
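
To make the missing-key behaviour concrete, here is a minimal sketch comparing a Counter with a plain dict built from the same counts. The sample words and variable names are illustrative assumptions, not part of the original code.

```python
from collections import Counter

# Illustrative sample data (an assumption, not taken from the question's test cases).
counts = Counter(["apple", "orange", "apple"])
plain = dict(counts)

# Counter returns 0 for a word it has never seen, and does not insert the key.
missing_from_counter = counts["cucumber"]

# A plain dict raises KeyError on direct indexing.
try:
    plain["cucumber"]
    raised = False
except KeyError:
    raised = True

# With a plain dict, .get with a default of 0 gives the same result as Counter.
missing_from_dict = plain.get("cucumber", 0)
```

Any replacement for Counter therefore needs some equivalent of that default, such as dict.get(word, 0).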

Example: If I have the following data:

["apple", "orange", "tomato", "apple", "apple"]
["tomato", "tomato", "orange"]
["apple", "apple", "apple", "cucumber"]
["tomato", "orange", "apple", "apple", "tomato", "orange"]
["orange", "cucumber", "orange", "cucumber", "tomato"]

And a master wordlist of:

["apple", "orange", "tomato", "cucumber"]

I would like each test case to return the following, respectively:

[3, 1, 1, 0]
[0, 1, 2, 0]
[3, 0, 0, 1]
[2, 2, 2, 0]
[0, 2, 1, 2]
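
As a sanity check, the example data and expected vectors can be run against the Counter-based function from the question. The names testCases, expected and results are introduced here only for illustration.

```python
from collections import Counter

def tfidfVector(cleanStringVector, masterWordList):
    # Same logic as the function in the question.
    frequencyHistogram = Counter(cleanStringVector)
    return [frequencyHistogram[word] for word in masterWordList]

masterWordList = ["apple", "orange", "tomato", "cucumber"]

testCases = [
    ["apple", "orange", "tomato", "apple", "apple"],
    ["tomato", "tomato", "orange"],
    ["apple", "apple", "apple", "cucumber"],
    ["tomato", "orange", "apple", "apple", "tomato", "orange"],
    ["orange", "cucumber", "orange", "cucumber", "tomato"],
]

expected = [
    [3, 1, 1, 0],
    [0, 1, 2, 0],
    [3, 0, 0, 1],
    [2, 2, 2, 0],
    [0, 2, 1, 2],
]

# Compute one feature vector per test case, in master-list order.
results = [tfidfVector(case, masterWordList) for case in testCases]
```

A faster candidate implementation can be dropped in place of tfidfVector and checked against the same expected list before it is timed.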

I hope that helps.

Approximate final results:

Original Method: 3.213
OrderedDict: 5.529
UnorderedDict: 0.190
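
The code behind these labels is not shown here, so the following is only a sketch of how such a comparison could be set up. It assumes, without confirmation from the source, that a plain-dict variant counts words by hand and uses .get with a default of 0. The function names counter_vector and dict_vector, the sample data and the repetition count are all hypothetical.

```python
import timeit
from collections import Counter

def counter_vector(words, master_word_list):
    # Counter-based approach, as in the question.
    counts = Counter(words)
    return [counts[w] for w in master_word_list]

def dict_vector(words, master_word_list):
    # Hypothetical plain-dict alternative (an assumption, not the measured code).
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # .get(w, 0) reproduces Counter's zero for words that never occurred.
    return [counts.get(w, 0) for w in master_word_list]

master = ["apple", "orange", "tomato", "cucumber"]
sample = ["apple", "orange", "tomato", "apple", "apple"]

# Both variants must agree before their timings are worth comparing.
counter_result = counter_vector(sample, master)
dict_result = dict_vector(sample, master)

for fn in (counter_vector, dict_vector):
    # Timings depend on the machine, the Python version and the input size.
    elapsed = timeit.timeit(lambda: fn(sample, master), number=10000)
    print(f"{fn.__name__}: {elapsed:.3f} s")
```

Timings from a toy input like this one will not match the figures above. A realistic benchmark would use documents and a master word list of the sizes the search algorithm actually handles.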
