Most efficient histogram code in Python
Question
I've seen a number of questions on making histograms in clean one-liners, but I haven't yet found anyone trying to make them as efficiently as possible. I'm currently creating a lot of tf-idf vectors for a search algorithm, which involves building a number of histograms. My current code, while very short and readable, is not as fast as I would like. Sadly, the other methods I've tried turned out far slower. Can you do it faster? cleanStringVector is a list of strings (all lowercase, no punctuation), and masterWordList is also a list of words that should contain every word in cleanStringVector.
from collections import Counter

def tfidfVector(cleanStringVector, masterWordList):
    frequencyHistogram = Counter(cleanStringVector)
    featureVector = [frequencyHistogram[word] for word in masterWordList]
    return featureVector
It's worth noting that the Counter object returns zero for non-existent keys instead of raising a KeyError. That is a serious plus, and most of the histogram methods in other questions fail this test.
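For illustration, here is a minimal sketch of that behaviour. The sample words and variable names are made up for the example; the only library call involved is the standard `collections.Counter`.

```python
from collections import Counter

counts = Counter(["apple", "orange", "apple"])

# Looking up a word that was never counted yields 0 instead of raising KeyError.
missing_from_counter = counts["cucumber"]

# The lookup does not add the missing key to the Counter as a side effect.
still_absent = "cucumber" not in counts

# A plain dict raises KeyError on direct indexing; dict.get with a default
# gives the same zero-for-missing behaviour.
plain = dict(counts)
missing_from_dict = plain.get("cucumber", 0)

print(missing_from_counter, still_absent, missing_from_dict)
```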
Example: If I have the following data:
["apple", "orange", "tomato", "apple", "apple"]
["tomato", "tomato", "orange"]
["apple", "apple", "apple", "cucumber"]
["tomato", "orange", "apple", "apple", "tomato", "orange"]
["orange", "cucumber", "orange", "cucumber", "tomato"]
And a master wordlist of:
["apple", "orange", "tomato", "cucumber"]
I would like a return of the following from each test case respectively:
[3, 1, 1, 0]
[0, 1, 2, 0]
[3, 0, 0, 1]
[2, 2, 2, 0]
[0, 2, 1, 2]
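As a sanity check, the Counter-based function from the question can be run against the sample data. This is only a sketch: the names `documents`, `expected`, and `results` are introduced here for the example and are not part of the original code.

```python
from collections import Counter

def tfidfVector(cleanStringVector, masterWordList):
    # Count every word once, then read the counts out in master-list order.
    frequencyHistogram = Counter(cleanStringVector)
    return [frequencyHistogram[word] for word in masterWordList]

masterWordList = ["apple", "orange", "tomato", "cucumber"]

documents = [
    ["apple", "orange", "tomato", "apple", "apple"],
    ["tomato", "tomato", "orange"],
    ["apple", "apple", "apple", "cucumber"],
    ["tomato", "orange", "apple", "apple", "tomato", "orange"],
    ["orange", "cucumber", "orange", "cucumber", "tomato"],
]

expected = [
    [3, 1, 1, 0],
    [0, 1, 2, 0],
    [3, 0, 0, 1],
    [2, 2, 2, 0],
    [0, 2, 1, 2],
]

results = [tfidfVector(doc, masterWordList) for doc in documents]
print(results == expected)
```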
Any help would be appreciated.
Approximate final results:
Original Method: 3.213
OrderedDict: 5.529
UnorderedDict: 0.190
Recommended answer
With Python 3, this improves the runtime in my unrepresentative micro-benchmark by one order of magnitude:
# Build the word -> index lookup once, outside the function.
mapping = dict((w, i) for i, w in enumerate(masterWordList))

def tfidfVector(cleanStringVector, masterWordList):
    featureVector = [0] * len(masterWordList)
    for w in cleanStringVector:
        # Raises KeyError if w is not in masterWordList.
        featureVector[mapping[w]] += 1
    return featureVector
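The same idea can be packaged so the mapping is not a module-level global. The sketch below is one possible arrangement, not the answerer's code: `make_vectorizer`, `vectorize`, and `counter_vector` are hypothetical names, the master list is assumed to contain no duplicate words, and unknown words are silently skipped here (an assumption that differs from the code above, which would raise KeyError). The timing lines use the standard `timeit` module; the numbers depend on the machine and data, so none are quoted.

```python
import timeit
from collections import Counter

def make_vectorizer(master_word_list):
    """Build the word -> index mapping once and return a counting function."""
    mapping = {word: index for index, word in enumerate(master_word_list)}
    size = len(master_word_list)

    def vectorize(clean_string_vector):
        feature_vector = [0] * size
        for word in clean_string_vector:
            index = mapping.get(word)
            if index is not None:  # assumption: skip words not in the master list
                feature_vector[index] += 1
        return feature_vector

    return vectorize

def counter_vector(clean_string_vector, master_word_list):
    # The Counter-based approach from the question, kept for comparison.
    histogram = Counter(clean_string_vector)
    return [histogram[word] for word in master_word_list]

master = ["apple", "orange", "tomato", "cucumber"]
doc = ["tomato", "orange", "apple", "apple", "tomato", "orange"]

vectorize = make_vectorizer(master)
vector = vectorize(doc)

# Rough comparison only; results vary between runs and machines.
counter_time = timeit.timeit(lambda: counter_vector(doc, master), number=10000)
mapping_time = timeit.timeit(lambda: vectorize(doc), number=10000)
print(vector)
print("Counter: %.4f s, mapping: %.4f s" % (counter_time, mapping_time))
```

Building the mapping once pays off when many documents are vectorized against the same master list, since each call then does one dictionary lookup per word in the document instead of one per word in the master list.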