Most efficient histogram code in Python


Question

I've seen a number of questions about building histograms in clean one-liners, but I haven't found anyone trying to make them as efficient as possible. I'm currently creating a lot of tfidf vectors for a search algorithm, which involves building many histograms. My current code is very short and readable, but it is not as fast as I would like. Sadly, the other methods I've tried turned out far slower. Can you do it faster? cleanStringVector is a list of strings (all lowercase, no punctuation), and masterWordList is also a list of words, which should contain every word in cleanStringVector.

from collections import Counter
def tfidfVector(cleanStringVector, masterWordList):
    frequencyHistogram = Counter(cleanStringVector)
    featureVector = [frequencyHistogram[word] for word in masterWordList]
    return featureVector

It's worth noting that the Counter object returns zero for non-existent keys instead of raising a KeyError. That is a serious plus, and most of the histogram methods in other questions fail this test.
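A minimal sketch of the Counter behavior described above, using made-up sample words. It also shows that a plain dict needs .get() with a default to behave the same way; the variable names are illustrative only.

```python
from collections import Counter

counts = Counter(["apple", "orange", "apple"])

# A missing key yields 0 rather than raising KeyError...
missing = counts["cucumber"]

# ...and the lookup does not insert the key into the Counter.
still_absent = "cucumber" not in counts

# A plain dict raises KeyError on counts["cucumber"], so it needs
# .get() with an explicit default to give the same result.
plain = dict(counts)
plain_missing = plain.get("cucumber", 0)

print(missing, still_absent, plain_missing)
```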

Example: If I have the following data:

["apple", "orange", "tomato", "apple", "apple"]
["tomato", "tomato", "orange"]
["apple", "apple", "apple", "cucumber"]
["tomato", "orange", "apple", "apple", "tomato", "orange"]
["orange", "cucumber", "orange", "cucumber", "tomato"]

And a master word list of:

["apple", "orange", "tomato", "cucumber"]

I would like the following returned for each test case, respectively:

[3, 1, 1, 0]
[0, 1, 2, 0]
[3, 0, 0, 1]
[2, 2, 2, 0]
[0, 2, 1, 2]
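As a quick check, the Counter-based function from the question can be run over the sample data and compared with the vectors listed above. The names documents, expected and results are introduced here purely for illustration.

```python
from collections import Counter

def tfidfVector(cleanStringVector, masterWordList):
    # Count every word once, then read the counts out in master-list order.
    frequencyHistogram = Counter(cleanStringVector)
    return [frequencyHistogram[word] for word in masterWordList]

documents = [
    ["apple", "orange", "tomato", "apple", "apple"],
    ["tomato", "tomato", "orange"],
    ["apple", "apple", "apple", "cucumber"],
    ["tomato", "orange", "apple", "apple", "tomato", "orange"],
    ["orange", "cucumber", "orange", "cucumber", "tomato"],
]
masterWordList = ["apple", "orange", "tomato", "cucumber"]

expected = [
    [3, 1, 1, 0],
    [0, 1, 2, 0],
    [3, 0, 0, 1],
    [2, 2, 2, 0],
    [0, 2, 1, 2],
]

results = [tfidfVector(doc, masterWordList) for doc in documents]
for vector in results:
    print(vector)
```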

I hope that helps.

Approximate final results:

Original Method: 3.213
OrderedDict: 5.529
UnorderedDict: 0.190

Recommended answer

This improves the runtime in my unrepresentative micro-benchmark by one order of magnitude with Python 3:

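# Build the word -> index lookup once, outside the function, so every call reuses it.
mapping = {w: i for i, w in enumerate(masterWordList)}

def tfidfVector(cleanStringVector, masterWordList):
    # Start from all zeros so words absent from this document keep a count of 0.
    featureVector = [0] * len(masterWordList)
    for w in cleanStringVector:
        # Raises KeyError if w is not in masterWordList.
        featureVector[mapping[w]] += 1
    return featureVector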
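The answer's function depends on a module-level mapping and raises KeyError for any word missing from masterWordList. One possible variant, sketched here under the assumption that unknown words should simply be skipped, wraps the mapping in a closure so it is still built only once. The names make_vectorizer, index_of and vectorize are hypothetical and do not come from the answer, and the extra .get() check may cost some of the speed-up.

```python
def make_vectorizer(master_word_list):
    # Hypothetical helper: build the word -> index lookup a single time.
    index_of = {word: i for i, word in enumerate(master_word_list)}
    size = len(master_word_list)

    def vectorize(clean_string_vector):
        feature_vector = [0] * size
        for word in clean_string_vector:
            i = index_of.get(word)
            # Assumption: words outside the master list are ignored
            # instead of raising KeyError.
            if i is not None:
                feature_vector[i] += 1
        return feature_vector

    return vectorize

vectorize = make_vectorizer(["apple", "orange", "tomato", "cucumber"])
known = vectorize(["apple", "orange", "tomato", "apple", "apple"])
with_unknown = vectorize(["tomato", "banana", "tomato"])
print(known, with_unknown)
```

If every word is guaranteed to be in the master list, as the question states, the direct index_of[word] lookup from the answer avoids the extra branch.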
