sklearn: How to speed up a vectorizer (e.g. TfidfVectorizer)
Question
After thoroughly profiling my program, I have been able to pinpoint that it is being slowed down by the vectorizer.
I am working on text data, and two lines of simple TF-IDF unigram vectorization are taking up 99.2% of the total execution time.
Here is a runnable example (it downloads a 3 MB training file to your disk; omit the urllib parts to run it on your own sample):
#####################################
# Loading Data
#####################################
import urllib.request

import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the training file once and save it locally
raw = urllib.request.urlopen("https://s3.amazonaws.com/hr-testcases/597/assets/trainingdata.txt").read()
with open("to_delete.txt", "wb") as f:
    f.write(raw)

###
def extract_training():
    # The first line holds the number of samples; each following line is
    # a one-digit label, a separator character, then the text
    X = []
    y = []
    with open("to_delete.txt") as f:
        N = int(f.readline())
        for i in range(N):
            line = f.readline()
            label, text = int(line[0]), line[2:]
            X.append(text)
            y.append(label)
    return X, y

X_train, y_train = extract_training()

#############################################
# Extending Tfidf to have only stemmed features
#############################################
english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        # Stem every token produced by the default analyzer
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))

#############################################
# Line below takes 6-7 seconds on my machine
#############################################
Xv = tfidf.fit_transform(X_train)
I tried converting the list X_train into an np.array, but there was no difference in performance.
Recommended answer
Unsurprisingly, it's NLTK that is slow:
>>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 4.89 s per loop
>>> tfidf = TfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 415 ms per loop
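The `%timeit` magic above only works inside IPython. Outside of it, a comparable "best of 3" measurement can be sketched with the standard `timeit` module. The helper name `best_of` is made up for this illustration, and the workload timed here is only a placeholder so that the snippet runs on its own:

```python
import timeit

def best_of(fn, repeat=3, number=1):
    """Return the fastest time in seconds per call over `repeat` runs."""
    timings = timeit.repeat(fn, repeat=repeat, number=number)
    return min(timings) / number

# Placeholder workload (assumption); with the vectorizers above you would
# pass something like: lambda: tfidf.fit_transform(X_train)
elapsed = best_of(lambda: sum(range(100000)))
print("best of 3: %.4f s per loop" % elapsed)
```

Taking the minimum rather than the mean follows the same idea as `%timeit`: slower runs mostly reflect interference from other processes, not the code being measured.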
You can speed this up by using a smarter implementation of the Snowball stemmer, e.g., PyStemmer:
>>> import Stemmer
>>> english_stemmer = Stemmer.Stemmer('en')
>>> class StemmedTfidfVectorizer(TfidfVectorizer):
...     def build_analyzer(self):
...         analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
...         return lambda doc: english_stemmer.stemWords(analyzer(doc))
...
>>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 650 ms per loop
NLTK is a teaching toolkit. It's slow by design, because it's optimized for readability.
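If switching libraries is not possible, one further option, which is not part of the answer above and was not timed on this data set, is to memoize the stemmer. A corpus repeats the same words many times, so each distinct word only needs to be stemmed once. The sketch below uses only the standard library; `make_cached_stemmer` and `toy_stem` are hypothetical names, and `toy_stem` is a crude stand-in for a real one-word stemming function such as `english_stemmer.stem`:

```python
from functools import lru_cache

def make_cached_stemmer(stem, maxsize=None):
    """Wrap a one-word stemming function so repeated words hit a cache."""
    return lru_cache(maxsize=maxsize)(stem)

# Toy stand-in for a real stemmer (assumption: any callable that maps
# one word to one stem can be wrapped the same way).
calls = []

def toy_stem(word):
    calls.append(word)  # record every real (uncached) stemming call
    return word[:-1] if word.endswith("s") else word

cached_stem = make_cached_stemmer(toy_stem)

tokens = ["cats", "dogs", "cats", "cats", "dog"]
stems = [cached_stem(t) for t in tokens]
print(stems)
print("real stemmer calls:", len(calls))
```

In the vectorizer, the analyzer's lambda would then call the cached function instead of the raw stemmer. With `maxsize=None` the cache grows with the vocabulary, which trades memory for speed.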