Use sklearn TfidfVectorizer with already tokenized inputs?
Problem description
I have a list of tokenized sentences and would like to fit a TfidfVectorizer. I tried the following:
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one'], ['this', 'is', 'another']]

def identity_tokenizer(text):
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english')
tfidf.fit_transform(tokenized_list_of_sentences)
The error message is
AttributeError: 'list' object has no attribute 'lower'
Is there a way to do this? I have a billion sentences and do not want to tokenize them again; they were already tokenized in an earlier stage.
Recommended answer
Try initializing the TfidfVectorizer object with the parameter lowercase=False (assuming this is actually desired because your tokens were already lowercased in a previous stage).
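For context, the original error comes from the vectorizer's default preprocessing step, which calls .lower() on each document before the tokenizer ever runs, so a list of tokens fails immediately. A small sketch of that behavior (assuming scikit-learn is installed; build_preprocessor is the vectorizer's own helper for building this step):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# With the defaults (lowercase=True), the preprocessor lowercases
# each document before tokenization.
prep = TfidfVectorizer().build_preprocessor()

print(prep("Hello World"))  # strings are fine: 'hello world'

try:
    prep(["hello", "world"])  # a token list has no .lower()
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'lower'
```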
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

def identity_tokenizer(text):
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
tfidf.fit_transform(tokenized_list_of_sentences)
Note that I changed the sentences, since the originals contained only stop words, which caused a different error due to an empty vocabulary.
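To confirm the fix works as described, one can inspect the fitted vocabulary; a quick check (a sketch, assuming scikit-learn is installed and using the same data as above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'],
                               ['this', 'is', 'a', 'football']]

def identity_tokenizer(text):
    # The input is already a list of tokens; pass it through unchanged.
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer,
                        stop_words='english', lowercase=False)
matrix = tfidf.fit_transform(tokenized_list_of_sentences)

# 'this', 'is', 'one', and 'a' are English stop words, so only the
# content words survive in the vocabulary.
print(sorted(tfidf.vocabulary_))  # ['basketball', 'football']
print(matrix.shape)               # (2, 2)
```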