Use sklearn TfidfVectorizer with already tokenized inputs?
Problem description
I have a list of tokenized sentences and would like to fit a TfidfVectorizer. I tried the following:
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one'], ['this', 'is', 'another']]

def identity_tokenizer(text):
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english')
tfidf.fit_transform(tokenized_list_of_sentences)
The error message is
AttributeError: 'list' object has no attribute 'lower'
Is there a way to do this? I have a billion sentences and do not want to tokenize them again; they were already tokenized in an earlier stage.
Recommended answer
Try initializing the TfidfVectorizer object with the parameter lowercase=False (assuming this is actually desired because your tokens were already lowercased in a previous stage).
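For context, the original error comes from the vectorizer's default preprocessing step, which calls .lower() on each document before the tokenizer ever runs, so a list of tokens fails immediately. A small sketch of that behavior (assuming scikit-learn is installed; build_preprocessor is the vectorizer's own helper for building this step):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# With the defaults (lowercase=True), the preprocessor lowercases
# each document before tokenization.
prep = TfidfVectorizer().build_preprocessor()

print(prep("Hello World"))  # strings are fine: 'hello world'

try:
    prep(["hello", "world"])  # a token list has no .lower()
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'lower'
```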
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

def identity_tokenizer(text):
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
tfidf.fit_transform(tokenized_list_of_sentences)
Note that I changed the sentences, since the originals contained only stop words, which caused a different error due to an empty vocabulary.
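To confirm the fix works as described, one can inspect the fitted vocabulary; a quick check (a sketch, assuming scikit-learn is installed and using the same data as above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'],
                               ['this', 'is', 'a', 'football']]

def identity_tokenizer(text):
    # The input is already a list of tokens; pass it through unchanged.
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer,
                        stop_words='english', lowercase=False)
matrix = tfidf.fit_transform(tokenized_list_of_sentences)

# 'this', 'is', 'one', and 'a' are English stop words, so only the
# content words survive in the vocabulary.
print(sorted(tfidf.vocabulary_))  # ['basketball', 'football']
print(matrix.shape)               # (2, 2)
```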