Use sklearn TfidfVectorizer with already tokenized inputs?


Question

I have a list of tokenized sentences and would like to fit a TF-IDF vectorizer. I tried the following:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one'], ['this', 'is', 'another']]

def identity_tokenizer(text):
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english')
tfidf.fit_transform(tokenized_list_of_sentences)

which raises the error:

AttributeError: 'list' object has no attribute 'lower'

Is there a way to do this? I have a billion sentences and do not want to tokenize them again; they were already tokenized for an earlier stage of the pipeline.

Answer

Try initializing the TfidfVectorizer object with the parameter lowercase=False (assuming this is actually what you want, since you lowercased your tokens in a previous stage). By default the vectorizer calls .lower() on each document before tokenizing, which fails when the "document" is already a list of tokens rather than a string.

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

def identity_tokenizer(text):
    # The input is already a list of tokens, so return it unchanged.
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
tfidf.fit_transform(tokenized_list_of_sentences)

Note that I changed the sentences: the originals contained only stop words, which caused a different error due to an empty vocabulary.
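As an alternative (not in the original answer), if you do not need stop-word filtering you can pass the identity function as `analyzer` instead of `tokenizer`. A callable analyzer bypasses preprocessing, tokenization, and n-gram extraction entirely, so neither `lowercase=False` nor a custom tokenizer is needed. A minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

# A callable analyzer receives each document (here, a token list) and must
# return the final list of features; no lowercasing or splitting is applied.
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
X = tfidf.fit_transform(tokenized)

print(sorted(tfidf.vocabulary_))  # every token becomes a feature
print(X.shape)
```

Note that with a callable analyzer, `stop_words` is ignored, so every token ends up in the vocabulary. Also, a lambda cannot be pickled; use a named module-level function instead if you plan to serialize the fitted vectorizer.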

