How can I use a list of lists, or a list of sets, for the TfidfVectorizer?


Question

I'm using the sklearn TfidfVectorizer for text-classification.

I know this vectorizer wants raw text as input, but using a list works (see input1).

However, if I pass a list of lists (or a list of sets), I get the following AttributeError.

Does anyone know how to tackle this problem? Thanks in advance!

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
    input1 = ["This", "is", "a", "test"]
    input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]

    print(vectorizer.fit_transform(input1))  # works
    print(vectorizer.fit_transform(input2))  # raises AttributeError

input 1:
  (3, 0)    1.0

input 2:

    Traceback (most recent call last):
      File "", line 1, in
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
        self.fixed_vocabulary_)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
        for feature in analyze(doc):
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 266, in
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 232, in
        return lambda x: strip_accents(x.lower())
    AttributeError: 'list' object has no attribute 'lower'

Recommended answer

Note that input1 works, but it treats each element of the list (each string) as a separate document to vectorize.
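To make the point about input1 concrete, here is a small sketch that inspects what the vectorizer actually learned. It reuses the settings from the question; the variable name `X1` is only illustrative. With `stop_words="english"` and the default token pattern (which, as far as I know, ignores single-character tokens), only "test" should survive as a feature, which matches the single `(3, 0)` entry in the output above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same settings as in the question.
vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
input1 = ["This", "is", "a", "test"]

# Each string in the list is treated as its own document,
# so this fits four one-word documents, not one four-word sentence.
X1 = vectorizer.fit_transform(input1)

print(X1.shape)                # (number of documents, number of features)
print(vectorizer.vocabulary_)  # mapping from learned term to column index
```

So the call "works" only in the sense that it does not raise; the resulting matrix has one row per word rather than one row per sentence.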

In the case of input2, I assume you want to vectorize each "sentence" (each sublist) as one document. One solution is to join the tokens of each sublist back into a string with a list comprehension:

    input2_corrected = [" ".join(x) for x in input2]

This yields

    ['This is a test', 'It is raining today']

which no longer raises the AttributeError when passed to the vectorizer.
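Putting the pieces together, the following is a minimal sketch of two ways to feed token lists (or sets) to the vectorizer. The first is the join approach from the answer. The second is an alternative not mentioned in the answer: passing a callable as `analyzer`, so that each document is used as an already tokenized sequence. The helper `identity` and the variable names are made up for illustration. Note that with a callable analyzer, as I understand it, the built-in lowercasing, tokenization and stop-word filtering are skipped, so the two options do not produce the same vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]

# Option 1 (from the answer): join each sublist back into one string
# and let the vectorizer do its usual preprocessing.
docs = [" ".join(tokens) for tokens in input2]
joined_vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
X_joined = joined_vectorizer.fit_transform(docs)

# Option 2 (an assumption-labelled alternative): treat each document as
# pre-tokenized by passing a callable analyzer that returns it unchanged.
def identity(tokens):
    """Hypothetical helper: return the token collection as-is."""
    return tokens

token_vectorizer = TfidfVectorizer(analyzer=identity)
X_tokens = token_vectorizer.fit_transform(input2)

# The same callable analyzer also accepts a list of sets, since the
# vectorizer only needs to iterate over each document's tokens.
input_sets = [set(tokens) for tokens in input2]
X_sets = TfidfVectorizer(analyzer=identity).fit_transform(input_sets)

print(X_joined.shape, sorted(joined_vectorizer.vocabulary_))
print(X_tokens.shape, sorted(token_vectorizer.vocabulary_))
print(X_sets.shape)
```

Joining is the simpler choice when the default preprocessing is wanted; the callable analyzer is worth considering when the tokens have already been cleaned and should be kept exactly as they are. For sets, keep in mind that duplicates are already gone, so term frequencies within a document are all 1.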
