TFIDF Vectorizer giving error


Question

I am trying to carry out text classification on a set of files using TFIDF and SVM. The features are to be selected three words at a time. My data files are already in the format: angel eyes has, each one for, on its own. There are no stop words, and lemmatization or stemming cannot be applied. I want each feature to be selected as a whole phrase, e.g. angel eyes has ... The code that I have written is given below:

import os
import sys
import numpy
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from sklearn.datasets import load_files
from sklearn.cross_validation import train_test_split

dt=load_files('C:/test4',load_content=True)
d= len(dt)
print dt.target_names
X, y = dt.data, dt.target
print y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print y_train
vectorizer = CountVectorizer()
z= vectorizer.fit_transform(X_train)
tfidf_vect= TfidfVectorizer(lowercase= True, tokenizer=',', max_df=1.0, min_df=1, max_features=None, norm=u'l2', use_idf=True, smooth_idf=True, sublinear_tf=False)


X_train_tfidf = tfidf_vect.fit_transform(z)

print tfidf_vect.get_feature_names()
svm_classifier = LinearSVC().fit(X_train_tfidf, y_train)

Unfortunately, I get an error at "X_train_tfidf = tfidf_vect.fit_transform(z)": AttributeError: lower not found.
If I modify the code to do

tfidf_vect= TfidfVectorizer( tokenizer=',', use_idf=True, smooth_idf=True, sublinear_tf=False)
print "okay2"
#X_train_tfidf = tfidf_transformer.fit_transform(z)
X_train_tfidf = tfidf_vect.fit_transform(X_train)
print X_train_tfidf.getfeature_names()

I get the error: TypeError: 'str' object is not callable. Can someone please tell me where I am going wrong?

Recommended answer

The input to the tokenizer parameter must be a callable. Try defining a function that tokenizes your data appropriately. If it is comma-delimited, then:

def tokens(x):
    return x.split(',')

should work.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect= TfidfVectorizer( tokenizer=tokens ,use_idf=True, smooth_idf=True, sublinear_tf=False)
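To illustrate why tokenizer=',' fails while tokenizer=tokens works, here is a minimal sketch using only plain Python. The vectorizer calls whatever is passed as tokenizer on each document, and a str cannot be called. The whitespace stripping and the filtering of empty pieces are assumptions added here, not part of the answer; they may be useful because the data format in the question ("angel eyes has, each one for, on its own") has a space after each comma, which a bare split(',') would keep inside the phrase.

```python
def tokens(text):
    """Split a comma-delimited document into trimmed, non-empty phrases.

    The strip() and the empty-piece filter are additions for illustration;
    the original answer uses a plain text.split(',').
    """
    return [part.strip() for part in text.split(',') if part.strip()]


# A str is not callable, so calling it on a document raises
# "TypeError: 'str' object is not callable". A function is callable.
print(callable(','))
print(callable(tokens))

# Sample line in the format described in the question.
doc = 'angel eyes has, each one for, on its own'
phrases = tokens(doc)
print(phrases)
```

Each returned phrase then becomes one feature, which is the "three words at a time" behaviour the question asks for.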

Create a list containing a comma-delimited string:

a=['cat on the,angel eyes has,blue red angel,one two blue,blue whales eat,hot tin roof']

tfidf_vect.fit_transform(a)
tfidf_vect.get_feature_names()

returns

Out[73]:

[u'angel eyes has',
 u'blue red angel',
 u'blue whales eat',
 u'cat on the',
 u'hot tin roof',
 u'one two blue']
