AttributeError: 'int' object has no attribute 'lower' in TFIDF and CountVectorizer


Problem description

I am trying to predict the class of incoming messages written in Persian. I use Tfidf and Naive Bayes to classify the input data. Here is my code:

import pandas as pd
df=pd.read_excel('dataset.xlsx')
col=['label','body']
df=df[col]
df.columns=['label','body']
df['class_type'] = df['label'].factorize()[0]
class_type_df=df[['label','class_type']].drop_duplicates().sort_values('class_type')
class_type_id = dict(class_type_df.values)
id_to_class_type = dict(class_type_df[['class_type', 'label']].values)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
features=tfidf.fit_transform(df.body).toarray()
classtype=df.class_type
print(features.shape)
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB 
X_train,X_test,y_train,y_test=train_test_split(df['body'],df['label'],random_state=0)
cv=CountVectorizer()
X_train_counts=cv.fit_transform(X_train)
tfidf_transformer=TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
print(clf.predict(cv.transform(["خريد و فروش لوازم آرايشي از بانه"])))

When I run the code above, it throws the following exception, whereas I expect it to output the "ads" class:

Traceback (most recent call last):
  File ".../multiclass-main.py", line 27, in <module>
    X_train_counts=cv.fit_transform(X_train)
  File "...\sklearn\feature_extraction\text.py", line 1012, in fit_transform
    self.fixed_vocabulary_)
  File "...sklearn\feature_extraction\text.py", line 922, in _count_vocab
    for feature in analyze(doc):
  File "...sklearn\feature_extraction\text.py", line 308, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "...sklearn\feature_extraction\text.py", line 256, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'int' object has no attribute 'lower'

How can I use Tfidf and CountVectorizer in this project?

Recommended answer

The error is AttributeError: 'int' object has no attribute 'lower'. An integer has no lower() method, so it cannot be lowercased. Somewhere during vectorization, the code tries to lowercase an integer object, which is not possible.

Why does this happen?

The CountVectorizer constructor has a parameter lowercase, which is True by default. When you call .fit_transform(), it tries to lowercase every document in the input, so it fails as soon as it reaches one that is an integer. In other words, at least one item in your input data is an integer object rather than a string. For example, your data looks something like this:

 corpus = ['sentence1', 'sentence 2', 12930, 'sentence 100']

Passing the list above to CountVectorizer raises this exception.
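The failure can be reproduced without scikit-learn. The following is a minimal sketch in plain Python: lowercase_all is a hypothetical stand-in for the vectorizer's lowercasing step, not scikit-learn's actual implementation. It only illustrates that calling .lower() on an integer raises the same AttributeError.

```python
corpus = ['sentence1', 'sentence 2', 12930, 'sentence 100']


def lowercase_all(docs):
    # Hypothetical stand-in for the lowercasing step that a vectorizer
    # applies when lowercase=True; it calls .lower() on every document.
    return [doc.lower() for doc in docs]


try:
    lowercase_all(corpus)
    error_message = None
except AttributeError as exc:
    # The integer 12930 has no .lower() method, so we end up here.
    error_message = str(exc)

print(error_message)
```

The strings are processed without complaint; the exception is raised only when the loop reaches the integer 12930.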

How to fix it?

Here are some possible solutions to avoid this problem:

1) Convert all rows in your corpus to string objects:

corpus = ['sentence1', 'sentence 2', 12930, 'sentence 100']
corpus = [str(item) for item in corpus]

2) Remove the integers from your corpus:

corpus = ['sentence1', 'sentence 2', 12930, 'sentence 100']
corpus = [item for item in corpus if not isinstance(item, int)]
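Before choosing between the two options, it can help to see which items are actually not strings. The sketch below uses only the standard library. find_non_strings is a hypothetical helper introduced here for illustration. It reports the position and type of every non-string item, then applies option 1.

```python
corpus = ['sentence1', 'sentence 2', 12930, 'sentence 100']


def find_non_strings(docs):
    # Return (index, type name) for every item that is not a str.
    return [(i, type(doc).__name__)
            for i, doc in enumerate(docs)
            if not isinstance(doc, str)]


offenders = find_non_strings(corpus)
print(offenders)

# Option 1 from above: convert every item to a string.
cleaned = [str(doc) for doc in corpus]
print(find_non_strings(cleaned))
```

For a pandas column such as df['body'] in the question, df['body'].astype(str) performs the same conversion, assuming that column is where the integers come from. Note that astype(str) also turns missing values into the literal string 'nan', so empty cells may need to be dropped or filled first.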
