python sklearn pipeline fit: "AttributeError: lower not found"


Problem description

I'm trying to classify several text documents into 3 categories using sklearn, but when I run the code I get

"AttributeError: lower not found"

Code:

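import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train, test = train_test_split(df, random_state=42, test_size=0.3, shuffle=True)
X_train = train.contents
X_test = test.contents
Y_train = train.category
Y_test = test.category

clf_svc = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfVectorizer(tokenizer=',', use_idf=True, stop_words="english")),
                    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
                    ])

# The error is raised here, during fit
clf_svc = clf_svc.fit(X_train, Y_train)
predicted_svc = clf_svc.predict(X_test)
print(np.mean(predicted_svc == Y_test))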

The DataFrame (df) has 2 columns: contents (long text) and category (text labels). The contents are scraped texts, each containing tens or hundreds of words; the categories are single words such as "A" or "B".

I've already checked past questions on Stack Overflow, but I could not resolve this error.
I'd be very glad to know the solution, or any problems in the code itself.
Any advice and answers will be greatly appreciated.

Thanks in advance.

Recommended answer

Either remove the step ('vect', CountVectorizer()) or use TfidfTransformer instead of TfidfVectorizer. TfidfVectorizer expects an array of strings as input, whereas CountVectorizer() returns a matrix of occurrences (i.e. a numeric matrix).

By default, TfidfVectorizer(..., lowercase=True) tries to lowercase every input string. Here its inputs are rows of a numeric matrix, which have no lower method, hence the "AttributeError: lower not found" error message.
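To see the mismatch in isolation, here is a minimal sketch that feeds the output of CountVectorizer into TfidfVectorizer outside of any pipeline. The three sample sentences and the variable names are made up for illustration, and the exact wording of the error message may differ between scipy versions, so the code only records it rather than assuming a particular text.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical sample documents, only for illustration
docs = [
    "The cat sat on the mat",
    "Dogs chase the cat",
    "Birds sing at dawn",
]

# CountVectorizer turns strings into a sparse matrix of token counts
counts = CountVectorizer().fit_transform(docs)
print(type(counts).__name__, counts.shape)

# TfidfVectorizer expects raw strings; given matrix rows, it fails
# when it tries to call .lower() on each "document"
try:
    TfidfVectorizer().fit_transform(counts)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```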

Also, the tokenizer parameter expects either a callable (a function) or None, so don't pass a string such as ','; simply leave it unspecified.
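Both fixes suggested above can be sketched roughly as follows. The toy training data stands in for train.contents and train.category and is invented for this example, as are the variable names clf_counts_tfidf and clf_tfidf_only; random_state=0 is added only to make the sketch repeatable. In the first variant stop_words moves to CountVectorizer, since TfidfTransformer works on counts and has no such parameter.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for train.contents / train.category
X_train = [
    "stocks fell as markets reacted to the rate decision",
    "the central bank raised interest rates again",
    "the striker scored twice in the final match",
    "the team won the league after a tense game",
    "the new phone ships with a faster processor",
    "researchers released an open source compiler",
]
Y_train = ["A", "A", "B", "B", "C", "C"]

# Option 1: keep CountVectorizer, then reweight the counts with TfidfTransformer
clf_counts_tfidf = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("clf", OneVsRestClassifier(LinearSVC(random_state=0), n_jobs=1)),
])

# Option 2: drop CountVectorizer and let TfidfVectorizer read the raw strings
# (no tokenizer argument, so the default tokenization is used)
clf_tfidf_only = Pipeline([
    ("tfidf", TfidfVectorizer(use_idf=True, stop_words="english")),
    ("clf", OneVsRestClassifier(LinearSVC(random_state=0), n_jobs=1)),
])

clf_counts_tfidf.fit(X_train, Y_train)
clf_tfidf_only.fit(X_train, Y_train)

# Predictions come from the pipeline's predict method
predicted_a = clf_counts_tfidf.predict(X_train)
predicted_b = clf_tfidf_only.predict(X_train)
print(predicted_a)
print(predicted_b)
```

Either variant should fit without the AttributeError, because in both cases the step that lowercases text receives strings rather than a numeric matrix.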
