Problems loading textual data with scikit-learn?


Question

I'm using my own data and want to classify it into two categories, so I have:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the text data
categories = [
    'CLASS_1',
    'CLASS_2',
]

text_train_subset = load_files('train',
    categories=categories)

text_test_subset = load_files('test',
    categories=categories)

# Turn the text documents into vectors of word frequencies
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(text_train_subset)
y_train = text_train_subset.target


classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set
X_test = vectorizer.transform(text_test_subset.data)
y_test = text_test_subset.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))

For the above code, and following the documentation, I have this directory structure:

data_folder/

    train_folder/
        CLASS_1.txt CLASS_2.txt
    test_folder/
        test.txt

Then I get this error:

    % (size, n_samples))
ValueError: Found array with dim 0. Expected 5

I also tried fit_transform, but the result is the same. How can I solve this dimension problem?

Recommended answer

The first problem is that your directory structure is wrong. It needs to look like this:

container_folder/
    CLASS_1_folder/
        file_1.txt, file_2.txt ... 
    CLASS_2_folder/
        file_1.txt, file_2.txt, ....

Both the training set and the test set need to follow this directory structure. Alternatively, you can put all the data in one directory and use train_test_split to split it in two.
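
The single-directory alternative can be sketched as follows. This is a minimal illustration, not the asker's actual setup: the toy documents, the temporary `container_folder`, and the 25% test size are all assumptions made so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: one sub-folder per class, one file per document.
docs = {
    "CLASS_1": ["the cat sat on the mat", "a cat and a dog",
                "dogs chase cats", "my cat sleeps"],
    "CLASS_2": ["stocks fell sharply today", "the market rallied",
                "bond prices rose", "investors sold stocks"],
}

with tempfile.TemporaryDirectory() as tmp:
    container = Path(tmp) / "container_folder"
    for label, texts in docs.items():
        class_dir = container / label
        class_dir.mkdir(parents=True)
        for i, text in enumerate(texts):
            (class_dir / f"file_{i}.txt").write_text(text, encoding="utf-8")

    # File contents are read into memory here, while the folder still exists.
    dataset = load_files(str(container), encoding="utf-8")

# Hold out a quarter of the documents, keeping the class proportions.
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target,
    test_size=0.25, stratify=dataset.target, random_state=0)

# Fit the vocabulary on the training documents only, then reuse it.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(docs_train)
X_test = vectorizer.transform(docs_test)

classifier = MultinomialNB().fit(X_train, y_train)
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
```

Splitting the raw documents before vectorizing keeps the test documents out of the vocabulary, which matches how the separate train and test folders would behave.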

Second,

X_train = vectorizer.fit_transform(text_train_subset)

needs to become

X_train = vectorizer.fit_transform(text_train_subset.data) # added .data
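
A plausible reason the missing .data matters: load_files returns a dict-like Bunch, and iterating over a dict yields its keys, so the vectorizer treats each key name as a document. The sketch below uses a hand-built Bunch as a stand-in for an empty load_files result; the exact set of keys is an assumption and may differ between scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils import Bunch

# Hand-built stand-in (an assumption) for what load_files returns
# when it finds no documents in the expected folder layout.
empty_dataset = Bunch(data=[], filenames=[], target_names=[],
                      target=[], DESCR=None)

# Iterating over a Bunch yields its keys, as with any dict.
keys_as_documents = list(empty_dataset)
print(keys_as_documents)

# Passing the Bunch itself makes the vectorizer see one short
# "document" per key instead of the real text files.
X_wrong = CountVectorizer().fit_transform(empty_dataset)
print(X_wrong.shape[0], "rows, but", len(empty_dataset.target), "labels")
```

With five keys and no labels, the row count and the label count disagree, which would be consistent with the "dim 0. Expected 5" message in the question.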

Here is a complete, working example:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

text_train_subset = load_files('sample-data/web')
text_test_subset = text_train_subset # load your actual test data here

# Turn the text documents into vectors of word frequencies
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(text_train_subset.data)
y_train = text_train_subset.target


classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set
X_test = vectorizer.transform(text_test_subset.data)
y_test = text_test_subset.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))

The directory structure of sample-data/web is:

sample-data/web
├── de
│   ├── apollo8.txt
│   ├── fiv.txt
│   └── habichtsadler.txt
└── en
    ├── elizabeth_needham.txt
    ├── equipartition_theorem.txt
    ├── sunderland_echo.txt
    └── thespis.txt
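
With a layout like this, load_files takes the class names from the sub-folder names and stores each document's class as an index into target_names. The following sketch shows how to check that mapping; the temporary folder, file names, and file contents are made up for illustration and are not the real sample-data/web files.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical stand-in for sample-data/web: names and contents are made up.
layout = {
    "de": {"a.txt": "ein kurzer deutscher Text",
           "b.txt": "noch ein deutscher Satz"},
    "en": {"c.txt": "a short English text",
           "d.txt": "another English sentence"},
}

with tempfile.TemporaryDirectory() as tmp:
    for folder, files in layout.items():
        (Path(tmp) / folder).mkdir()
        for name, text in files.items():
            (Path(tmp) / folder / name).write_text(text, encoding="utf-8")
    dataset = load_files(tmp, encoding="utf-8")

# Class names come from the sub-folder names, in sorted order.
print(dataset.target_names)

# Each entry in target is an index into target_names.
for path, label in zip(dataset.filenames, dataset.target):
    print(Path(path).name, "->", dataset.target_names[label])
```

Printing this mapping once is a quick way to confirm that the folders were picked up as classes before training anything.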
