Load classified data from CSV to Scikit-Learn for machine learning

Question

I'm learning Scikit-Learn in order to classify tweets. I have a CSV with tweets in one column and their class (0-11) in the next column. I went through this tutorial from the Scikit-Learn site, and I think I understand how the actual classification is done, but I don't think I really understood the data format. In the tutorial, the material was in files in folders, where the folder names acted as classification labels.

In my case I need to load the data from a CSV file, and apparently I have to manually construct the data structure that is fed to the vectorizer and classifier. How should I approach this? I think the tutorial was a bit ambiguous in this respect, since the data loading was done automagically, which left me in the dark about the structure and loading of custom data.

Recommended answer

Normally you would use pandas.read_csv. If you don't want a pandas dependency, you could use numpy.genfromtxt, or even load the CSV into a list using the standard library's csv module. With pandas it would look like this:

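import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the CSV; header=None plus names= supplies column names when the file has no header row
df = pd.read_csv('example.csv', header=None, sep=',',
                 names=['tweets', 'class'])
vect = TfidfVectorizer()
X = vect.fit_transform(df['tweets'])  # sparse TF-IDF matrix, one row per tweet
y = df['class']                       # class labels (0-11), aligned with the rows of X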
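
If you would rather avoid the pandas dependency, the standard-library route mentioned above might look roughly like the sketch below. The helper name load_tweets and the in-memory sample are made up for illustration, and the code assumes the tweet is in the first column and an integer class in the second.

```python
import csv
import io


def load_tweets(csv_file):
    """Read (tweet, label) rows from an open CSV file into two parallel lists."""
    tweets, labels = [], []
    for row in csv.reader(csv_file):
        if len(row) < 2:  # skip blank or malformed rows
            continue
        tweets.append(row[0])
        labels.append(int(row[1]))
    return tweets, labels


# Hypothetical in-memory sample standing in for example.csv;
# the second tweet is quoted because it contains a comma.
sample = io.StringIO('great phone,0\n"late again, thanks airline",1\n')
tweets, labels = load_tweets(sample)
```

For a real file you would pass an open file object instead, for example open('example.csv', newline='', encoding='utf-8'). The tweets list can then go to the vectorizer in place of df['tweets'], and labels takes the place of y.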

Once you have your X and y, you can feed them to a classifier.
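
As a rough illustration of that last step, here is a minimal sketch. The answer does not name a classifier, so MultinomialNB is just one assumed choice, and the six sample tweets with classes 0 and 1 are invented stand-ins for the real CSV.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented sample data standing in for example.csv (tweet text, class label)
csv_text = io.StringIO(
    "i love this phone,0\n"
    "best day ever,0\n"
    "what a great match,0\n"
    "my flight is late again,1\n"
    "worst service ever,1\n"
    "i hate waiting in line,1\n"
)
df = pd.read_csv(csv_text, header=None, names=['tweets', 'class'])

# Learn the vocabulary and build the TF-IDF matrix from the training tweets
vect = TfidfVectorizer()
X = vect.fit_transform(df['tweets'])
y = df['class']

# Assumed classifier choice; any scikit-learn estimator with fit/predict would slot in here
clf = MultinomialNB()
clf.fit(X, y)

# New text must go through the SAME fitted vectorizer (transform, not fit_transform)
new_tweets = ["i love this match", "worst flight ever"]
predicted = clf.predict(vect.transform(new_tweets))
```

The point to notice is that new tweets are passed through vect.transform rather than fit_transform, so they are mapped onto the vocabulary learned from the training data.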
