R text mining documents from CSV file (one row per doc)

Question

I am trying to work with the tm package in R, and have a CSV file of customer feedback in which each line is a different instance of feedback. I want to import all of this feedback into a corpus, with each line as a separate document within the corpus, so that I can compare the feedback in a DocumentTermMatrix. There are over 10,000 rows in my data set.

Initially I did the following:

fdbk_corpus <- Corpus(VectorSource(fdbk), readerControl = list(language = "eng"), sep = "\t")

This creates a corpus with 1 document and >10,000 rows, and I want >10,000 docs with 1 row each.

I imagine I could just have 10,000+ separate CSV or TXT documents within a folder and create a corpus from that... but I'm thinking there is a much simpler answer than that, reading each line as a separate document.

Recommended answer

Here's a complete workflow to get what you want:

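library(tm)

# change this file location to suit your machine
file_loc <- "C:\\Documents and Settings\\Administrator\\Desktop\\Book1.csv"
# change TRUE to FALSE if you have no column headings in the CSV;
# stringsAsFactors = FALSE keeps the feedback as character strings
x <- read.csv(file_loc, header = TRUE, stringsAsFactors = FALSE)
# each row of the data frame becomes one document in the corpus
# (recent tm releases expect columns named doc_id and text here)
corp <- Corpus(DataframeSource(x))
# rows of the matrix are documents, columns are terms
dtm <- DocumentTermMatrix(corp)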
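
The workflow above appears to have been written against an older tm release. As far as I know, recent tm versions require the data frame passed to DataframeSource to have a first column named doc_id and a second column named text, so the raw result of read.csv() may be rejected. Below is a minimal sketch of that reshaping, using a small in-memory data frame in place of the CSV. The column name feedback and the sample rows are assumptions made for illustration, not part of the original post.

```r
library(tm)

# Hypothetical stand-in for the data frame returned by read.csv();
# the column name `feedback` and these rows are illustrative assumptions.
x <- data.frame(
  feedback = c("Great service and fast delivery",
               "Delivery was late",
               "Great product"),
  stringsAsFactors = FALSE
)

# Reshape into the layout recent tm releases expect:
# `doc_id` first (unique string per row), `text` second.
docs <- data.frame(
  doc_id = as.character(seq_len(nrow(x))),
  text = x$feedback,
  stringsAsFactors = FALSE
)

# One document per data frame row
corp <- Corpus(DataframeSource(docs))
dtm <- DocumentTermMatrix(corp)

# The corpus should hold as many documents as there were rows
length(corp)
```

Any further columns left in the data frame are, to my understanding, carried along as document-level metadata rather than treated as text.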

In the dtm object each row will be a doc, or a line of your original CSV file. Each column will be a word.
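
An alternative that avoids DataframeSource is to pass only the text column to VectorSource, which treats each element of a vector as one document. Passing a whole data frame instead hands it a list of columns, which may be why the attempt in the question ended up with a single document. The sketch below assumes a hypothetical column named comment and made-up rows, then checks that the matrix has one row per feedback entry.

```r
library(tm)

# Hypothetical feedback data; `comment` is an assumed column name.
fdbk <- data.frame(
  comment = c("Great service and fast delivery",
              "Delivery was late",
              "Great product"),
  stringsAsFactors = FALSE
)

# Pass the character column, not the data frame, so that
# each element (one CSV row) becomes its own document.
corp <- Corpus(VectorSource(fdbk$comment))
dtm <- DocumentTermMatrix(corp)

# Rows are documents, columns are terms
dim(dtm)

# Terms that occur at least twice across the whole corpus
findFreqTerms(dtm, lowfreq = 2)
```

With the real data set, dim(dtm) should report one row per line of the CSV, which is a quick way to confirm the import did what was intended.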
