Error faced while using TM package's VCorpus in R
Problem description
I am getting the error below while working with the tm package in R.
library("tm")
Loading required package: NLP
Warning messages:
1: package ‘tm’ was built under R version 3.4.2
2: package ‘NLP’ was built under R version 3.4.1
corpus <- VCorpus(DataframeSource(data))
Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
I have tried several things, such as reinstalling the package and updating to a newer version of R, but the error persists. With the same data file, the same code runs on another system that has the same version of R.
Recommended answer
I ran into the same problem when I updated the tm package to version 0.7-2.
I looked up the details of DataframeSource(), which state:
The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text".
Details
A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a "UTF-8" encoded string representing the document's content. Optional additional columns are used as document level metadata.
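The condition in the error message is the same requirement described here, so it can be run by hand on a data frame before passing it to DataframeSource(). A minimal sketch, assuming a small made-up data frame named `data` whose column names `id` and `title` are hypothetical:

```r
# Hypothetical example data; the column names are illustrative only.
data <- data.frame(id = c("1", "2"),
                   title = c("first document", "second document"),
                   stringsAsFactors = FALSE)

# The condition from the error message: both names must be present.
has_required <- all(!is.na(match(c("doc_id", "text"), names(data))))
print(has_required)

# List which of the required column names are missing.
missing_cols <- setdiff(c("doc_id", "text"), names(data))
print(missing_cols)
```

If `has_required` is FALSE, the data frame lacks at least one of the two expected column names, which would explain why the same code can work on a machine with an older tm version and fail on a newer one.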
I solved it with the following code:
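# Read the CSV, keeping text columns as character rather than factor
df_cmp <- read.csv("test_file.csv", stringsAsFactors = FALSE)
# Build a data frame with the column names DataframeSource() expects
df_title <- data.frame(doc_id = row.names(df_cmp),
                       text = df_cmp$English.title,
                       stringsAsFactors = FALSE)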
You can try renaming your columns to doc_id and text.
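If the data frame already has an identifier column and a text column under other names, renaming them in place is one way to do this. A minimal sketch, assuming hypothetical columns `id`, `title`, and `year`; the tm call is guarded so the snippet still runs where tm is not installed:

```r
# Hypothetical example data; replace with your own data frame.
data <- data.frame(id = c("1", "2"),
                   title = c("first document", "second document"),
                   year = c(2016, 2017),
                   stringsAsFactors = FALSE)

# Rename the hypothetical columns to the names DataframeSource() expects.
names(data)[names(data) == "id"] <- "doc_id"
names(data)[names(data) == "title"] <- "text"

# Put doc_id and text first; any remaining columns follow them.
other_cols <- setdiff(names(data), c("doc_id", "text"))
data <- data[, c("doc_id", "text", other_cols)]

# Build the corpus only if tm is available.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  corpus <- VCorpus(DataframeSource(data))
  print(length(corpus))
}
```

Per the documentation quoted above, columns after doc_id and text are treated as document-level metadata, so the extra `year` column does not need to be dropped.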