How can I manually set the document ID in a corpus?

Question

I am creating a Corpus from a data frame. I pass it as a VectorSource because there is only one column I want to use as the text source. This works fine, but I need the document ids within the corpus to match the document ids from the data frame. The document ids are stored in a separate column of the original data frame.

df <- as.data.frame(t(rbind(c(1, 3, 5, 7, 8, 10),
                            c("text", "lots of text", "too much text", "where will it end", "give peas a chance", "help"))))
colnames(df) <- c("ids","textColumn")
library("tm")
library("lsa")
corpus <- Corpus(VectorSource(df[["textColumn"]]))

Running this code creates a corpus, but the document ids run from 1 to 6. Is there any way to create the corpus with the document ids 1, 3, 5, 7, 8, 10?

Recommended answer

One simple, though not very elegant, way to assign your ids to the documents after the corpus has been created is the following:

# Copy each id from the data frame onto the matching document
for (i in seq_along(corpus)) {
   attr(corpus[[i]], "ID") <- df$ids[i]
}
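
As a possible alternative, recent tm releases can take the ids directly when the corpus is built, through DataframeSource, which expects a data frame whose first two columns are named doc_id and text. The sketch below is not part of the original answer: it assumes such a tm version is installed, rebuilds the question's data with data.frame() rather than t(rbind(...)), and the helper names docs and doc_ids are made up for illustration.

```r
library(tm)

# Assumed rebuild of the question's data frame (same values, plain columns)
df <- data.frame(
  ids = c(1, 3, 5, 7, 8, 10),
  textColumn = c("text", "lots of text", "too much text",
                 "where will it end", "give peas a chance", "help"),
  stringsAsFactors = FALSE
)

# DataframeSource requires the first column to be doc_id and the second text
docs <- data.frame(
  doc_id = as.character(df$ids),
  text = df$textColumn,
  stringsAsFactors = FALSE
)

# Build the corpus; each document takes its id from the doc_id column
corpus <- VCorpus(DataframeSource(docs))

# Read the ids back from the document-level metadata to check them
doc_ids <- sapply(seq_along(corpus), function(i) meta(corpus[[i]], "id"))
print(doc_ids)
```

With this approach the ids are stored as character strings, so compare them against as.character(df$ids) rather than the numeric column. If the corpus has already been built from a VectorSource, a loop that assigns meta(corpus[[i]], "id") on a VCorpus is the closer analogue of the attr() loop above.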
