Keep document ID with R corpus


Problem description

I have searched Stack Overflow and the web and can only find partial solutions, or solutions that no longer work due to changes in tm or qdap. The problem is below:

I have a data frame with two columns, ID and Text (a simple document ID/name and then some text).

I have two questions:

Part 1: How can I create a tdm or dtm and maintain the document name/ID? inspect(tdm) only shows "character(0)".
Part 2: I want to keep only a specific list of terms, i.e. the opposite of removing custom stopwords. I want this to happen in the corpus, not in the tdm/dtm.

For Part 2, I used a solution I got here: How to implement proximity rules in tm dictionary for counting words?

That one operates on the tdm, though. Is there a better solution for Part 2 where you use something like "tm_map(my.corpus, keepOnlyWords, customlist)"?

Any help will be greatly appreciated. Thanks much!

Recommended answer

In newer versions of tm this is a lot easier with the DataframeSource() function.

"A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a "UTF-8" encoded string representing the document's content. Optional additional columns are used as document level metadata."

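The data frame in the question has columns called ID and Text, so it would need reshaping before DataframeSource() accepts it. A minimal sketch of that step in base R, assuming a hypothetical input data frame `df` with made-up contents:

```r
# Hypothetical input shaped like the question's data frame (columns ID and Text).
df <- data.frame(
  ID   = 10:13,
  Text = c("first document", "second document", "third one", "fourth one"),
  stringsAsFactors = FALSE
)

# Build a data frame whose first two columns are named doc_id and text.
dd <- data.frame(
  doc_id = as.character(df$ID),  # identifiers as strings
  text   = enc2utf8(df$Text),    # mark/convert the text as UTF-8
  stringsAsFactors = FALSE
)

# The identifiers must be unique; stop early if they are not.
stopifnot(anyDuplicated(dd$doc_id) == 0)
```

Any further columns of `df` could be appended after `text` if they should travel along as document-level metadata.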
So in this case:

library(tm)

dd <- data.frame(
    doc_id = as.character(10:13),  # unique string identifier per document
    text = c("No wonder, then, that ever gathering volume from the mere transit ",
      "So that in many cases such a panic did he finally strike, that few ",
      "But there were still other and more vital practical influences at work",
      "Not even at the present day has the original prestige of the Sperm Whale"),
    stringsAsFactors = FALSE
)

# Each row becomes one document; doc_id is carried along as the document ID
my.corpus <- VCorpus(DataframeSource(dd))
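To check that the IDs survive, and to sketch one possible approach to Part 2, something like the following may help. It assumes a recent tm version with DataframeSource() support. `keepOnlyWords` is not a tm function: it is a hypothetical helper defined here, and the two-document data frame and word list are made up for illustration.

```r
library(tm)

# Small made-up example so the block runs on its own.
dd <- data.frame(
  doc_id = c("10", "11"),
  text   = c("the sperm whale and the ship", "a panic on the ship"),
  stringsAsFactors = FALSE
)
my.corpus <- VCorpus(DataframeSource(dd))

# Part 1: the ID is stored in each document's metadata and is used
# to label the documents of the document-term matrix.
meta(my.corpus[[1]], "id")
dtm <- DocumentTermMatrix(my.corpus)
Docs(dtm)

# Part 2 (hypothetical helper): keep only the listed words, at corpus level.
# content_transformer() lets the function change the text while the
# document object, including its ID, stays in place.
keepOnlyWords <- content_transformer(function(x, words) {
  tokens <- unlist(strsplit(tolower(x), "[^[:alnum:]]+"))
  paste(tokens[tokens %in% words], collapse = " ")
})

customlist <- c("whale", "ship")
kept <- tm_map(my.corpus, keepOnlyWords, customlist)

content(kept[[1]])
Docs(DocumentTermMatrix(kept))
```

The helper splits on anything that is not a letter or digit and lowercases everything, which is a simplification; a different tokenizer may suit real text better.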
