How to convert corpus to data.frame with meta data in R

Views: 47 | Published: 2021/9/8 20:09:42 | Tags: r, tm

Question

How can I convert a corpus into a data frame in R that also contains the meta data? I already tried the suggestion from "convert corpus into data.frame in R", but the resulting data frame only contains the text lines from all docs in the corpus. I also need the document ID and, if possible, the line number of each text line, in two columns. So, how can I extend this command to get that data: dataframe <- data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE)?

I already tried:

dataframe <-
    data.frame(id=sapply(corpus, meta(corpus, "id")),
               text=unlist(sapply(corpus, `[`, "content")),
               stringsAsFactors=F)

but it didn't help; I only got the error message "Error in match.fun(FUN) : 'meta(corpus, "id")' ist nicht Funktion, Zeichen oder Symbol" (German for "is not a function, character or symbol").

The corpus is extracted from plain text files; here is an example:

> str(corpus)
[...]
$ 1178531510 :List of 2
  ..$ content: chr [1:67] " uberrasch sagt [...] gemacht echt schad verursacht" ...
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2015-08-16 14:44:11"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "1178531510" # <--- This is the ID i want in the data.frame
  .. ..$ language     : chr "de"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
[...]

Thanks a lot :)

Recommended answer

There are two problems: you should not repeat the argument corpus inside sapply (pass the function meta itself, followed by its extra argument "id"), and multi-paragraph texts are turned into character vectors of length > 1, which you should paste together before unlisting.

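# sapply() takes the function itself (meta); "id" is passed on to it as an extra argument.
# Each document's content lines are pasted into a single string before unlist().
dataframe <-
    data.frame(id=sapply(corpus, meta, "id"),
               text=unlist(lapply(sapply(corpus, '[', "content"),paste,collapse="\n")),
               stringsAsFactors=FALSE)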
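
The answer above yields one row per document. The question also asks for the line number of each text line, so here is a minimal base-R sketch of a one-row-per-line variant. It is not part of the original answer. The helper name corpus_lines_to_df and the small docs list are hypothetical: docs is a hand-built stand-in that mirrors the str(corpus) output shown in the question (each document is a list with a content vector and a meta list). With a real tm corpus, the accessors content(doc) and meta(doc, "id") would probably be the safer way to read the same fields.

```r
# Hypothetical stand-in for a tm corpus: a named list of documents, each with a
# `content` character vector and a `meta` list, as in the str() output above.
docs <- list(
  "1178531510" = list(content = c("first line", "second line"),
                      meta = list(id = "1178531510", language = "de")),
  "1178531511" = list(content = "only line",
                      meta = list(id = "1178531511", language = "de"))
)

# Illustrative helper (not part of tm): one row per text line,
# with the document id and the line number within that document.
corpus_lines_to_df <- function(docs) {
  per_doc <- lapply(docs, function(doc) {
    lines <- as.character(doc$content)
    data.frame(id   = rep(doc$meta$id, length(lines)),
               line = seq_along(lines),
               text = lines,
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, per_doc)  # stack the per-document data frames
  rownames(out) <- NULL           # drop the "docname.n" row names rbind creates
  out
}

lines_df <- corpus_lines_to_df(docs)
print(lines_df)
```

Building a small data frame per document and stacking them keeps the id and line columns aligned with the text, which is harder to guarantee when the ids and the unlisted text are computed in separate sapply calls.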
