Creating corpus from multiple txt files


Question

I have multiple txt files, and I want to end up with tidy data. As a first step I create a corpus (I am not sure this is the right way to do it). I wrote the following code to get the corpus data:

folder <- "C:\\Users\\user\\Desktop\\text analysis\\doc"
filelist <- list.files(path = folder, pattern = "\\.txt$")  # only .txt files
filelist <- file.path(folder, filelist)            # build the full path to each file
a <- lapply(filelist, FUN = readLines)             # read each file as a vector of lines
corpus <- lapply(a, FUN = paste, collapse = " ")   # collapse each file into one string

When I check `class(corpus)` it returns `list`. From that point, how can I create tidy data?

Answer

If you have text files and you want tidy data, I would go straight from one to the other and not bother with the tm package in between.

To find all the text files within a working directory, you can use list.files with an argument:

all_txts <- list.files(pattern = "\\.txt$")

The all_txts object will then be a character vector that contains all your filenames.

Then, you can set up a pipe to read in all the text files and unnest them using tidytext with a map function from purrr. You can use a mutate() within the map() to annotate each line with the filename, if you'd like.

library(tidyverse)
library(tidytext)

map_df(all_txts, ~ tibble(txt = read_file(.x)) %>%
        mutate(filename = basename(.x)) %>%
        unnest_tokens(word, txt))
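To show the whole pipeline end to end, here is a minimal, self-contained sketch. It writes two small hypothetical files (`a.txt`, `b.txt`) into a temporary directory so the example runs anywhere; with real data you would point `list.files()` at your own folder instead.

```r
library(tidyverse)
library(tidytext)

# Create a throwaway directory with two tiny demo files (hypothetical content)
doc_dir <- tempfile("docs")
dir.create(doc_dir)
writeLines("hello corpus world", file.path(doc_dir, "a.txt"))
writeLines("tidy text mining", file.path(doc_dir, "b.txt"))

# Find the text files; full.names = TRUE returns complete paths
all_txts <- list.files(doc_dir, pattern = "\\.txt$", full.names = TRUE)

# One row per word, each tagged with the file it came from
tidy_words <- map_df(all_txts, ~ tibble(txt = read_file(.x)) %>%
        mutate(filename = basename(.x)) %>%
        unnest_tokens(word, txt))

count(tidy_words, filename)
```

The result is a tidy data frame with one token per row and a `filename` column, ready for the usual `dplyr` verbs such as `count()` or `anti_join(stop_words)`.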

