Working with large text files in R to create n-grams


Question


I am trying to create trigrams and bigrams from a large (1 GB) text file using the 'quanteda' package in the R programming environment. If I try to run my code in one go (as below), R just hangs (on the third line, myCorpus <- toLower(...)). I used the code successfully on a small dataset (<1 MB), so I guess the file is too large. I can see I perhaps need to load the text in 'chunks' and combine the resulting bigram and trigram frequencies afterwards, but I cannot work out how to load and process the text in manageable 'chunks'. Any advice on an approach to this problem would be very welcome. My code is pasted below; suggestions for other ways to improve it would also be welcome.

library(quanteda)

folder.dataset.english <- 'final/corpus'

# Build the corpus from all .txt files in the folder
myCorpus <- corpus(x = textfile(list.files(path = folder.dataset.english,
                                           pattern = "\\.txt$",
                                           full.names = TRUE,
                                           recursive = FALSE)))

# Lower-case the corpus, keeping acronyms (this is the line R hangs on)
myCorpus <- toLower(myCorpus, keepAcronyms = TRUE)

# Bigrams: document-feature matrix of 2-grams
bigrams <- dfm(myCorpus, ngrams = 2, verbose = TRUE, toLower = TRUE,
               removeNumbers = TRUE, removePunct = TRUE,
               removeSeparators = TRUE, removeTwitter = TRUE, stem = FALSE)
bigrams_freq <- sort(colSums(bigrams), decreasing = TRUE)
bigrams <- data.frame(names = names(bigrams_freq), freq = bigrams_freq,
                      stringsAsFactors = FALSE)
# Split each "w1_w2" feature into its first and last words
bigrams$first <- sapply(strsplit(bigrams$names, "_"), "[[", 1)
bigrams$last  <- sapply(strsplit(bigrams$names, "_"), "[[", 2)
rownames(bigrams) <- NULL
bigrams.freq.freq <- table(bigrams$freq)  # frequency of frequencies
saveRDS(bigrams, "dictionaries/bigrams.rds")

# Trigrams: the same pipeline with 3-grams
trigrams <- dfm(myCorpus, ngrams = 3, verbose = TRUE, toLower = TRUE,
                removeNumbers = TRUE, removePunct = TRUE,
                removeSeparators = TRUE, removeTwitter = TRUE, stem = FALSE)
trigrams_freq <- sort(colSums(trigrams), decreasing = TRUE)
trigrams <- data.frame(names = names(trigrams_freq), freq = trigrams_freq,
                       stringsAsFactors = FALSE)
# First two words joined by "_", plus the final word
trigrams$first <- paste(sapply(strsplit(trigrams$names, "_"), "[[", 1),
                        sapply(strsplit(trigrams$names, "_"), "[[", 2),
                        sep = "_")
trigrams$last <- sapply(strsplit(trigrams$names, "_"), "[[", 3)
rownames(trigrams) <- NULL
saveRDS(trigrams, "dictionaries/trigrams.rds")
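
As a side note, the calls above come from an older quanteda release: textfile() has since moved to the readtext package and toLower() was replaced by tokens_tolower(). A rough, untested sketch of the same pipeline against the current API (assuming quanteda >= 2.0 and the readtext package are installed) would be:

library(quanteda)
library(readtext)

folder.dataset.english <- 'final/corpus'

# readtext() replaces the old quanteda::textfile()
txt <- readtext(file.path(folder.dataset.english, "*.txt"))
myCorpus <- corpus(txt)

# Tokenise once, with the cleaning options that used to live in dfm()
toks <- tokens(myCorpus, remove_numbers = TRUE, remove_punct = TRUE,
               remove_separators = TRUE)
toks <- tokens_tolower(toks, keep_acronyms = TRUE)  # replaces toLower()

# n-grams are now formed on tokens, then counted in a dfm
bigrams_dfm  <- dfm(tokens_ngrams(toks, n = 2))
trigrams_dfm <- dfm(tokens_ngrams(toks, n = 3))
bigrams_freq <- sort(colSums(bigrams_dfm), decreasing = TRUE)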

Answer


After much headache I kind of solved this myself, in a very brute-force way that I am slightly embarrassed about, but I will show it anyway! I am sure there are more elegant and efficient ways (please feel free to educate me). Since I only need to process this text once, I guess the inelegant solution doesn't matter quite so much.


I converted the data to a 'tm' package VCorpus object, which consisted of three large text files, then iterated through the three files, manually slicing each one into chunks and processing one chunk at a time. For clarity of understanding I have not plumbed in the processing code given above; I have just indicated where it needs to be stitched in. I still need to add some code to accumulate the results from each chunk (one possible sketch of that follows the loop below).

library(tm)

folder.dataset.english <- 'final/corpus'

# The directory holds the three large text files
corpus <- VCorpus(DirSource(directory = folder.dataset.english,
                            encoding = "UTF-8", recursive = FALSE),
                  readerControl = list(language = "en"))
chunk.size <- 100000  # lines of text per chunk

for (t in seq_along(corpus)) {
    corp.size <- length(corpus[[t]]$content)
    # Walk through the file in consecutive, non-overlapping chunks;
    # min() clamps the last chunk to the end of the file
    for (l in seq(1, corp.size, by = chunk.size)) {
        h <- min(l + chunk.size - 1, corp.size)
        corpus.chunk <- corpus[[t]]$content[l:h]
        #### Processing code in here

        #### Processing code ends here
    }
}
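
One way to do that accumulation, sketched below: keep a running named vector of counts and merge each chunk's frequency vector into it. accumulate_freq() is a hypothetical helper, not part of the original answer, and bigrams_freq is assumed to be the named frequency vector the processing code produces for the current chunk (e.g. the sort(colSums(...)) result from the question):

# Hypothetical helper (not in the original answer): merge a chunk's named
# frequency vector into a running total, summing counts for shared n-grams
accumulate_freq <- function(total, chunk_freq) {
    all_ngrams <- union(names(total), names(chunk_freq))
    out <- setNames(numeric(length(all_ngrams)), all_ngrams)
    out[names(total)] <- total
    out[names(chunk_freq)] <- out[names(chunk_freq)] + chunk_freq
    out
}

total_bigram_freq <- numeric(0)
# ...then, inside the chunk loop after the processing code:
#   total_bigram_freq <- accumulate_freq(total_bigram_freq, bigrams_freq)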
