Speeding up the processing of large data frames in R


Question

I have been trying to implement the algorithm recently proposed in this paper. Given a large amount of text (corpus), the algorithm is supposed to return characteristic n-grams (i.e., sequence of n words) of the corpus. The user can decide the appropriate n, and at the moment I am trying with n = 2-6 as in the original paper. In other words, using the algorithm, I want to extract 2- to 6-grams that characterize the corpus.

I was able to implement the part that calculates the score based on which characteristic n-grams are identified, but have been struggling to eliminate non-characteristic ones.

I have a list called token.df that contains five data frames holding all the n-grams that appear in the corpus. Each data frame corresponds to one value of n. For example, token.df[[2]] includes all the bigrams (2-grams) and their scores (called mi below) in alphabetical order.

> head(token.df[[2]])
w1    w2      mi
_      eos  17.219346
_   global   7.141789
_     what   8.590394
0        0   2.076421
0       00   5.732846
0      000   3.426785

Here, the bigram 0 0 (though they are not quite words as such) has a score of 2.076421. Since the data frames include all the n-grams that appear in the corpus, they each have over one million rows.

> sapply(token.df, nrow)
[[1]]
NULL

[[2]]
[1] 1006059  # number of unique bigrams in the corpus

[[3]]
[1] 2684027  # number of unique trigrams in the corpus

[[4]]
[1] 3635026  # number of unique 4-grams in the corpus

[[5]]
[1] 3965120  # number of unique 5-grams in the corpus

[[6]]
[1] 4055048  # number of unique 6-grams in the corpus




Task

I want to identify which n-grams to retain and which ones to discard. For this purpose, the algorithm does the following.


  1. For bigrams, it retains those whose scores are higher than the scores of the trigrams whose first two words match the bigram.

  2. For each n-gram where n = {3, 4, 5}, it looks at
      • the n-1 grams that match the first n-1 words of the n-gram and
      • the n+1 grams whose first n words match the n-gram.
     The n-gram is retained only if its score is higher than the scores of both.

  3. For 6-grams, it retains those whose scores are higher than the scores of the 5-grams that match the first five words of the 6-gram.



      Example

      > token.df[[2]][15, ]
       w1  w2       mi
        0 001 10.56292
      > token.df[[3]][33:38, ]
       w1  w2       w3        mi
        0 001     also  3.223091
        0 001 although  5.288097
        0 001      and  2.295903
        0 001      but  4.331710
        0 001 compared  6.270625
        0 001      dog 11.002312
      > token.df[[4]][46:48, ]
       w1  w2            w3      w4        mi
        0 001      compared      to  5.527626
        0 001           dog walkers 10.916028
        0 001 environmental concern 10.371769
      

      Here, the bigram 0 001 is not retained because one of the trigrams whose first two words match the bigram (0 001 dog) has a higher score than the bigram (11.002312 > 10.56292). The trigram 0 001 dog is retained because its score (11.002312) is higher than that of the bigram that matches the first two words of the trigram (0 001; score = 10.56292) and that of the 4-gram whose first three words match the trigram (0 001 dog walkers; score = 10.916028).

      What I would like to know is an efficient way to achieve the above. In order to determine which bigrams to retain, for example, I need to find out for each row of token.df[[2]] which rows in token.df[[3]] have their first two words identical to the bigram in question. However, since the number of rows is large, my iteration approaches below take too long to run. They focus on the case of bigrams because the task looked simpler than the case of 3- to 5-grams.
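
      To make this lookup concrete, the trigrams competing with the example bigram 0 001 can be pulled out with a simple subset (an illustration added here, not code from the original post); this is the operation that has to be repeated for every one of the million-plus bigrams:

      # all trigrams whose first two words match the bigram "0 001"
      subset(token.df[[3]], w1 == "0" & w2 == "001")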


      1. The for loop approach.
        Since the code below goes over all the rows of token.df[[3]] at each iteration, it was estimated to take months to run. The by() version was slightly better, but similarly slow.

      # for loop
      retain <- numeric(nrow(token.df[[2]]))
      for (i in 1:nrow(token.df[[2]])) {
          mis <- token.df[[3]]$mi[token.df[[2]][i, ]$w1 == token.df[[3]][ , 1] & token.df[[2]][i, ]$w2 == token.df[[3]][ , 2]]
          retain[i] <- ifelse(token.df[[2]]$mi[i] > max(mis), TRUE, FALSE)
      }
      
      # by
      mis <- by(token.df[[2]], 1:nrow(token.df[[2]]), function(x) token.df[[3]]$mi[x$w1 == token.df[[3]]$w1 & x$w2 == token.df[[3]]$w2])
      retain <- sapply(seq(mis), function(i) token.df[[2]]$mi[i] > max(mis[[i]]))
      


    • The pointer approach.
      The problem with the above is the large number of iterations over a (vertically) long data frame. To alleviate the issue, I thought I could use the fact that n-grams are alphabetically sorted in each data frame and employ a kind of pointer indicating at which row to start looking. However, this approach, too, takes too long to run (at least several days).

      retain <- numeric(nrow(token.df[[2]]))
      nrow <- nrow(token.df[[3]]) # number of rows of the trigram data frame
      pos <- 1 # pointer
      for (i in seq(nrow(token.df[[2]]))) {
          j <- 1
          target.rows <- numeric(10)
          while (TRUE) {
              if (pos == nrow + 1 || !all(token.df[[2]][i, 1:2] == token.df[[3]][pos, 1:2])) break
              target.rows[j] <- pos
              pos <- pos + 1
              if (j %% 10 == 0) target.rows <- c(target.rows, numeric(10))
              j <- j + 1
          }
          target.rows <- target.rows[target.rows != 0]
          retain[i] <- ifelse(token.df[[2]]$mi[i] > max(token.df[[3]]$mi[target.rows]), TRUE, FALSE)
      }
      


      Is there a way to do this task within a reasonable amount of time (e.g., overnight)? Now that iteration approaches have been in vain, I am wondering if any vectorization is possible. But I am open to any means to speed up the process.

      The data have a tree structure in that one bigram is divided into one or more trigrams, each of which in turn is divided into one or more 4-grams, and so forth. I am not sure how best to process this kind of data.

      I thought about putting up part of the real data I'm using, but cutting down the data ruins the whole point of the issue. I assume people do not want to download the whole 250MB data set just for this, nor do I have the right to upload it. Below is a random data set that is smaller than the one I'm using but helps illustrate the problem. With the code above (the pointer approach), it takes my computer 4-5 seconds to process the first 100 rows of token.df[[2]] below, and it would presumably take about 12 hours just to process all the bigrams.

      token.df <- list()
      types <- combn(LETTERS, 4, paste, collapse = "")
      set.seed(1)
      data <- data.frame(matrix(sample(types, 6 * 1E6, replace = TRUE), ncol = 6), stringsAsFactors = FALSE)
      colnames(data) <- paste0("w", 1:6)
      data <- data[order(data$w1, data$w2, data$w3, data$w4, data$w5, data$w6), ]
      set.seed(1)
      for (n in 2:6) token.df[[n]] <- cbind(data[ , 1:n], mi = runif(1E6))
      

      Answer

      The following runs in under 7 seconds on my machine, for all the bigrams:

      library(dplyr)
      res <- inner_join(token.df[[2]],token.df[[3]],by = c('w1','w2'))
      res <- group_by(res,w1,w2)
      bigrams <- filter(summarise(res,keep = all(mi.y < mi.x)),keep)
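
      One possible follow-up (a sketch, not part of the original answer): bigrams above only lists the (w1, w2) pairs that beat all of their matching trigrams, so to get the retained rows of token.df[[2]] one would also keep the bigrams that have no matching trigram at all, since nothing outranks them:

      # bigrams that beat every matching trigram
      kept  <- semi_join(token.df[[2]], bigrams, by = c("w1", "w2"))
      # bigrams that never occur as the first two words of any trigram
      alone <- anti_join(token.df[[2]], token.df[[3]], by = c("w1", "w2"))
      retained.bigrams <- rbind(kept, alone)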
      

      这里没有什么特别的 dplyr 。可以使用 data.table 或直接在SQL中完成同样快速(或更快)的解决方案。您只需要切换到使用连接(如在SQL中),而不是自己遍历一切。实际上,如果在基本R中简单地使用 merge ,然后聚合将不会被命令,我不会感到惊讶比你现在在做的更快。 (但您真的应该使用 data.table dplyr 或直接在SQL数据库中执行此操作。

      There's nothing special about dplyr here. An equally fast (or faster) solution could surely be done using data.table or directly in SQL. You just need to switch to using joins (as in SQL) rather than iterating through everything yourself. In fact, I wouldn't be surprised if simply using merge in base R and then aggregate were orders of magnitude faster than what you're doing now. (But you really should be doing this with data.table, dplyr or directly in a SQL database.)
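
      For reference, a minimal sketch of that base-R merge/aggregate route (untimed, and assuming the same column layout as above):

      # join bigrams to the trigrams sharing their first two words;
      # mi.x is the bigram score, mi.y the trigram score
      m <- merge(token.df[[2]], token.df[[3]], by = c("w1", "w2"))
      m$beats <- m$mi.y < m$mi.x
      # a bigram is kept only if it outranks every matching trigram
      agg <- aggregate(beats ~ w1 + w2, data = m, FUN = all)
      keep <- agg[agg$beats, c("w1", "w2")]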

      Indeed, this:

      library(data.table)
      dt2 <- setkey(data.table(token.df[[2]]),w1,w2)
      dt3 <- setkey(data.table(token.df[[3]]),w1,w2)
      dt_tmp <- dt3[dt2,allow.cartesian = TRUE][,list(k = all(mi < mi.1)),by = c('w1','w2')][(k)]
      

      is even faster still (~2x). I'm not even really sure that I've squeezed all the speed I could have out of either package, to be honest.

      (edit from Rick. Attempted as comment, but syntax was getting messed up)
      If using data.table, this should be even faster, as data.table has a by-without-by feature (See ?data.table for more info):

       dt_tmp <- dt3[dt2,list(k = all(mi < i.mi)), allow.cartesian = TRUE][(k)]
      

      Note that when joining data.tables you can preface the column names with i. to indicate to use the column from specifically the data.table in the i= argument.
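
      The code above only handles the bigram case. As a rough sketch (not from the original answer) of how the same join idea might extend to the middle case of n = 3-5, here is one way to flag the trigrams, which must outscore both their two-word-prefix bigram and every 4-gram that extends them; trigrams with no matching 4-gram would need the same special handling as bigrams with no matching trigram:

      library(dplyr)
      # trigram vs. its bigram prefix: mi.x is the trigram score, mi.y the bigram score
      vs.lower  <- inner_join(token.df[[3]], token.df[[2]], by = c("w1", "w2"))
      vs.lower  <- summarise(group_by(vs.lower, w1, w2, w3), beats.lower = all(mi.x > mi.y))
      # trigram vs. the 4-grams whose first three words match it
      vs.higher <- inner_join(token.df[[3]], token.df[[4]], by = c("w1", "w2", "w3"))
      vs.higher <- summarise(group_by(vs.higher, w1, w2, w3), beats.higher = all(mi.x > mi.y))
      # keep trigrams that win on both sides
      trigrams  <- filter(inner_join(vs.lower, vs.higher, by = c("w1", "w2", "w3")),
                          beats.lower, beats.higher)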
