Speeding up the processing of large data frames in R


Question

I have been trying to implement the algorithm recently proposed in this paper. Given a large amount of text (corpus), the algorithm is supposed to return characteristic n-grams (i.e., sequence of n words) of the corpus. The user can decide the appropriate n, and at the moment I am trying with n = 2-6 as in the original paper. In other words, using the algorithm, I want to extract 2- to 6-grams that characterize the corpus.

I was able to implement the part that calculates the score based on which characteristic n-grams are identified, but have been struggling to eliminate non-characteristic ones.

I have a list called token.df that contains five data frames holding all the n-grams that appear in the corpus. Each data frame corresponds to one value of n. For example, token.df[[2]] includes all the bigrams (2-grams) and their scores (called mi below) in alphabetical order.

> head(token.df[[2]])
w1    w2      mi
_      eos  17.219346
_   global   7.141789
_     what   8.590394
0        0   2.076421
0       00   5.732846
0      000   3.426785

Here, the bigram 0 0 (though they are not quite words as such) has a score of 2.076421. Since the data frames include all the n-grams that appear in the corpus, they each have over one million rows.

> sapply(token.df, nrow)
[[1]]
NULL

[[2]]
[1] 1006059  # number of unique bigrams in the corpus

[[3]]
[1] 2684027  # number of unique trigrams in the corpus

[[4]]
[1] 3635026  # number of unique 4-grams in the corpus

[[5]]
[1] 3965120  # number of unique 5-grams in the corpus

[[6]]
[1] 4055048  # number of unique 6-grams in the corpus




Task

I want to identify which n-grams to retain and which ones to discard. For this purpose, the algorithm does the following.


  1. For bigrams, it retains those whose scores are higher than the scores of the trigrams whose first two words match the bigram.

  2. For each n-gram where n = {3, 4, 5}, it looks at
      • the n-1 grams that match the first n-1 words of the n-gram and
      • the n+1 grams whose first n words match the n-gram.
     The n-gram is retained only if its score is higher than the scores of both.

  3. For 6-grams, it retains those whose scores are higher than the scores of the 5-grams that match the first five words of the 6-gram.



      Example

      > token.df[[2]][15, ]
       w1  w2       mi
        0 001 10.56292
      > token.df[[3]][33:38, ]
       w1  w2       w3        mi
        0 001     also  3.223091
        0 001 although  5.288097
        0 001      and  2.295903
        0 001      but  4.331710
        0 001 compared  6.270625
        0 001      dog 11.002312
      > token.df[[4]][46:48, ]
       w1  w2            w3      w4        mi
        0 001      compared      to  5.527626
        0 001           dog walkers 10.916028
        0 001 environmental concern 10.371769
      

      Here, the bigram 0 001 is not retained because one of the trigrams whose first two words match the bigram (0 001 dog) has a higher score than the bigram (11.002312 > 10.56292). The trigram 0 001 dog is retained because its score (11.002312) is higher than that of the bigram that matches the first two words of the trigram (0 001; score = 10.56292) and that of the 4-gram whose first three words match the trigram (0 001 dog walkers; score = 10.916028).

      What I would like to know is an efficient way to achieve the above. In order to determine which bigrams to retain, for example, I need to find out for each row of token.df[[2]] which rows in token.df[[3]] have their first two words identical to the bigram in question. However, since the number of rows is large, my iteration approaches below take too long to run. They focus on the case of bigrams because the task looked simpler than the case of 3- to 5-grams.
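
      To make this lookup concrete, the trigrams competing with the example bigram 0 001 can be pulled out with a simple subset (an illustration added here, not code from the original post); this is the operation that has to be repeated for every one of the million-plus bigrams:

      # all trigrams whose first two words match the bigram "0 001"
      subset(token.df[[3]], w1 == "0" & w2 == "001")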


      1. The for loop approach.
        Since the code below goes over all the rows of token.df[[3]] at each iteration, it was estimated to take months to run. The by() version was slightly better, but similarly slow.

      # for loop
      retain <- numeric(nrow(token.df[[2]]))
      for (i in 1:nrow(token.df[[2]])) {
          mis <- token.df[[3]]$mi[token.df[[2]][i, ]$w1 == token.df[[3]][ , 1] & token.df[[2]][i, ]$w2 == token.df[[3]][ , 2]]
          retain[i] <- ifelse(token.df[[2]]$mi[i] > max(mis), TRUE, FALSE)
      }
      
      # by
      mis <- by(token.df[[2]], 1:nrow(token.df[[2]]), function(x) token.df[[3]]$mi[x$w1 == token.df[[3]]$w1 & x$w2 == token.df[[3]]$w2])
      retain <- sapply(seq(mis), function(i) token.df[[2]]$mi[i] > max(mis[[i]]))
      


    • The pointer approach.
      The problem with the above is the large number of iterations over a (vertically) long data frame. To alleviate the issue, I thought I could use the fact that n-grams are alphabetically sorted in each data frame and employ a kind of pointer indicating at which row to start looking. However, this approach, too, takes too long to run (at least several days).

      retain <- numeric(nrow(token.df[[2]]))
      nrow <- nrow(token.df[[3]]) # number of rows of the trigram data frame
      pos <- 1 # pointer
      for (i in seq(nrow(token.df[[2]]))) {
          j <- 1
          target.rows <- numeric(10)
          while (TRUE) {
              if (pos == nrow + 1 || !all(token.df[[2]][i, 1:2] == token.df[[3]][pos, 1:2])) break
              target.rows[j] <- pos
              pos <- pos + 1
              if (j %% 10 == 0) target.rows <- c(target.rows, numeric(10))
              j <- j + 1
          }
          target.rows <- target.rows[target.rows != 0]
          retain[i] <- ifelse(token.df[[2]]$mi[i] > max(token.df[[3]]$mi[target.rows]), TRUE, FALSE)
      }
      


      Is there a way to do this task within a reasonable amount of time (e.g., overnight)? Now that iteration approaches have been in vain, I am wondering if any vectorization is possible. But I am open to any means to speed up the process.

      The data have a tree structure in that one bigram is divided into one or more trigrams, each of which in turn is divided into one or more 4-grams, and so forth. I am not sure how best to process this kind of data.

      I thought about putting up part of the real data I'm using, but cutting down the data ruins the whole point of the issue. I assume people do not want to download the whole 250MB data set just for this, nor do I have the right to upload it. Below is a random data set that is smaller than the one I'm using but helps illustrate the problem. With the code above (the pointer approach), it takes my computer 4-5 seconds to process the first 100 rows of token.df[[2]] below, and it would presumably take about 12 hours just to process all the bigrams.

      token.df <- list()
      types <- combn(LETTERS, 4, paste, collapse = "")
      set.seed(1)
      data <- data.frame(matrix(sample(types, 6 * 1E6, replace = TRUE), ncol = 6), stringsAsFactors = FALSE)
      colnames(data) <- paste0("w", 1:6)
      data <- data[order(data$w1, data$w2, data$w3, data$w4, data$w5, data$w6), ]
      set.seed(1)
      for (n in 2:6) token.df[[n]] <- cbind(data[ , 1:n], mi = runif(1E6))
      

      Answer

      The following runs in under 7 seconds on my machine, for all the bigrams:

      library(dplyr)
      res <- inner_join(token.df[[2]],token.df[[3]],by = c('w1','w2'))
      res <- group_by(res,w1,w2)
      bigrams <- filter(summarise(res,keep = all(mi.y < mi.x)),keep)
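
      One possible follow-up (a sketch, not part of the original answer): bigrams above only lists the (w1, w2) pairs that beat all of their matching trigrams, so to get the retained rows of token.df[[2]] one would also keep the bigrams that have no matching trigram at all, since nothing outranks them:

      # bigrams that beat every matching trigram
      kept  <- semi_join(token.df[[2]], bigrams, by = c("w1", "w2"))
      # bigrams that never occur as the first two words of any trigram
      alone <- anti_join(token.df[[2]], token.df[[3]], by = c("w1", "w2"))
      retained.bigrams <- rbind(kept, alone)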
      

      这里没有什么特别的 dplyr 。可以使用 data.table 或直接在SQL中完成同样快速(或更快)的解决方案。您只需要切换到使用连接(如在SQL中),而不是自己遍历一切。实际上,如果在基本R中简单地使用 merge ,然后聚合将不会被命令,我不会感到惊讶比你现在在做的更快。 (但您真的应该使用 data.table dplyr 或直接在SQL数据库中执行此操作。

      There's nothing special about dplyr here. An equally fast (or faster) solution could surely be done using data.table or directly in SQL. You just need to switch to using joins (as in SQL) rather than iterating through everything yourself. In fact, I wouldn't be surprised if simply using merge in base R and then aggregate were orders of magnitude faster than what you're doing now. (But you really should be doing this with data.table, dplyr or directly in a SQL database.)
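
      For reference, a minimal sketch of that base-R merge/aggregate route (untimed, and assuming the same column layout as above):

      # join bigrams to the trigrams sharing their first two words;
      # mi.x is the bigram score, mi.y the trigram score
      m <- merge(token.df[[2]], token.df[[3]], by = c("w1", "w2"))
      m$beats <- m$mi.y < m$mi.x
      # a bigram is kept only if it outranks every matching trigram
      agg <- aggregate(beats ~ w1 + w2, data = m, FUN = all)
      keep <- agg[agg$beats, c("w1", "w2")]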

      Indeed, this:

      library(data.table)
      dt2 <- setkey(data.table(token.df[[2]]),w1,w2)
      dt3 <- setkey(data.table(token.df[[3]]),w1,w2)
      dt_tmp <- dt3[dt2,allow.cartesian = TRUE][,list(k = all(mi < mi.1)),by = c('w1','w2')][(k)]
      

      is even faster still (~2x). I'm not even really sure that I've squeezed all the speed I could have out of either package, to be honest.

      (edit from Rick. Attempted as comment, but syntax was getting messed up)
      If using data.table, this should be even faster, as data.table has a by-without-by feature (See ?data.table for more info):

       dt_tmp <- dt3[dt2,list(k = all(mi < i.mi)), allow.cartesian = TRUE][(k)]
      

      Note that when joining data.tables you can preface the column names with i. to indicate to use the column from specifically the data.table in the i= argument.
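
      The code above only handles the bigram case. As a rough sketch (not from the original answer) of how the same join idea might extend to the middle case of n = 3-5, here is one way to flag the trigrams, which must outscore both their two-word-prefix bigram and every 4-gram that extends them; trigrams with no matching 4-gram would need the same special handling as bigrams with no matching trigram:

      library(dplyr)
      # trigram vs. its bigram prefix: mi.x is the trigram score, mi.y the bigram score
      vs.lower  <- inner_join(token.df[[3]], token.df[[2]], by = c("w1", "w2"))
      vs.lower  <- summarise(group_by(vs.lower, w1, w2, w3), beats.lower = all(mi.x > mi.y))
      # trigram vs. the 4-grams whose first three words match it
      vs.higher <- inner_join(token.df[[3]], token.df[[4]], by = c("w1", "w2", "w3"))
      vs.higher <- summarise(group_by(vs.higher, w1, w2, w3), beats.higher = all(mi.x > mi.y))
      # keep trigrams that win on both sides
      trigrams  <- filter(inner_join(vs.lower, vs.higher, by = c("w1", "w2", "w3")),
                          beats.lower, beats.higher)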
