Converting a list of tokens to n-grams

Question

I have a list of documents that have already been tokenized:

dat <- list(c("texaco", "canada", "lowered", "contract", "price", "pay", 
"crude", "oil", "canadian", "cts", "barrel", "effective", "decrease", 
"brings", "companys", "posted", "price", "benchmark", "grade", 
"edmonton", "swann", "hills", "light", "sweet", "canadian", "dlrs", 
"bbl", "texaco", "canada", "changed", "crude", "oil", "postings", 
"feb", "reuter"), c("argentine", "crude", "oil", "production", 
"pct", "january", "mln", "barrels", "mln", "barrels", "january", 
"yacimientos", "petroliferos", "fiscales", "january", "natural", 
"gas", "output", "totalled", "billion", "cubic", "metrers", "pct", 
"billion", "cubic", "metres", "produced", "january", "yacimientos", 
"petroliferos", "fiscales", "added", "reuter"))

I'm trying to efficiently convert this list of tokens to a list of n-grams. Here's the function I've written so far:

find_ngrams <- function(x, n){
  if (n == 1) return(x)
  out <- as.list(rep(NA, length(x)))
  for (i in seq_along(x)){
    words <- x[[i]]
    out[[i]] <- words                    # start with the unigrams
    for (j in 2:n){
      # Build the j-grams one position at a time. NB: the `n` in the
      # subscript below should be `j`; as written, the window is cut
      # short whenever j < n, silently dropping some j-grams (this is
      # the bug discovered further down).
      phrases <- sapply(1:j, function(k){
        words[k:(length(words) - n + k)]
      })
      phrases <- apply(phrases, 1, paste, collapse=" ")
      out[[i]] <- c(out[[i]], phrases)
    }
  }
  return(out)
}

This works fine for finding n-grams, but it seems inefficient. Replacing the for loops with *apply functions would still leave me with loops nested three deep:

result <- find_ngrams(dat, 2)
> result[[2]]
 [1] "argentine"                "crude"                    "oil"                     
 [4] "production"               "pct"                      "january"                 
 [7] "mln"                      "barrels"                  "mln"                     
[10] "barrels"                  "january"                  "yacimientos"             
[13] "petroliferos"             "fiscales"                 "january"                 
[16] "natural"                  "gas"                      "output"                  
[19] "totalled"                 "billion"                  "cubic"                   
[22] "metrers"                  "pct"                      "billion"                 
[25] "cubic"                    "metres"                   "produced"                
[28] "january"                  "yacimientos"              "petroliferos"            
[31] "fiscales"                 "added"                    "reuter"                  
[34] "argentine crude"          "crude oil"                "oil production"          
[37] "production pct"           "pct january"              "january mln"             
[40] "mln barrels"              "barrels mln"              "mln barrels"             
[43] "barrels january"          "january yacimientos"      "yacimientos petroliferos"
[46] "petroliferos fiscales"    "fiscales january"         "january natural"         
[49] "natural gas"              "gas output"               "output totalled"         
[52] "totalled billion"         "billion cubic"            "cubic metrers"           
[55] "metrers pct"              "pct billion"              "billion cubic"           
[58] "cubic metres"             "metres produced"          "produced january"        
[61] "january yacimientos"      "yacimientos petroliferos" "petroliferos fiscales"   
[64] "fiscales added"           "added reuter"            

Are there any significant parts of this code that could be vectorized?

/edit: Here's an updated version of Matthew Plourde's function that does "up-to-n-grams" and works across the entire list:

find_ngrams_base <- function(x, n) {
  if (n == 1) return(x)
  out <- lapply(1:n, function(n_i) embed(x, n_i))
  out <- sapply(out, function(y) apply(y, 1, function(row) paste(rev(row), collapse=' ')))
  unlist(out)
}

find_ngrams_plourde <- function(x, ...){
  lapply(x, find_ngrams_base, ...)
}
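Both helpers lean on embed(), which slides a window of width n_i along the vector and returns one window per row, with the elements in reverse order; that is why each row goes through rev() before pasting. A quick illustration:

embed(c("a", "b", "c", "d"), 2)
#      [,1] [,2]
# [1,] "b"  "a"
# [2,] "c"  "b"
# [3,] "d"  "c"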

We can benchmark against the function I wrote, and see that it's a bit slower:

library(rbenchmark)
benchmark(
  replications=100,
  a <- find_ngrams(dat, 2),
  b <- find_ngrams(dat, 3),
  c <- find_ngrams(dat, 4),
  d <- find_ngrams(dat, 10),
  w <- find_ngrams_plourde(dat, 2),
  x <- find_ngrams_plourde(dat, 3),
  y <- find_ngrams_plourde(dat, 4),
  z <- find_ngrams_plourde(dat, 10),
  columns=c('test', 'elapsed', 'relative'),
  order='relative'
)
                               test elapsed relative
1          a <- find_ngrams(dat, 2)   0.040    1.000
2          b <- find_ngrams(dat, 3)   0.081    2.025
3          c <- find_ngrams(dat, 4)   0.117    2.925
5  w <- find_ngrams_plourde(dat, 2)   0.144    3.600
6  x <- find_ngrams_plourde(dat, 3)   0.212    5.300
7  y <- find_ngrams_plourde(dat, 4)   0.277    6.925
4         d <- find_ngrams(dat, 10)   0.361    9.025
8 z <- find_ngrams_plourde(dat, 10)   0.669   16.725

However, it also finds a lot of n-grams my function misses (whoops):

for (i in 1:length(dat)){
  print(setdiff(w[[i]], a[[i]]))
  print(setdiff(x[[i]], b[[i]]))
  print(setdiff(y[[i]], c[[i]]))
  print(setdiff(z[[i]], d[[i]]))
}

I feel like both functions can be improved, but I can't think of any way to avoid the triple loop (loop over the vectors, loop over the number of n-grams needed, 1:n, and loop over the words to construct the n-grams).

/edit 2: Here's a revised function, based on Matt's answer:

find_ngrams_2 <- function(x, n){
  if (n == 1) return(x)
  lapply(x, function(y) {
    c(y, unlist(lapply(2:n, function(n_i) {
      do.call(paste, unname(rev(data.frame(embed(y, n_i), stringsAsFactors=FALSE))))
    })))
  })
}
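As a quick sanity check (an illustrative test, not part of the benchmarks below), the revised function should agree with find_ngrams_plourde above, since both emit the unigrams first, then the bigrams, and so on:

# Should print TRUE: both build the same n-grams in the same order
all.equal(find_ngrams_2(dat, 3), find_ngrams_plourde(dat, 3))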

It seems to return the correct list of n-grams, and it is faster (in most cases) than my original function:

library(rbenchmark)
benchmark(
  replications=100,
  a <- find_ngrams(dat, 2),
  b <- find_ngrams(dat, 3),
  c <- find_ngrams(dat, 4),
  d <- find_ngrams(dat, 10),
  w <- find_ngrams_2(dat, 2),
  x <- find_ngrams_2(dat, 3),
  y <- find_ngrams_2(dat, 4),
  z <- find_ngrams_2(dat, 10),
  columns=c('test', 'elapsed', 'relative'),
  order='relative'
)

                         test elapsed relative
5  w <- find_ngrams_2(dat, 2)   0.039    1.000
1    a <- find_ngrams(dat, 2)   0.041    1.051
6  x <- find_ngrams_2(dat, 3)   0.078    2.000
2    b <- find_ngrams(dat, 3)   0.081    2.077
7  y <- find_ngrams_2(dat, 4)   0.119    3.051
3    c <- find_ngrams(dat, 4)   0.123    3.154
4   d <- find_ngrams(dat, 10)   0.399   10.231
8 z <- find_ngrams_2(dat, 10)   0.436   11.179

Answer

Here's a way with embed.

find_ngrams <- function(x, n) {
    if (n == 1) return(x)
    c(x, apply(embed(x, n), 1, function(row) paste(rev(row), collapse=' ')))
}
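Note that this version operates on a single token vector rather than the whole list, so you would lapply it over dat. A minimal example of what it returns:

# Unigrams first, then the bigrams built by embed()
find_ngrams(c("to", "be", "or", "not"), 2)
# [1] "to"     "be"     "or"     "not"    "to be"  "be or"  "or not"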

There seems to be a bug in your function. If you fix that, we can do a benchmark.
