parallel parLapply setup

Question

I am trying to use part-of-speech tagging from the openNLP/NLP packages in parallel. I need the code to work on any OS, so I am opting to use the parLapply function from parallel (but am open to other OS-independent options). In the past I ran the tagPOS function from the openNLP package in parLapply with no problem. However, the openNLP package had some recent changes that eliminated tagPOS and added some more flexible options. Kurt was kind enough to help me recreate the tagPOS function from the new package's tools. I can get the lapply version to work but not the parallel version. It keeps saying the nodes need more variables passed to them until it finally asks for a non-exported function from openNLP. It seems odd that it would keep asking for more and more variables to be passed, which tells me I'm setting up parLapply incorrectly. How can I set up tagPOS to operate in a parallel, OS-independent fashion?

library(openNLP)
library(NLP)
library(parallel)

## POS tagger
tagPOS <-  function(x, pos_tag_annotator, ...) {
    s <- as.String(x)
    ## Need sentence and word token annotations.
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, pos_tag_annotator, a2)

    ## Determine the distribution of POS tags for word tokens.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))

    ## Extract token/POS pairs (all of them): easy.
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
} ## End of tagPOS function 

## Set up a parallel run
text.var <- c("I like it.", "This is outstanding soup!",  
    "I really must get the recipe.")
ntv <- length(text.var)
PTA <- Maxent_POS_Tag_Annotator()   

cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterExport(cl=cl, varlist=c("text.var", "ntv", 
    "tagPOS", "PTA", "as.String", "Maxent_Word_Token_Annotator"), 
    envir = environment())
m <- parLapply(cl, seq_len(ntv), function(i) {
        x <- tagPOS(text.var[i], PTA)
        return(x)
    }
)
stopCluster(cl)

## Error in checkForRemoteErrors(val) : 
##   3 nodes produced errors; first error: could not find function 
##   "Maxent_Simple_Word_Tokenizer"

openNLP::Maxent_Simple_Word_Tokenizer

## >openNLP::Maxent_Simple_Word_Tokenizer
## Error: 'Maxent_Simple_Word_Tokenizer' is not an exported 
##     object from 'namespace:openNLP'

## It's a non exported function
openNLP:::Maxent_Simple_Word_Tokenizer


## Demo that it works with lapply
lapply(seq_len(ntv), function(i) {
    tagPOS(text.var[i], PTA)
})

lapply(text.var, function(x) {
    tagPOS(x, PTA)
})

## >     lapply(seq_len(ntv), function(i) {
## +         tagPOS(text.var[i], PTA)
## +     })
## [[1]]
## [[1]]$POStagged
## [1] "I/PRP like/IN it/PRP ./."
## 
## [[1]]$POStags
## [1] "PRP" "IN"  "PRP" "."  
## 
## [[1]]$word.count
## [1] 3
## 
## 
## [[2]]
## [[2]]$POStagged
## [1] "THis/DT is/VBZ outstanding/JJ soup/NN !/."
## 
## [[2]]$POStags
## [1] "DT"  "VBZ" "JJ"  "NN"  "."  
## 
## [[2]]$word.count
## [1] 4
## 
## 
## [[3]]
## [[3]]$POStagged
## [1] "I/PRP really/RB must/MD get/VB the/DT recip/NN ./."
## 
## [[3]]$POStags
## [1] "PRP" "RB"  "MD"  "VB"  "DT"  "NN"  "."  
## 
## [[3]]$word.count
## [1] 6

Edit: per Steve's suggestion

Note that openNLP is brand new. I installed version 0.2-1 from a tar.gz from CRAN. I get the following error even though this function exists.

library(openNLP); library(NLP); library(parallel)

tagPOS <-  function(text.var, pos_tag_annotator, ...) {
    s <- as.String(text.var)

    ## Set up the POS annotator if missing (for parallel)
    if (missing(pos_tag_annotator)) {
        PTA <- Maxent_POS_Tag_Annotator()
    }

    ## Need sentence and word token annotations.
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, PTA, a2)

    ## Determine the distribution of POS tags for word tokens.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, "[[", "POS"))

    ## Extract token/POS pairs (all of them): easy.
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
}

text.var <- c("I like it.", "This is outstanding soup!",  
    "I really must get the recipe.")

cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterEvalQ(cl, {library(openNLP); library(NLP)})
m <- parLapply(cl, text.var, tagPOS)

## > m <- parLapply(cl, text.var, tagPOS)
## Error in checkForRemoteErrors(val) : 
##   3 nodes produced errors; first error: could not find function "Maxent_POS_Tag_Annotator"

stopCluster(cl)


> packageDescription('openNLP')
Package: openNLP
Encoding: UTF-8
Version: 0.2-1
Title: Apache OpenNLP Tools Interface
Authors@R: person("Kurt", "Hornik", role = c("aut", "cre"), email =
          "Kurt.Hornik@R-project.org")
Description: An interface to the Apache OpenNLP tools (version 1.5.3).  The Apache OpenNLP
          library is a machine learning based toolkit for the processing of natural language
          text written in Java.  It supports the most common NLP tasks, such as tokenization,
          sentence segmentation, part-of-speech tagging, named entity extraction, chunking,
          parsing, and coreference resolution.  See http://opennlp.apache.org/ for more
          information.
Imports: NLP (>= 0.1-0), openNLPdata (>= 1.5.3-1), rJava (>= 0.6-3)
SystemRequirements: Java (>= 5.0)
License: GPL-3
Packaged: 2013-08-20 13:23:54 UTC; hornik
Author: Kurt Hornik [aut, cre]
Maintainer: Kurt Hornik <Kurt.Hornik@R-project.org>
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-08-20 15:41:22
Built: R 3.0.1; ; 2013-08-20 13:48:47 UTC; windows

Recommended answer

Since you're calling functions from NLP on the cluster workers, you should load it on each of the workers before calling parLapply. You can do that from the worker function, but I tend to use clusterCall or clusterEvalQ right after creating the cluster object:

clusterEvalQ(cl, {library(openNLP); library(NLP)})

Since as.String and Maxent_Word_Token_Annotator are in those packages, they shouldn't be exported.
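
To see why the packages have to be loaded on the workers, here is a minimal sketch that uses only base R. It assumes a local two-worker cluster and uses the tools package (with its file_ext function) purely as a stand-in for openNLP/NLP; the file names and the choice of package are illustrative and not part of the original example.

```r
library(parallel)

## Assumption: a local two-worker cluster; `tools` stands in for openNLP/NLP.
cl <- makeCluster(2)

## Each worker is a separate R session, so check what is attached there first.
attached_before <- unlist(clusterEvalQ(cl, "package:tools" %in% search()))

## Attach the package on every worker instead of exporting its functions.
invisible(clusterEvalQ(cl, library(tools)))
attached_after <- unlist(clusterEvalQ(cl, "package:tools" %in% search()))

## file_ext() is resolved on the workers through their own search path.
exts <- parLapply(cl, c("notes.txt", "data.csv"), function(f) file_ext(f))

stopCluster(cl)

print(attached_before)
print(attached_after)
print(unlist(exts))
```

Attaching the whole package on each worker also makes its internal, non-exported helpers reachable from the package's own functions, which exporting individual functions with clusterExport does not do.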

Note that while running your example on my machine, I noticed that the PTA object doesn't work after being exported to the worker machines. Presumably there is something in that object that can't be safely serialized and unserialized. After I created that object on the workers using clusterEvalQ, the example ran successfully. Here it is, using openNLP 0.2-1:

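library(parallel)

## Worker function. PTA is deliberately not an argument: it is looked up in
## each worker's global environment, where clusterEvalQ creates it below.
tagPOS <-  function(x, ...) {
    s <- as.String(x)
    ## Need sentence and word token annotations.
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, PTA, a2)
    ## Keep only the word tokens and pull out their POS tags.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
    ## Build the token/POS string.
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
}
text.var <- c("I like it.", "This is outstanding soup!",
    "I really must get the recipe.")
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
## Load the packages and create the annotator on each worker,
## rather than exporting PTA from the master.
clusterEvalQ(cl, {
    library(openNLP)
    library(NLP)
    PTA <- Maxent_POS_Tag_Annotator()
})
m <- parLapply(cl, text.var, tagPOS)
print(m)
stopCluster(cl)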
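
The pattern used here, building an object on each worker and letting the worker function find it by name, can be tried without openNLP or Java. The following is only a sketch under stated assumptions: fake_tagger and tag_words are hypothetical names, and the dummy tagger merely stands in for PTA.

```r
library(parallel)

cl <- makeCluster(2)

## Create the (hypothetical) tagger on each worker; it never exists on the master.
## Ending the block with NULL keeps the object from being returned as the result.
invisible(clusterEvalQ(cl, {
    fake_tagger <- function(words) paste0(words, "/TAG")
    NULL
}))

## The worker function refers to fake_tagger by name only, just as tagPOS
## refers to PTA; the lookup happens in each worker's global environment.
tag_words <- function(x) {
    words <- strsplit(x, " ", fixed = TRUE)[[1]]
    paste(fake_tagger(words), collapse = " ")
}

text.var <- c("I like it", "This is outstanding soup")
m <- parLapply(cl, text.var, tag_words)
stopCluster(cl)

print(m)
```

clusterEvalQ returns the value of its expression from every worker, so when the last expression is the assignment itself, each worker also serializes the new object back to the master. Ending the block with NULL avoids that round trip for objects that do not serialize well.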

If clusterEvalQ fails because Maxent_POS_Tag_Annotator is not found, you might be loading the wrong version of openNLP on the workers. You can determine what package versions you're getting on the workers by executing sessionInfo with clusterEvalQ:

library(parallel)
cl <- makeCluster(2)
clusterEvalQ(cl, {library(openNLP); library(NLP)})
clusterEvalQ(cl, sessionInfo())

This will return the results of executing sessionInfo() on each of the cluster workers. Here is the version information for some of the packages that I'm using and that work for me:

other attached packages:
[1] NLP_0.1-0     openNLP_0.2-1

loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-1 rJava_0.9-4
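
If the full sessionInfo() output is more than you need, a narrower check is to compare one package's version and the library paths between the master and the workers. This sketch assumes a local two-worker cluster and uses the stats package only as a placeholder for openNLP.

```r
library(parallel)

cl <- makeCluster(2)

## Placeholder package name; substitute "openNLP" where it is installed.
pkg <- "stats"

## Version of the package that each worker sees.
worker_versions <- unlist(
    clusterCall(cl, function(p) as.character(packageVersion(p)), pkg)
)

## Library search paths on each worker; differences from the master's
## .libPaths() can explain a version mismatch.
worker_libs <- clusterEvalQ(cl, .libPaths())

stopCluster(cl)

master_version <- as.character(packageVersion(pkg))
print(worker_versions)
print(worker_versions == master_version)
print(worker_libs)
```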
