parallel parLapply setup

Question

I am trying to use part-of-speech tagging from the openNLP/NLP packages in parallel. I need the code to work on any OS, so I am opting to use the parLapply function from parallel (but I am open to other OS-independent options). In the past I ran the tagPOS function from the openNLP package inside parLapply with no problem. However, recent changes to the openNLP package eliminated tagPOS and added some more flexible options. Kurt was kind enough to help me recreate the tagPOS function from the new package's tools. I can get the lapply version to work but not the parallel version. It keeps saying the nodes need more variables passed to them, until it finally asks for a non-exported function from openNLP. It seems odd that it would keep asking for more and more variables, which tells me I'm setting up parLapply incorrectly. How can I set up tagPOS to operate in a parallel, OS-independent fashion?

library(openNLP)
library(NLP)
library(parallel)

## POS tagger
tagPOS <-  function(x, pos_tag_annotator, ...) {
    s <- as.String(x)
    ## Need sentence and word token annotations.
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, pos_tag_annotator, a2)

    ## Determine the distribution of POS tags for word tokens.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))

    ## Extract token/POS pairs (all of them): easy.
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
} ## End of tagPOS function 

## Set up a parallel run
text.var <- c("I like it.", "This is outstanding soup!",  
    "I really must get the recipe.")
ntv <- length(text.var)
PTA <- Maxent_POS_Tag_Annotator()   

cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterExport(cl=cl, varlist=c("text.var", "ntv", 
    "tagPOS", "PTA", "as.String", "Maxent_Word_Token_Annotator"), 
    envir = environment())
m <- parLapply(cl, seq_len(ntv), function(i) {
        x <- tagPOS(text.var[i], PTA)
        return(x)
    }
)
stopCluster(cl)

## Error in checkForRemoteErrors(val) : 
##   3 nodes produced errors; first error: could not find function 
##   "Maxent_Simple_Word_Tokenizer"

openNLP::Maxent_Simple_Word_Tokenizer

## >openNLP::Maxent_Simple_Word_Tokenizer
## Error: 'Maxent_Simple_Word_Tokenizer' is not an exported 
##     object from 'namespace:openNLP'

## It's a non exported function
openNLP:::Maxent_Simple_Word_Tokenizer


## Demo that it works with lapply
lapply(seq_len(ntv), function(i) {
    tagPOS(text.var[i], PTA)
})

lapply(text.var, function(x) {
    tagPOS(x, PTA)
})

## >     lapply(seq_len(ntv), function(i) {
## +         tagPOS(text.var[i], PTA)
## +     })
## [[1]]
## [[1]]$POStagged
## [1] "I/PRP like/IN it/PRP ./."
## 
## [[1]]$POStags
## [1] "PRP" "IN"  "PRP" "."  
## 
## [[1]]$word.count
## [1] 3
## 
## 
## [[2]]
## [[2]]$POStagged
## [1] "THis/DT is/VBZ outstanding/JJ soup/NN !/."
## 
## [[2]]$POStags
## [1] "DT"  "VBZ" "JJ"  "NN"  "."  
## 
## [[2]]$word.count
## [1] 4
## 
## 
## [[3]]
## [[3]]$POStagged
## [1] "I/PRP really/RB must/MD get/VB the/DT recip/NN ./."
## 
## [[3]]$POStags
## [1] "PRP" "RB"  "MD"  "VB"  "DT"  "NN"  "."  
## 
## [[3]]$word.count
## [1] 6

EDIT: Following Steve's suggestion

Note that this openNLP release is brand new. I installed version 0.2-1 from a tar.gz from CRAN. I get the following error even though the function exists.

library(openNLP); library(NLP); library(parallel)

tagPOS <-  function(text.var, pos_tag_annotator, ...) {
    s <- as.String(text.var)

    ## Set up the POS annotator if missing (for parallel)
    if (missing(pos_tag_annotator)) {
        PTA <- Maxent_POS_Tag_Annotator()
    }

    ## Need sentence and word token annotations.
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, PTA, a2)

    ## Determine the distribution of POS tags for word tokens.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, "[[", "POS"))

    ## Extract token/POS pairs (all of them): easy.
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
}

text.var <- c("I like it.", "This is outstanding soup!",  
    "I really must get the recipe.")

cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterEvalQ(cl, {library(openNLP); library(NLP)})
m <- parLapply(cl, text.var, tagPOS)

## > m <- parLapply(cl, text.var, tagPOS)
## Error in checkForRemoteErrors(val) : 
##   3 nodes produced errors; first error: could not find function "Maxent_POS_Tag_Annotator"

stopCluster(cl)


> packageDescription('openNLP')
Package: openNLP
Encoding: UTF-8
Version: 0.2-1
Title: Apache OpenNLP Tools Interface
Authors@R: person("Kurt", "Hornik", role = c("aut", "cre"), email =
          "Kurt.Hornik@R-project.org")
Description: An interface to the Apache OpenNLP tools (version 1.5.3).  The Apache OpenNLP
          library is a machine learning based toolkit for the processing of natural language
          text written in Java.  It supports the most common NLP tasks, such as tokenization,
          sentence segmentation, part-of-speech tagging, named entity extraction, chunking,
          parsing, and coreference resolution.  See http://opennlp.apache.org/ for more
          information.
Imports: NLP (>= 0.1-0), openNLPdata (>= 1.5.3-1), rJava (>= 0.6-3)
SystemRequirements: Java (>= 5.0)
License: GPL-3
Packaged: 2013-08-20 13:23:54 UTC; hornik
Author: Kurt Hornik [aut, cre]
Maintainer: Kurt Hornik <Kurt.Hornik@R-project.org>
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-08-20 15:41:22
Built: R 3.0.1; ; 2013-08-20 13:48:47 UTC; windows

Recommended answer

Since you're calling functions from NLP on the cluster workers, you should load it on each of the workers before calling parLapply. You can do that from the worker function, but I tend to use clusterCall or clusterEvalQ right after creating the cluster object:

clusterEvalQ(cl, {library(openNLP); library(NLP)})

Since as.String and Maxent_Word_Token_Annotator live in those packages, they shouldn't be exported with clusterExport; loading the packages on the workers makes them available.
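To illustrate the point about loading versus exporting, here is a minimal sketch that uses the base tools package as a stand-in for openNLP/NLP (an assumption made only so the example runs without Java). Once the package is attached on each worker, its functions are found by name and nothing from it has to go through clusterExport:

```r
library(parallel)

cl <- makeCluster(2)

# Attach the package on every worker; its functions are then found by name.
clusterEvalQ(cl, library(tools))

files <- c("notes.txt", "data.csv", "model.rds")

# file_ext() comes from 'tools'. It was never passed to clusterExport();
# the workers resolve it through their own search path.
exts <- parLapply(cl, files, function(f) file_ext(f))

stopCluster(cl)
print(unlist(exts))
```

The same reasoning applies to as.String and Maxent_Word_Token_Annotator: they are resolved on the worker once library(openNLP) and library(NLP) have run there.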

Note that while running your example on my machine, I noticed that the PTA object doesn't work after being exported to the workers. Presumably there is something in that object that can't be safely serialized and unserialized. After I created the object on the workers using clusterEvalQ, the example ran successfully. Here it is, using openNLP 0.2-1:

library(parallel)
tagPOS <-  function(x, ...) {
    s <- as.String(x)
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, PTA, a2)
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
}
text.var <- c("I like it.", "This is outstanding soup!",
    "I really must get the recipe.")
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterEvalQ(cl, {
    library(openNLP)
    library(NLP)
    PTA <- Maxent_POS_Tag_Annotator()
})
m <- parLapply(cl, text.var, tagPOS)
print(m)
stopCluster(cl)
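As a variation on the setup above, the per-worker object can also be created lazily from inside the worker function and cached in the worker's global environment. The sketch below is only an illustration: make_resource and process_one are hypothetical names, and the list holding a process ID is a stand-in for an object such as the annotator that cannot be shipped from the master.

```r
library(parallel)

# Stand-in for an object that must be built on the worker itself
# (hypothetical; in the example above this role is played by PTA).
make_resource <- function() list(created_in = Sys.getpid())

# Worker function: create the resource on first use, then reuse the copy
# cached in the worker's global environment on later calls.
process_one <- function(x, make) {
    if (!exists(".resource", envir = .GlobalEnv, inherits = FALSE)) {
        assign(".resource", make(), envir = .GlobalEnv)
    }
    res <- get(".resource", envir = .GlobalEnv)
    list(input = x, created_in = res$created_in, ran_in = Sys.getpid())
}

cl <- makeCluster(2)
out <- parLapply(cl, c("a", "b", "c", "d"), process_one, make = make_resource)
stopCluster(cl)

# Each task used a resource that was created in the worker that ran it.
print(all(vapply(out, function(o) o$created_in == o$ran_in, logical(1))))
```

Doing the setup once with clusterEvalQ, as in the answer's code, is simpler; the lazy form is mainly useful when the worker function has to be self-contained.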

If clusterEvalQ fails because Maxent_POS_Tag_Annotator is not found, you might be loading the wrong version of openNLP on the workers. You can determine which package versions the workers are getting by executing sessionInfo with clusterEvalQ:

library(parallel)
cl <- makeCluster(2)
clusterEvalQ(cl, {library(openNLP); library(NLP)})
clusterEvalQ(cl, sessionInfo())

This returns the result of executing sessionInfo() on each of the cluster workers. Here is the version information for some of the packages that I'm using and that work for me:

other attached packages:
[1] NLP_0.1-0     openNLP_0.2-1

loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-1 rJava_0.9-4
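If the full sessionInfo() output is more than you need, a narrower check is to ask each worker for the version of a single package and compare it with the master session. This is a sketch only: pkg_version_string is a hypothetical helper, and "stats" is used as a stand-in so the example runs anywhere; substitute "openNLP" in practice.

```r
library(parallel)

# Hypothetical helper: the installed version of a package, as a string.
pkg_version_string <- function(pkg) as.character(utils::packageVersion(pkg))

cl <- makeCluster(2)

# Run the helper on every worker; "stats" stands in for "openNLP" here.
worker_versions <- unlist(clusterCall(cl, pkg_version_string, "stats"))

stopCluster(cl)

# TRUE when every worker sees the same version as the master session.
print(all(worker_versions == pkg_version_string("stats")))
```

A mismatch here would suggest the workers are picking up a different library path than the interactive session.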
