Memory leak when using package XML on Windows

Problem description

Having read "Memory leaks parsing XML in r" (including the linked posts) and this post on R Help, and given that some time has passed since then, I still think this is an unresolved issue that deserves attention, as the XML package is widely used throughout the R universe.

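Thus, please consider this a follow-up post and/or reference that hopefully gives an informative yet concise illustration of the problem.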

Parsing XML/HTML documents in a way that lets them be searched with XPath afterwards requires the internal use of C pointers (AFAIU). It seems that, at least on MS Windows (I'm running Windows 8.1, 64 bit), these references are not properly recognized by the garbage collector. The consumed memory is therefore not properly released, which leads to the R process freezing at some point.
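To make the point about C pointers a bit more concrete, here is a minimal sketch. It assumes the XML package is installed, and the tiny inline document is made up for illustration. The parsed document is an R-level handle to a tree that libxml2 allocated in C, so R's own bookkeeping only sees the small handle and not the tree behind it.

```r
library(XML)

# A tiny, made-up document parsed from a string rather than a file.
doc <- xmlParse("<root><item>1</item><item>2</item></root>", asText = TRUE)

# The result is an internal (C-level) document handle.
is_internal <- inherits(doc, "XMLInternalDocument")

# object.size() only measures what R itself allocated for the handle,
# not the libxml2 tree it points to.
handle_bytes <- as.numeric(object.size(doc))

# XPath queries run against the C-level tree and copy values into R.
values <- xpathSApply(doc, "//item", xmlValue)

free(doc)  # release the C-level document explicitly
rm(doc)
```

This would explain why a figure reported from within R can stay small while the process footprint seen by the OS keeps growing.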

To me it seems that XML::free and/or gc do not recognize all memory involved when parsing XML/HTML docs via xmlParse or htmlParse and subsequently processing them with xpathApply or the like:

The reported memory usage of the OS task (Rterm.exe) adds up significantly fast, while the memory of the R process as "seen from within R" (function memory.size) increases only moderately by comparison. See the list elements mem_r, mem_os and ratio before and after a substantial parsing cycle below.
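As a side note, memory.size is Windows-only and, as far as I know, no longer functional in recent R releases. A rough, portable stand-in for the R-side figure is the "(Mb)" column reported by gc(). The helper names below (r_mem_mb, mem_ratio) are made up for this sketch and are not part of the profiling code.

```r
# R-side memory in MB as tracked by the garbage collector.
# Column 2 of gc()'s result is the "(Mb)" figure for cells currently in use.
r_mem_mb <- function() {
  sum(gc()[, 2])
}

# Ratio of OS-reported to R-reported memory, like the `ratio` list element.
# `os_mb` has to come from outside R (e.g. the task list); the values used
# below are simply the "before" figures reported in scenario 1.
mem_ratio <- function(os_mb, r_mb) os_mb / r_mb

mem_ratio(51.072, 37.42)
```

The gc() figure is not identical to what memory.size used to report, so absolute numbers will differ. The point is only to have some R-side number to set against the OS-side one.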

All in all, and throwing in everything that has been recommended (free, rm and gc), memory usage still always increases when xmlParse and the like are called. It's just a question of how much. So IMHO there must still be something that's not working correctly.
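For reference, the recommended clean-up pattern looks roughly like the sketch below. It assumes the XML package is installed. The helper name parse_and_release and the inline document are made up for illustration. This is the pattern that still shows growth in the results further down, so it is not a fix.

```r
library(XML)

# Parse, query, and release one document; only the extracted values survive.
parse_and_release <- function(x, path, as_text = FALSE) {
  doc <- xmlParse(x, asText = as_text)
  # Make sure the C-level document is freed even if the XPath step fails.
  on.exit(free(doc), add = TRUE)
  # xmlValue copies the text into ordinary R character vectors.
  xpathSApply(doc, path, xmlValue)
}

xml_text_input <- "<cars><car>Mazda RX4</car><car>Datsun 710</car></cars>"
for (i in 1:100) {
  res <- parse_and_release(xml_text_input, "//car", as_text = TRUE)
}
invisible(gc())  # collect whatever R itself can see
```

Returning plain character vectors rather than node objects keeps references to the C-level tree from escaping the function.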

I borrowed the profiling code from Duncan's Omegahat git repository.

Some preparations:

Sys.setenv("LANGUAGE"="en")   
require("compiler")
require("XML")

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] XML_3.98-1.1

The functions we need:

getTaskMemoryByPid <- cmpfun(function(
    pid=Sys.getpid()
) {
    # Inner quotes and regex backslashes must be escaped inside R strings
    cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
    mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
    mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
    mem
}, options=list(suppressAll=TRUE))  

memoryLeak <- cmpfun(function(
    x=system.file("exampleData", "mtcars.xml", package="XML"),
    n=10000,
    use_text=FALSE,
    xpath=FALSE,
    free_doc=FALSE,
    clean_up=FALSE,
    detailed=FALSE
) {
    if(use_text) {
        x <- readLines(x)
    }
    ## Before //
    mem_os  <- getTaskMemoryByPid()
    mem_r   <- memory.size()
    prof_1  <- memory.profile()
    mem_before <- list(mem_r=mem_r,
        mem_os=mem_os, ratio=mem_os/mem_r)

    ## Per run //
    mem_perrun <- lapply(1:n, function(ii) {
        doc <- xmlParse(x, asText=use_text)
        if (xpath) {
            res <- xpathApply(doc=doc, path="/blah", fun=xmlValue)
            rm(res)
        }
        if (free_doc) {
            free(doc)
        }
        rm(doc)
        out <- NULL
        if (detailed) {
            out <- list(
                profile=memory.profile(),
                size=memory.size()
            )
        } 
        out
    })
    has_perrun <- any(sapply(mem_perrun, length) > 0)
    if (!has_perrun) {
        mem_perrun <- NULL
    } 

    ## Garbage collect //
    mem_gc <- NULL
    if(clean_up) {
        gc()
        tmp <- gc()
        mem_gc <- list(gc_mb=tmp["Ncells", "(Mb)"])
    }

    ## After //
    mem_os  <- getTaskMemoryByPid()
    mem_r   <- memory.size()
    prof_2  <- memory.profile()
    mem_after <- list(mem_r=mem_r,
        mem_os=mem_os, ratio=mem_os/mem_r)
    list(
        before=mem_before, 
        perrun=mem_perrun, 
        gc=mem_gc, 
        after=mem_after, 
        comparison_r=data.frame(
            before=prof_1, 
            after=prof_2, 
            increase=round((prof_2/prof_1)-1, 4)
        ),
        increase_r=(mem_after$mem_r/mem_before$mem_r)-1,
        increase_os=(mem_after$mem_os/mem_before$mem_os)-1
    )
}, options=list(suppressAll=TRUE))  
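The string clean-up in getTaskMemoryByPid strips "." as the thousands separator, which matches the German locale shown in sessionInfo(). Here is a small sketch of a locale-independent variant. The sample CSV is made up to mimic the tasklist layout, and parse_tasklist_mem is a hypothetical helper name.

```r
# Made-up sample mimicking `tasklist /FO csv` output; the fifth column
# holds the memory usage as a formatted string.
sample_csv <- c(
  '"Image Name","PID","Session Name","Session#","Mem Usage"',
  '"Rterm.exe","1234","Console","1","51.072 K"'
)

parse_tasklist_mem <- function(lines) {
  mem <- read.csv(text = lines, stringsAsFactors = FALSE)[, 5]
  # Drop everything that is not a digit, so "." or "," separators both work.
  as.numeric(gsub("[^0-9]", "", mem)) / 1000
}

parse_tasklist_mem(sample_csv)
```

Keeping the parsing separate from the shell call also makes it possible to check the clean-up on machines without tasklist.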


Results

Scenario 1

Quick facts: garbage collection enabled, XML doc is parsed n times but not searched via xpathApply.

Notice the ratios of OS memory vs. R memory:

Before: 1.364832

After: 1.322702

res <- memoryLeak(clean_up=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-1.rdata"))

> res
$before
$before$mem_r
[1] 37.42

$before$mem_os
[1] 51.072

$before$ratio
[1] 1.364832


$perrun
NULL

$gc
$gc$gc_mb
[1] 45


$after
$after$mem_r
[1] 63.21

$after$mem_os
[1] 83.608

$after$ratio
[1] 1.322702


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7387   7392   0.0007
pairlist    190383 390633   1.0518
closure       5077  55085   9.8499
environment   1032  51032  48.4496
promise       5226 105226  19.1351
language     54675  54791   0.0021
special         44     44   0.0000
builtin        648    648   0.0000
char          8746   8763   0.0019
logical       9081   9084   0.0003
integer      22804  22807   0.0001
double        2773   2783   0.0036
complex          1      1   0.0000
character    44522  94569   1.1241
...              0      0      NaN
any              0      0      NaN
list         19946  19951   0.0003
expression       1      1   0.0000
bytecode     16049  16050   0.0001
externalptr   1487   1487   0.0000
weakref        391    391   0.0000
raw            392    392   0.0000
S4            1392   1392   0.0000

$increase_r
[1] 0.6892036

$increase_os
[1] 0.6370614

Scenario 2

Quick facts: garbage collection enabled, free is explicitly called, XML doc is parsed n times but not searched via xpathApply.

Notice the ratios of OS memory vs. R memory:

Before: 1.315249

After: 1.222143

res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-2.rdata"))
> res

$before    
$before$mem_r
[1] 63.48

$before$mem_os
[1] 83.492

$before$ratio
[1] 1.315249


$perrun
NULL

$gc
$gc$gc_mb
[1] 69.3


$after
$after$mem_r
[1] 95.92

$after$mem_os
[1] 117.228

$after$ratio
[1] 1.222143


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7454   7454   0.0000
pairlist    392455 592466   0.5096
closure      55104 105104   0.9074
environment  51032 101032   0.9798
promise     105226 205226   0.9503
language     55592  55592   0.0000
special         44     44   0.0000
builtin        648    648   0.0000
char          8847   8848   0.0001
logical       9141   9141   0.0000
integer      23109  23111   0.0001
double        2802   2807   0.0018
complex          1      1   0.0000
character    94775 144781   0.5276
...              0      0      NaN
any              0      0      NaN
list         20174  20177   0.0001
expression       1      1   0.0000
bytecode     16265  16265   0.0000
externalptr   1488   1487  -0.0007
weakref        392    391  -0.0026
raw            393    392  -0.0025
S4            1392   1392   0.0000

$increase_r
[1] 0.5110271

$increase_os
[1] 0.4040627

Scenario 3

Quick facts: garbage collection enabled, free is explicitly called, XML doc is parsed n times and searched via xpathApply each time.

Notice the ratios of OS memory vs. R memory:

Before: 1.220429

After: 13.15629 (!)

res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, xpath=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-3.rdata"))
res
$before
$before$mem_r
[1] 95.94

$before$mem_os
[1] 117.088

$before$ratio
[1] 1.220429


$perrun
NULL

$gc
$gc$gc_mb
[1] 93.4


$after
$after$mem_r
[1] 124.64

$after$mem_os
[1] 1639.8

$after$ratio
[1] 13.15629


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7454   7460   0.0008
pairlist    592458 793042   0.3386
closure     105104 155110   0.4758
environment 101032 151032   0.4949
promise     205226 305226   0.4873
language     55592  55882   0.0052
special         44     44   0.0000
builtin        648    648   0.0000
char          8847   8867   0.0023
logical       9142   9162   0.0022
integer      23109  23112   0.0001
double        2802   2832   0.0107
complex          1      1   0.0000
character   144775 194819   0.3457
...              0      0      NaN
any              0      0      NaN
list         20174  20177   0.0001
expression       1      1   0.0000
bytecode     16265  16265   0.0000
externalptr   1488   1487  -0.0007
weakref        392    391  -0.0026
raw            393    392  -0.0025
S4            1392   1392   0.0000

$increase_r
[1] 0.2991453

$increase_os
[1] 13.00485
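To put the scenario 3 numbers in perspective, the reported figures can be reworked as follows. The inputs are taken from the output above. The per-iteration figure is just a derived average, not something measured per run.

```r
n <- 50000
before <- list(mem_r = 95.94,  mem_os = 117.088)
after  <- list(mem_r = 124.64, mem_os = 1639.8)

ratio_after <- after$mem_os / after$mem_r        # 13.15629
increase_os <- after$mem_os / before$mem_os - 1  # 13.00485
increase_r  <- after$mem_r / before$mem_r - 1    # 0.2991453

# Average growth of the OS-level footprint per parse-and-search iteration,
# in the same units that getTaskMemoryByPid reports.
growth_per_iter <- (after$mem_os - before$mem_os) / n
```

That works out to roughly 0.03 MB (about 30 KB) of unreleased memory per iteration on the OS side, while the R-side figure grows by only about 30% overall.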


I also tried different versions. Well, I tried to try ;-)

FYI: the latest Rtools 3.1 is installed and included in the Windows PATH (e.g. installing stringr from source worked just fine).

> install.packages("XML", repos="http://www.omegahat.org/R", type="source")
trying URL 'http://www.omegahat.org/R/src/contrib/XML_3.98-1.tar.gz'
Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
opened URL
downloaded 1.5 Mb

* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'

The downloaded source packages are in
    'C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\downloaded_packages'
Warning messages:
1: running command '"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" CMD INSTALL -l "R:\home\apps\lsqmapps\apps\r\R-3.1.0\library" C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/downloaded_packages/XML_3.98-1.tar.gz' had status 1 
2: In install.packages("XML", repos = "http://www.omegahat.org/R",  :
  installation of package 'XML' had non-zero exit status

GitHub

I did not follow the recommendations in the README of the GitHub repo, as it points to this directory, which only contains a tar.gz of version 3.94-0 (while we're at 3.98-1.1 on CRAN).

Even though it is stated that the GitHub repo is not in a standard R package structure, I tried it anyway with install_github - and failed ;-)

require("devtools")
> install_github(repo="XML", username="omegahat")
Installing github repo XML/master from omegahat
Downloading master.zip from https://github.com/omegahat/XML/archive/master.zip
Installing package from C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/master.zip
Installing XML
"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" --vanilla CMD INSTALL  
  "C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\devtools15c82d7c2b4c\XML-master"  
  --library="R:/home/apps/lsqmapps/apps/r/R-3.1.0/library" --with-keep.source  
  --install-tests 

* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
Error: Command failed (1)

Recommended answer

Whilst it is still in its infancy (only a couple of months old!) and has a few quirks, Hadley Wickham has written a library for XML parsing, xml2, that can be found on GitHub at https://github.com/hadley/xml2. It is restricted to reading rather than writing XML, but I've been experimenting with it for parsing and it looks like it will do the job, without the memory leaks of the XML package! It provides functions including:

  • read_xml() to read an XML file
  • xml_children() to get the child nodes of a node
  • xml_text() to get the text within a tag
  • xml_attrs() to get a character vector of the attributes and values of a node, which can be cast to a named list with as.list()

Note that you still need to make sure that you rm() the XML node objects once you're done with them and force a garbage collection with gc(), but the memory then does actually get released to the O/S (disclaimer: only tested on Windows 7, but this seems to be the most "memory leaky" platform anyway).
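A minimal sketch of how those functions fit together, assuming xml2 is installed. The inline document and the variable names are made up for illustration.

```r
library(xml2)

# Made-up inline document; read_xml() also accepts a file path or URL.
doc <- read_xml(
  '<cars><car name="Mazda RX4" cyl="6">21.0</car><car name="Datsun 710" cyl="4">22.8</car></cars>'
)

nodes <- xml_children(doc)                  # child nodes of the root
mpg   <- as.numeric(xml_text(nodes))        # text inside each <car> tag
attrs <- lapply(xml_attrs(nodes), as.list)  # one named list per node

# Drop the node objects and force a collection once done with them.
rm(nodes, doc)
invisible(gc())
```

Only plain R vectors and lists (mpg, attrs) are kept around afterwards, so nothing still points at the parsed document.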

Hope this helps someone!
