Serious Memory Leak When Iteratively Parsing XML Files


Problem description


When iterating over a set of Rdata files (each containing a character vector of HTML code) that are loaded, analyzed (via XML functionality) and then removed from memory again, I experience a significant increase in an R process' memory consumption (killing the process eventually).

It seems as if

  • freeing objects via free()
  • removing them via rm()
  • running gc()

do not have any effects, so the memory consumption cumulates until there's no more memory left.


Thanks to valuable insight shared by the author and maintainer of the XML package, Duncan Temple Lang (again: I really appreciate it very much!), the problem seems to be closely related to the way external pointers are freed and how garbage collection is handled in the XML package. Duncan issued a bug-fixed version of the package (3.92-0) that consolidates certain aspects of parsing XML and HTML and features improved garbage collection, so that it is no longer necessary to explicitly free the object containing the external pointer via free(). You can find the source code and a Windows binary at Duncan's Omegahat website.


Unfortunately, the new package version still does not seem to fix the issues I'm encountering in the little example that I've put together. I followed some suggestions and simplified the example a bit, making it easier to grasp and to find the relevant functions where things seem to go wrong (check functions ./lib/exampleRun.R and ./lib/scrape.R).


Duncan suggested trying to explicitly force the freeing of the parsed document via .Call("RS_XML_forceFreeDoc", html). I've included a logical switch in the example (do.forcefree in script ./scripts/memory.R) that, if set to TRUE, will do just that. Unfortunately, this made my R console crash. It'd be great if someone could verify this on their machine! Actually, the doc should be freed automatically when using the latest version of XML (see above). The fact that it isn't seems to be a bug (according to Duncan).
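For illustration, a minimal sketch of how that switch could be wired up, assuming obj is the character vector of HTML code loaded from one of the Rdata files (only do.forcefree and the .Call() come from the actual script; the rest is illustrative):

    library(XML)

    do.forcefree <- TRUE                    # the switch at line 22 of ./scripts/memory.R
    html <- htmlParse(obj[1], asText = TRUE, addFinalizer = TRUE)
    # ... process the document ...
    if (do.forcefree) {
        # explicitly force freeing of the parsed document (this is what crashed my R console)
        .Call("RS_XML_forceFreeDoc", html)
    }
    rm(html)
    gc()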


Duncan pushed yet another version of XML (3.92-1) to his Omegahat website. This should fix the issue in general. However, I seem to be out of luck with my example as I still experience the same memory leakage.


YES! Duncan found and fixed the bug! It was a little typo in a Windows-only script, which explains why the bug didn't show up on Linux, Mac OS etc. Check out the latest version 3.92-2! Memory consumption is now as constant as can be when iteratively parsing and processing XML files!


Special thanks again to Duncan Temple Lang and thanks to everyone else that responded to this question!

To reproduce the issue:
  1. Download folder 'memory' from my Github repo.
  2. Open up the script ./scripts/memory.R and set a) your working directory at line 6, b) the example scope at line 16, and c) whether to force the freeing of the parsed doc or not at line 22. Note that you can still find the old scripts; they are "tagged" by a "LEGACY" suffix at the end of the filename.
  3. Run the script.
  4. Investigate the latest file ./memory_<TIMESTAMP>.txt to see the increase in logged memory states over time. I've included two text files that resulted from my own test runs. (A sketch of how such logging could be done follows below.)
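For reference, here is a minimal sketch of what such per-iteration memory logging could look like; the actual ./scripts/memory.R may differ, and the helper name log.memory.state is made up for illustration:

    # build a timestamped log file name, e.g. ./memory_20120213_234200.txt
    log.file <- sprintf("./memory_%s.txt", format(Sys.time(), "%Y%m%d_%H%M%S"))

    log.memory.state <- function(path) {
        # sum of the "(Mb)" column of gc(): total memory currently used by R, in MB
        mem.used.mb <- sum(gc()[, 2])
        cat(mem.used.mb, "\n", file = path, append = TRUE)
    }

    # call log.memory.state(log.file) at the end of each iteration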

Things I've done with respect to memory control

    • making sure a loaded object is removed again via rm() at the end of each iteration.
    • When parsing XML files, I've set the argument addFinalizer=TRUE, removed all R objects that have a reference to the parsed XML doc before freeing the C pointer via free(), and removed the object containing the external pointer (see the sketch after this list).
    • adding a gc() here and there.
    • trying to follow the advice in Duncan Temple Lang's notes on memory management when using his XML package (I have to admit, though, that I did not fully comprehend what's stated there)

      EDIT 2012-02-13 23:42:00: As I pointed out above, explicit calls to free() followed by rm() should not be necessary anymore, so I commented these calls out.
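A minimal sketch of the per-iteration pattern these points describe, assuming obj is the character vector of HTML code contained in each Rdata file and scrape() is the function from ./lib/scrape.R (the loop structure itself is illustrative, not the actual code from the repo):

    library(XML)

    for (rdata.file in rdata.files) {       # rdata.files: assumed vector of .rdata paths
        load(rdata.file)                    # loads 'obj'
        for (x.html in seq_along(obj)) {
            html <- htmlTreeParse(file = obj[x.html], useInternalNodes = TRUE,
                                  addFinalizer = TRUE)
            res <- scrape(html)
            # free(html)                    # commented out: not needed as of XML >= 3.92-0
            rm(html, res)                   # drop all references to the parsed doc
            gc()                            # explicit garbage collection in each iteration
        }
        rm(obj)
        gc()
    }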

System Info

      • Windows XP 32 Bit, 4 GB RAM
      • Windows 7 32 Bit, 2 GB RAM
      • Windows 7 64 Bit, 4 GB RAM
      • R 2.14.1
      • XML 3.9-4
      • XML 3.92-0, see http://www.omegahat.org/RSXML/
Findings so far:
      1. Running the webscraping scenario on several machines (see section "System Info" above) always busts the memory consumption of my R process after about 180 - 350 iterations (depending on OS and RAM).
      2. Running the plain rdata scenario yields constant memory consumption if and only if you make an explicit call to the garbage collector via gc() in each iteration (see the sketch below); otherwise you experience the same behavior as in the webscraping scenario.
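For the plain rdata scenario, the loop boils down to something like this sketch (names are illustrative); dropping the gc() call is enough to reproduce the growing memory footprint:

    for (rdata.file in rdata.files) {       # rdata.files: assumed vector of .rdata paths
        load(rdata.file)                    # loads 'obj', the character vector of HTML code
        rm(obj)
        gc()                                # without this explicit call, memory consumption grows here too
    }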

Questions

      1. Any idea what's causing the memory increase?
      2. Any ideas how to work around this?


Findings as of 2012-02-13 23:44:00

      Running the example in ./scripts/memory.R on several machines (see section "System Info" above) still busts the memory consumption of my R process after about 180 - 350 iterations (depending on OS and RAM).


      There's still an evident increase in memory consumption and even though it may not appear to be that much when just looking at the numbers, my R processes always died at some point due to this.


      Below, I've posted a couple of time series that resulted from running my example on a WinXP 32 Bit box with 2 GB RAM:


      29.07 33.32 30.55 35.32 30.76 30.94 31.13 31.33 35.44 32.34 33.21 32.18 35.46 35.73 35.76 35.68 35.84 35.6 33.49 33.58 33.71 33.82 33.91 34.04 34.15 34.23 37.85 34.68 34.88 35.05 35.2 35.4 35.52 35.66 35.81 35.91 38.08 36.2


      28.54 30.13 32.95 30.33 30.43 30.54 35.81 30.99 32.78 31.37 31.56 35.22 31.99 32.22 32.55 32.66 32.84 35.32 33.59 33.32 33.47 33.58 33.69 33.76 33.87 35.5 35.52 34.24 37.67 34.75 34.92 35.1 37.97 35.43 35.57 35.7 38.12 35.98

      [...]
      Scraping html page 30 of ~/data/rdata/132.rdata
      Scraping html page 31 of ~/data/rdata/132.rdata
      error : Memory allocation failed : growing buffer
      error : Memory allocation failed : growing buffer
      I/O error : write error
      Scraping html page 32 of ~/data/rdata/132.rdata
      Fehler in htmlTreeParse(file = obj[x.html], useInternalNodes = TRUE, addFinalizer =     TRUE): 
       error in creating parser for (null)
      > Synch18832464393836
      


      TS_3 (XML 3.92-0, 2012-02-13)

      20.1 24.14 24.47 22.03 25.21 25.54 23.15 23.5 26.71 24.6 27.39 24.93 28.06 25.64 28.74 26.36 29.3 27.07 30.01 27.77 28.13 31.13 28.84 31.79 29.54 32.4 30.25 33.07 30.96 33.76 31.66 34.4 32.37 35.1 33.07 35.77 38.23 34.16 34.51 34.87 35.22 35.58 35.93 40.54 40.9 41.33 41.6

      [...]
      ---------- status: 31.33 % ----------
      
      Scraping html page 1 of 50
      Scraping html page 2 of 50
      [...]
      Scraping html page 36 of 50
      Scraping html page 37 of 50
      Fehler: 1: Memory allocation failed : growing buffer
      2: Memory allocation failed : growing buffer
      


EDIT 2012-02-17: Please help me verify the counter values

You'd do me a huge favor if you could run the code below. It won't take more than 2 minutes of your time. All you need to do is

      1. Download an Rdata file and save it as seed.Rdata.
      2. Download the script containing my scraping function and save it as scrape.R.
      3. Source the following code after setting the working directory accordingly.

Code:

      setwd("set/path/to/your/wd")
      install.packages("XML", repos="http://www.omegahat.org/R")
      library(XML)
      source("scrape.R")
      load("seed.rdata")
      html <- htmlParse(obj[1], asText = TRUE)
      counter.1 <- .Call("R_getXMLRefCount", html)
      print(counter.1)
      z <- scrape(html)
      gc()
      gc()
      counter.2 <- .Call("R_getXMLRefCount", html)
      print(counter.2)
      rm(html)
      gc()
      gc()
      


      I'm particularly interested in the values of counter.1 and counter.2 which should be 1 in both calls. In fact, it is on all machines that Duncan has tested this on. However, as it turns out counter.2 has value 259 on all of my machines (see details above) and that's exactly what's causing my problem.

Answer


      From the XML package's webpage, it seems that the author, Duncan Temple Lang, has quite extensively described certain memory management issues. See this page: "Memory Management in the XML Package".


      Honestly, I'm not proficient in the details of what's going on here with your code and the package, but I think you'll either find the answer in that page, specifically in the section called "Problems", or in direct communication with Duncan Temple Lang.


Update 1. An idea that might work is to use the multicore and foreach packages (i.e. listResults = foreach(ix = 1:N) %dopar% {your processing; return(listElement)}). I think that for Windows you'll need doSMP, or maybe doRedis; under Linux, I use doMC. In any case, by parallelizing the loading you'll get faster throughput. The reason I think you may get some benefit in memory usage is that forking R could lead to different memory cleaning, as each spawned process gets killed when complete. This isn't guaranteed to work, but it could address both the memory and the speed issues.
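A minimal sketch of that idea, assuming one Rdata file per task and a Linux box with doMC (the use of scrape.R and obj mirrors the question; the file paths and the rest are illustrative, not a definitive implementation):

    library(foreach)
    library(doMC)                           # Linux; the answer suggests doSMP or doRedis on Windows
    library(XML)
    registerDoMC(cores = 2)

    source("scrape.R")                      # provides scrape(), as in the question
    rdata.files <- list.files("./data/rdata", full.names = TRUE)   # assumed location of the .rdata files

    # Workers are forked for the duration of the loop and exit when it completes,
    # which may release memory that the XML parser would otherwise hold on to.
    listResults <- foreach(ix = seq_along(rdata.files)) %dopar% {
        load(rdata.files[ix])               # loads 'obj', a character vector of HTML code
        html <- htmlParse(obj[1], asText = TRUE)
        res <- scrape(html)
        rm(html); gc()
        res                                 # becomes the corresponding list element
    }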


Note, though: doSMP has its own idiosyncrasies (i.e. you may still have some memory issues with it). There have been other Q&As on SO that mentioned some issues, but I'd still give it a shot.

