Serious Memory Leak When Iteratively Parsing XML Files
Problem description
When iterating over a set of Rdata files (each containing a character vector of HTML code) that are loaded, analyzed (via XML functionality) and then removed from memory again, I experience a significant increase in an R process' memory consumption (killing the process eventually).
It seems as if

- freeing objects via free(),
- removing them via rm(), and
- running gc()

do not have any effect, so the memory consumption accumulates until there's no more memory left.
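To make this concrete, here is a minimal sketch of the kind of loop in question. The object name obj and the free()/rm()/gc() cleanup follow my example; the file layout and the analysis step are illustrative placeholders:

library(XML)

rdata.files <- list.files("data/rdata", full.names = TRUE)
for (f in rdata.files) {
    load(f)  # creates 'obj', a character vector of HTML code
    for (x in seq_along(obj)) {
        doc <- htmlParse(obj[x], asText = TRUE, addFinalizer = TRUE)
        # ... analyze 'doc' via XML functionality ...
        free(doc)   # release the C-level document
        rm(doc)     # drop the R reference to the external pointer
    }
    rm(obj)         # remove the loaded object again
    gc()            # explicit garbage collection at the end of each iteration
}
# despite free()/rm()/gc(), memory consumption keeps growing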
Thanks to valuable insight shared by the author and maintainer of package XML, Duncan Temple Lang (again: I really appreciate it very much!), the problem seems to be closely related to the way external pointers are freed and how garbage collection is handled in the XML package. Duncan issued a bug-fixed version of the package (3.92-0) that consolidated certain aspects of parsing XML and HTML and features improved garbage collection, so it is no longer necessary to explicitly free the object containing the external pointer via free(). You can find the source code and a Windows binary on Duncan's Omegahat website.
Unfortunately, the new package version still does not seem to fix the issues I'm encountering in the little example that I've put together. I followed some suggestions and simplified the example a bit, making it easier to grasp and to find the relevant functions where things seem to go wrong (check the functions in ./lib/exampleRun.R and ./lib/scrape.R).
Duncan suggested trying to free the parsed document explicitly via .Call("RS_XML_forceFreeDoc", html). I've included a logical switch in the example (do.forcefree in script ./scripts/memory.R) that, if set to TRUE, will do just that. Unfortunately, this made my R console crash. It'd be great if someone could verify this on their machine! Actually, the doc should be freed automatically when using the latest version of XML (see above). The fact that it isn't seems to be a bug (according to Duncan).
Duncan pushed yet another version of XML (3.92-1) to his Omegahat website. This should fix the issue in general. However, I seem to be out of luck with my example, as I still experience the same memory leakage.
YES! Duncan found and fixed the bug! It was a little typo in a Windows-only script, which explains why the bug didn't show on Linux, Mac OS etc. Check out the latest version, 3.92-2! Memory consumption is now as constant as can be when iteratively parsing and processing XML files!
Special thanks again to Duncan Temple Lang, and thanks to everyone else who responded to this question!
To reproduce the issue:

- Download the folder 'memory' from my Github repo.
- Open up the script ./scripts/memory.R and set a) your working directory at line 6, b) the example scope at line 16, and c) whether to force the freeing of the parsed doc at line 22. Note that you can still find the old scripts; they are tagged by a "LEGACY" suffix at the end of the filename.
- Run the script.
- Investigate the latest file ./memory_<TIMESTAMP>.txt to see the increase in logged memory states over time. I've included two text files that resulted from my own test runs.
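For a quick visual check of a logged series, something like the following should do. This assumes the log file simply contains one numeric memory value per line; substitute the actual timestamped filename your run produces:

mem <- as.numeric(readLines("memory_<TIMESTAMP>.txt"))  # adjust the filename
plot(mem, type = "l", xlab = "iteration", ylab = "memory consumption")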
Things I've done with respect to memory control
- Making sure a loaded object is removed again via rm() at the end of each iteration.
- When parsing XML files, I've set the argument addFinalizer=TRUE, removed all R objects that have a reference to the parsed XML doc before freeing the C pointer via free(), and removed the object containing the external pointer.
- Adding a gc() here and there.
- Trying to follow the advice in Duncan Temple Lang's notes on memory management when using the XML package (I have to admit, though, that I did not fully comprehend what's stated there).
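Put together, the cleanup order I've been using looks roughly like this. It is a sketch with illustrative names; getNodeSet() stands in for whatever analysis step creates node references:

library(XML)

doc   <- htmlParse(obj[1], asText = TRUE, addFinalizer = TRUE)
nodes <- getNodeSet(doc, "//a")   # R objects referencing into 'doc'
# drop everything that points into the document *before* freeing it
rm(nodes)
free(doc)   # free the C pointer
rm(doc)     # remove the object containing the external pointer
gc()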
System Info
- Windows XP 32 Bit, 4 GB RAM
- Windows 7 32 Bit, 2 GB RAM
- Windows 7 64 Bit, 4 GB RAM
- R 2.14.1
- XML 3.9-4
- XML 3.92-0 as found at http://www.omegahat.org/RSXML/
Findings

- Running the webscraping scenario on several machines (see section "System Info" above) always busts the memory consumption of my R process after about 180-350 iterations (depending on OS and RAM).
- Running the plain rdata scenario yields constant memory consumption if and only if you set an explicit call to the garbage collector via gc() in each iteration; otherwise you experience the same behavior as in the webscraping scenario.
EDIT 2012-02-13 23:42:00:
As I pointed out above, explicit calls to free() followed by rm() should not be necessary anymore, so I commented these calls out.
Questions

- Any ideas what's causing the memory increase?
- Any ideas how to work around this?
Findings as of 2012-02-13 23:44:00

Running the example in ./scripts/memory.R on several machines (see section "System Info" above) still busts the memory consumption of my R process after about 180-350 iterations (depending on OS and RAM).
There's still an evident increase in memory consumption and even though it may not appear to be that much when just looking at the numbers, my R processes always died at some point due to this.
Below, I've posted a couple of time series that resulted from running my example on a WinXP 32 Bit box with 2 GB RAM:
29.07 33.32 30.55 35.32 30.76 30.94 31.13 31.33 35.44 32.34 33.21 32.18 35.46 35.73 35.76 35.68 35.84 35.6 33.49 33.58 33.71 33.82 33.91 34.04 34.15 34.23 37.85 34.68 34.88 35.05 35.2 35.4 35.52 35.66 35.81 35.91 38.08 36.2
28.54 30.13 32.95 30.33 30.43 30.54 35.81 30.99 32.78 31.37 31.56 35.22 31.99 32.22 32.55 32.66 32.84 35.32 33.59 33.32 33.47 33.58 33.69 33.76 33.87 35.5 35.52 34.24 37.67 34.75 34.92 35.1 37.97 35.43 35.57 35.7 38.12 35.98
[...]
Scraping html page 30 of ~/data/rdata/132.rdata
Scraping html page 31 of ~/data/rdata/132.rdata
error : Memory allocation failed : growing buffer
error : Memory allocation failed : growing buffer
I/O error : write error
Scraping html page 32 of ~/data/rdata/132.rdata
Fehler in htmlTreeParse(file = obj[x.html], useInternalNodes = TRUE, addFinalizer = TRUE):
error in creating parser for (null)
> Synch18832464393836
TS_3 (XML 3.92-0, 2012-02-13)
20.1 24.14 24.47 22.03 25.21 25.54 23.15 23.5 26.71 24.6 27.39 24.93 28.06 25.64 28.74 26.36 29.3 27.07 30.01 27.77 28.13 31.13 28.84 31.79 29.54 32.4 30.25 33.07 30.96 33.76 31.66 34.4 32.37 35.1 33.07 35.77 38.23 34.16 34.51 34.87 35.22 35.58 35.93 40.54 40.9 41.33 41.6
[...]
---------- status: 31.33 % ----------
Scraping html page 1 of 50
Scraping html page 2 of 50
[...]
Scraping html page 36 of 50
Scraping html page 37 of 50
Fehler: 1: Memory allocation failed : growing buffer
2: Memory allocation failed : growing buffer
Edit 2012-02-17: please help me verify the counter value
You'd do me a huge favor if you could run the following code. It won't take more than 2 minutes of your time. All you need to do is
- Download an Rdata file and save it as seed.Rdata.
- Download the script containing my scraping function and save it as scrape.R.
- Source the following code after setting the working directory accordingly.
Code:
setwd("set/path/to/your/wd")
install.packages("XML", repos = "http://www.omegahat.org/R")  # bug-fixed version from Omegahat
library(XML)
source("scrape.R")
load("seed.rdata")            # creates 'obj', a character vector of HTML code
html <- htmlParse(obj[1], asText = TRUE)
counter.1 <- .Call("R_getXMLRefCount", html)   # reference count of the external pointer
print(counter.1)              # expected: 1
z <- scrape(html)
gc()                          # collect twice to make sure finalizers have run
gc()
counter.2 <- .Call("R_getXMLRefCount", html)
print(counter.2)              # expected: 1 again after garbage collection
rm(html)
gc()
gc()
I'm particularly interested in the values of counter.1 and counter.2, which should be 1 in both calls. In fact, they are on all machines that Duncan has tested this on. However, as it turns out, counter.2 has the value 259 on all of my machines (see details above), and that's exactly what's causing my problem.
Answer
From the XML package's webpage, it seems that the author, Duncan Temple Lang, has quite extensively described certain memory management issues. See this page: "Memory Management in the XML Package".
Honestly, I'm not proficient in the details of what's going on here with your code and the package, but I think you'll either find the answer on that page or in direct communication with Duncan Temple Lang.
Update 1. An idea that might work is to use the multicore and foreach packages (i.e. listResults = foreach(ix = 1:N) %dopar% {your processing; return(listElement)}). I think that for Windows you'll need doSMP, or maybe doRedis; under Linux, I use doMC. In any case, by parallelizing the loading you'll get faster throughput. The reason I think you may get some benefit in memory usage is that forking R could lead to different memory cleaning, as each spawned process gets killed when complete. This isn't guaranteed to work, but it could address both the memory and the speed issues.
Note, though: doSMP has its own idiosyncrasies (i.e. you may still have some memory issues with it). There have been other Q&As on SO that mentioned some issues, but I'd still give it a shot.