Serious Memory Leak When Iteratively Parsing XML Files


Problem description

When iterating over a set of Rdata files (each containing a character vector of HTML code) that are loaded, analyzed (via XML functionality) and then removed from memory again, I see a significant increase in the R process's memory consumption, which eventually kills the process.

It seems that

  • freeing objects via free(),
  • removing them via rm(), and
  • running gc()

have no effect, so memory consumption accumulates until there is no memory left.
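The cleanup pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the question: the helper name scrape_one, the sample HTML, and the XPath expression are assumptions made for the example, and it only uses htmlParse(), xpathSApply(), xmlValue() and free() from the XML package.

```r
library(XML)

# Hypothetical helper (not from the original scripts): parse one HTML string,
# pull out plain R values, then release the C-level document explicitly.
scrape_one <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)   # internal libxml2 document
  vals <- xpathSApply(doc, "//a", xmlValue)    # copy values into R objects
  free(doc)                                    # release the external pointer
  rm(doc)                                      # drop the R-level reference
  vals
}

# Stand-in for the character vectors stored in the Rdata files (made-up data)
pages <- rep("<html><body><a href='#'>link</a></body></html>", 100)

results <- vector("list", length(pages))
for (i in seq_along(pages)) {
  results[[i]] <- scrape_one(pages[i])
  if (i %% 25 == 0) invisible(gc())            # periodic garbage collection
}
```

Only plain character values leave the function, so no R object keeps a reference to the parsed document after free() has been called.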

Thanks to valuable insight from Duncan Temple Lang, the author and maintainer of the XML package (again: I really appreciate it!), the problem seems to be closely related to how external pointers are freed and how garbage collection is handled in the XML package. Duncan issued a bug-fix version of the package (3.92-0) that consolidates certain aspects of parsing XML and HTML and features improved garbage collection, so that it is no longer necessary to explicitly free the object containing the external pointer via free(). You can find the source code and a Windows binary on Duncan's Omegahat website.

Unfortunately, the new package version still does not seem to fix the issues I'm encountering in the little example that I've put together. I followed some suggestions and simplified the example a bit, making it easier to grasp and to find the relevant functions where things seem to go wrong (check the functions in ./lib/exampleRun.R and ./lib/scrape.R).

Duncan suggested forcing the parsed document to be freed explicitly via .Call("RS_XML_forceFreeDoc", html). I've included a logical switch in the example (do.forcefree in script ./scripts/memory.R) that, if set to TRUE, does just that. Unfortunately, this made my R console crash. It would be great if someone could verify this on their machine! Actually, the document should be freed automatically when using the latest version of XML (see above). According to Duncan, the fact that it isn't seems to be a bug.

Duncan pushed yet another version of XML (3.92-1) to his Omegahat website. This should fix the issue in general. However, I seem to be out of luck with my example, as I still experience the same memory leak.

YES! Duncan found and fixed the bug! It was a small typo in a Windows-only script, which explains why the bug didn't show up on Linux, Mac OS, etc. Check out the latest version, 3.92-2! Memory consumption is now as constant as can be when iteratively parsing and processing XML files!

Special thanks again to Duncan Temple Lang, and thanks to everyone else who responded to this question!

To reproduce the problem:

  1. Download the folder 'memory' from my Github repo.
  2. Open the script ./scripts/memory.R and set a) your working directory at line 6, b) the example scope at line 16, and c) whether or not to force the freeing of the parsed doc at line 22. Note that you can still find the old scripts; they are tagged with "LEGACY" at the end of the filename.
  3. Run the script.
  4. Inspect the latest file ./memory_<TIMESTAMP>.txt to see the increase in logged memory states over time. I've included two text files that resulted from my own test runs.
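For readers who want to log memory states in a similar way without the original scripts, here is a rough base-R sketch. The function name log_memory, the log line format, and writing to tempdir() are assumptions made for this example; the actual logging in ./scripts/memory.R may work differently. Note that gc() only reports memory managed by R itself, so memory held by C-level libraries such as libxml2 will not appear in these numbers and has to be watched at the process level (for example with the operating system's task manager).

```r
# Hypothetical logger: appends R-level memory use (in Mb) to a timestamped file.
log_file <- file.path(
  tempdir(),
  sprintf("memory_%s.txt", format(Sys.time(), "%Y%m%d_%H%M%S"))
)

log_memory <- function(step, file = log_file) {
  mem <- gc()                    # rows: Ncells, Vcells
  used_mb <- sum(mem[, 2])       # second column holds "used" in Mb
  cat(
    sprintf("%s\tstep=%d\tused_mb=%.1f\n",
            format(Sys.time(), "%H:%M:%S"), step, used_mb),
    file = file, append = TRUE
  )
  invisible(used_mb)
}

for (i in 1:5) {
  x <- rnorm(1e5)                # stand-in for loading and parsing one file
  rm(x)
  log_memory(i)
}

log_lines <- readLines(log_file) # one line per iteration
```

If the logged values stay flat while the process's memory use reported by the operating system keeps growing, the leak is most likely outside R's own heap.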

What I've done so far regarding memory control
