Poor performance when writing large XML-based log file


Problem description

I'm relatively new to Python, and I've written a fairly simple script that logs performance statistics for a defined list of applications. The script samples processes at intervals (using psutil) and returns various stats, which are then logged. To make it easier to do interesting things with the data later, I'm using an XML log format.

Below is a much simplified version of the log structure:

<?xml version="1.0" ?>
<data>
    <periodic>
        <sample name="2015-02-25_23-22-54">
            <cpu app="safari">10.5</cpu>
            <memory app="safari">1024</memory>
            <disk app="safari">60</disk>
            <network app="safari">720</network>
        </sample>
    </periodic>
</data>

I'm currently using cElementTree to parse and create the log file. Each iteration of the sampling loop parses the existing log file, appends the latest data to the end, then writes the new file to disk.

Simplified version of my log writer class:

import datetime
import time
import xml.etree.cElementTree as etree
from xml.dom import minidom

logfile = 'path/to/logfile.xml'

class WriteXmlLog:
    # Parse the logfile.
    def __init__(self):
        self.root = etree.parse(logfile).getroot()
        self.periodic = list(self.root.iter('periodic'))[0]

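    def __write_out(self, log_file):
        """Write log contents to file."""
        # Strip newlines and tabs that confuse toprettyxml, then re-indent the whole tree via minidom.
        raw = etree.tostring(self.root).replace('\n', '').replace('\t', '')
        with open(log_file, 'w') as f:
            f.write(minidom.parseString(raw).toprettyxml())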

    def __create_timestamp(self):
        """Returns a timestamp for naming a process sample iteration."""
        return datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d_%H-%M-%S')

    def write_sample(self, sample_list):
        """Create or append sample to XML log file."""
        node_sample_time = etree.Element('sample')
        node_sample_time.set('time', self.__create_timestamp())
        for i in sample_list:
            app_dict = i.get('values')
            for a in app_dict:
                sample = etree.Element(a)
                app = str(i.get('appname')).lower()
                sample.set('app', app)
                sample.text = str(app_dict[a])  # element text must be a string to serialize
                node_sample_time.append(sample)
        self.periodic.append(node_sample_time)
        self.__write_out(logfile)

The problem I'm having is that while this script works just fine if the log file is small, it's being used in instances where we have to sample the same processes every few seconds, sometimes for several days running. This can generate log files up to 10 MB (at which point they are rotated) in size. Running the script on a log this size takes about 15 seconds, and pegs 1 CPU core for the whole duration, not to mention excessive memory usage and disk I/O.

__write_out() is probably not very efficient since it runs two search and replace operations (to strip extraneous newlines and tabs that mess up toprettyxml), then sends the whole output through minidom on every iteration. This is done since cElementTree doesn't indent the nodes on its own, making the resulting file less than human-readable. However, the real problem seems to be simply that parsing and writing the entire log every iteration is inherently unscalable.

My first thought was to forgo cElementTree entirely, "manually" format the results as an XML string, then append them to the end of the log file every iteration (without parsing the existing file at all). The problem with this approach is that the resulting file will not be valid XML, since the root node won't have a closing tag. I can have the logger write one when it's finished (it's currently designed to loop infinitely until SIGTERM, then do some cleanup on exit), but ideally I would like the log file to always be valid XML during logging. It also just seems clumsy somehow.

Summary: What's the best way to write to an XML-based log file with good performance and reasonable resource usage that will scale to a log file size of approximately 10 MB?

Solution

If I'm understanding this right, you could create each "periodic" element as if it were an entire document (so you could still use cElementTree or similar; or just create it manually as a string).
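As a rough illustration of building one small element on its own, here is a minimal sketch. It assumes Python 3 and uses xml.etree.ElementTree (cElementTree was removed in Python 3.9); the function name build_sample and the shape of sample_list (dicts with 'appname' and 'values' keys, as in the question's write_sample) are assumptions for the example, not part of the original answer.

```python
import xml.etree.ElementTree as etree


def build_sample(timestamp, sample_list):
    """Serialize one <sample> element on its own, without touching the log file."""
    node = etree.Element('sample')
    node.set('time', timestamp)
    for entry in sample_list:
        app = str(entry.get('appname')).lower()
        for stat, value in entry.get('values').items():
            child = etree.SubElement(node, stat)
            child.set('app', app)
            child.text = str(value)  # element text must be a string
    # encoding='unicode' returns a str with no XML declaration
    return etree.tostring(node, encoding='unicode')


if __name__ == '__main__':
    print(build_sample('2015-02-25_23-22-54',
                       [{'appname': 'Safari', 'values': {'cpu': 10.5}}]))
```

The cost of this step depends only on the size of one sample, not on the size of the log accumulated so far.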

Then when it's time to write out such a (small) element, open your log file, and seek to the end minus the length of "</data>" (7). Write the new periodic element, then re-write "</data>", and you should be fine.

If you want to be extra careful, after moving to the end, read the last 7 characters to make sure they're as expected, then seek back to position the file just before them.
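The seek-and-rewrite idea, including the extra check, could look roughly like the sketch below. It is written under stated assumptions: the logger itself controls the exact bytes at the end of the file, and because the samples in the question sit inside <periodic>, the trailer here is assumed to be "</periodic>" and "</data>" on separate lines rather than "</data>" alone. The names TRAILER and append_fragment are hypothetical.

```python
import os

# Assumed exact tail of the log file; adjust to whatever the logger really writes.
TRAILER = b'</periodic>\n</data>\n'


def append_fragment(path, fragment, trailer=TRAILER):
    """Insert `fragment` (bytes) just before the closing tags at the end of the file."""
    with open(path, 'r+b') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size < len(trailer):
            raise ValueError('log file is shorter than the expected trailer')
        # Extra-careful step: confirm the file really ends with the trailer.
        f.seek(size - len(trailer))
        if f.read(len(trailer)) != trailer:
            raise ValueError('log file does not end with the expected trailer')
        # Overwrite the old trailer with the new fragment followed by the trailer.
        f.seek(size - len(trailer))
        f.write(fragment + trailer)


if __name__ == '__main__':
    log_path = 'logfile.xml'  # hypothetical path
    with open(log_path, 'wb') as f:
        f.write(b'<?xml version="1.0" ?>\n<data>\n<periodic>\n' + TRAILER)
    append_fragment(log_path, b'<sample time="2015-02-25_23-22-54">'
                              b'<cpu app="safari">10.5</cpu></sample>\n')
```

Each call touches only the last few bytes of the file, so the work per sample stays roughly constant as the log grows. Writing the fragment and the trailer in one call keeps the window in which the file is not well-formed short, though this sketch does not make the update atomic.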
