Performance - Python vs. C#/C++/C reading char-by-char


Problem Description


So I have these giant XML files (and by giant, I mean like 1.5GB+) and they don't have CRLFs. I'm trying to run a diff-like program to find the differences between these files.

Since I've yet to find a diff program that won't explode due to memory exhaustion, I've decided the best bet was to add CRLFs after closing tags.

I wrote a Python script to read char-by-char and add new-lines after '>'. The problem is I'm running this on a single-core PC from circa 1995 or something ridiculous, and it's only processing about 20MB/hour when I have both conversions running at the same time.

Any idea if writing this in C#/C/C++ instead will yield any benefits? If not, does anyone know of a diff program that will go byte-by-byte? Thanks.


EDIT:

Here's the code for my processing function...

import codecs

def read_and_format(inputfile, outputfile):
    ''' Open input and output files, then read char-by-char and add new lines after ">" '''
    infile = codecs.open(inputfile, "r", "utf-8")
    outfile = codecs.open(outputfile, "w", "utf-8")

    char = infile.read(1)
    while True:
        if char == "":  # read(1) returns an empty string at EOF
            break
        else:
            outfile.write(char)
            if char == ">":
                outfile.write("\n")
        char = infile.read(1)

    infile.close()
    outfile.close()


EDIT2: Thanks for the awesome responses. Increasing the read size created an unbelievable speed increase. Problem solved.

Solution

Reading and writing a single character at a time is almost always going to be slow, because disks are block-based devices rather than character-based devices: each read pulls in far more than the one byte you're after, and the surplus has to be discarded.

Try reading and writing more at a time, say 8192 bytes (8KB), then find and add the newlines in that string before writing it out; you should save a lot in performance because far less I/O is required.
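For illustration, here is a minimal sketch of that chunked approach, using the same codecs-based files as the question (the function name and the 8KB chunk size are just for this example):

import codecs

def read_and_format_chunked(inputfile, outputfile, chunk_size=8192):
    ''' Read 8KB at a time and insert a newline after every ">". '''
    infile = codecs.open(inputfile, "r", "utf-8")
    outfile = codecs.open(outputfile, "w", "utf-8")

    while True:
        chunk = infile.read(chunk_size)
        if not chunk:  # empty string signals EOF
            break
        # str.replace scans the whole chunk in one C-level pass, so the
        # per-character Python loop disappears entirely.
        outfile.write(chunk.replace(">", ">\n"))

    infile.close()
    outfile.close()

Because the pattern is a single character, chunk boundaries need no special handling: every ">" always sits wholly inside one chunk.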

As LBushkin points out, your I/O library may be doing buffering, but unless there is some form of documentation that shows this does indeed happen (for reading AND writing), it's a fairly easy thing to try before rewriting in a different language.
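A quick, hypothetical way to measure whether the change helps before reaching for another language (the file names here are placeholders):

import time

start = time.time()
read_and_format_chunked("big.xml", "big_with_newlines.xml")
print("Elapsed: %.1f seconds" % (time.time() - start))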
