Performance - Python vs. C#/C++/C reading char-by-char


Problem Description


So I have these giant XML files (and by giant, I mean like 1.5GB+) and they don't have CRLFs. I'm trying to run a diff-like program to find the differences between these files.

Since I've yet to find a diff program that won't explode due to memory exhaustion, I've decided the best bet was to add CRLFs after closing tags.

I wrote a Python script to read char-by-char and add new-lines after '>'. The problem is I'm running this on a single-core PC from circa 1995 or something ridiculous, and it's only processing about 20MB/hour when I have both conversions running at the same time.

Any idea if writing this in C#/C/C++ instead will yield any benefits? If not, does anyone know of a diff program that will go byte-by-byte? Thanks.


EDIT:

Here's the code for my processing function...

import codecs

def read_and_format(inputfile, outputfile):
    ''' Open input and output files, then read char-by-char and add new lines after ">" '''
    infile = codecs.open(inputfile, "r", "utf-8")
    outfile = codecs.open(outputfile, "w", "utf-8")

    char = infile.read(1)
    while True:
        if char == "":  # read(1) returns an empty string at EOF
            break
        else:
            outfile.write(char)
            if char == ">":
                outfile.write("\n")
        char = infile.read(1)

    infile.close()
    outfile.close()


EDIT2: Thanks for the awesome responses. Increasing the read size created an unbelievable speed increase. Problem solved.

Solution

Reading and writing a single character at a time is almost always going to be slow, because disks are block-based devices rather than character-based devices: each read pulls in far more than the one byte you're after, and the surplus has to be discarded.

Try reading and writing more at a time, say 8192 bytes (8KB), then find and add the newlines in that string before writing it out; you should save a lot in performance because far less I/O is required.
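For illustration, here is a minimal sketch of that chunked approach, using the same codecs-based files as the question (the function name and the 8KB chunk size are just for this example):

import codecs

def read_and_format_chunked(inputfile, outputfile, chunk_size=8192):
    ''' Read 8KB at a time and insert a newline after every ">". '''
    infile = codecs.open(inputfile, "r", "utf-8")
    outfile = codecs.open(outputfile, "w", "utf-8")

    while True:
        chunk = infile.read(chunk_size)
        if not chunk:  # empty string signals EOF
            break
        # str.replace scans the whole chunk in one C-level pass, so the
        # per-character Python loop disappears entirely.
        outfile.write(chunk.replace(">", ">\n"))

    infile.close()
    outfile.close()

Because the pattern is a single character, chunk boundaries need no special handling: every ">" always sits wholly inside one chunk.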

As LBushkin points out, your I/O library may be doing buffering, but unless there is some form of documentation that shows this does indeed happen (for reading AND writing), it's a fairly easy thing to try before rewriting in a different language.
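A quick, hypothetical way to measure whether the change helps before reaching for another language (the file names here are placeholders):

import time

start = time.time()
read_and_format_chunked("big.xml", "big_with_newlines.xml")
print("Elapsed: %.1f seconds" % (time.time() - start))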
