Python - Best way to read a file and break out the lines by a delimiter


What is the best way to read a file and break out the lines by a delimiter? Data returned should be a list of tuples.

Can this method be beaten? Can this be done faster/using less memory?

def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        return [tuple(line.split(delim)) for line in f]
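For illustration, here is the posted function exercised against a small throwaway file (the sample data and the `tempfile` scaffolding are my own additions, not part of the question). Note that `str.split` on a delimiter leaves the trailing newline attached to the last field:

```python
import os
import tempfile

def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        return [tuple(line.split(delim)) for line in f]

# Write a small sample file to exercise the function.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('1,2,3\n4,5,6\n')
    path = tmp.name

rows = readfile(path, ',')
print(rows)  # [('1', '2', '3\n'), ('4', '5', '6\n')]
os.remove(path)
```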

Solution

Your posted code reads the entire file and builds a copy of the file in memory as a single list of all the file contents split into tuples, one tuple per line. Since you ask about how to use less memory, you may only need a generator function:

def readfile(filepath, delim): 
    with open(filepath, 'r') as f: 
        for line in f:
            yield tuple(line.split(delim))
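To see the laziness in action (the sample file and `tempfile` setup below are assumptions for the demo), note that the generator performs no I/O until you ask it for the first tuple:

```python
import os
import tempfile

def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        for line in f:
            yield tuple(line.split(delim))

with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write('a,b\nc,d\n')
    path = tmp.name

gen = readfile(path, ',')  # no file I/O has happened yet
first = next(gen)          # opens the file and reads just one line
print(first)               # ('a', 'b\n')
remaining = list(gen)      # consumes the rest
print(remaining)           # [('c', 'd\n')]
os.remove(path)
```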

BUT! There is a major caveat! You can only iterate over the tuples returned by readfile once.

lines_as_tuples = readfile(mydata, ',')

for linedata in lines_as_tuples:
    # do something

This is okay so far, and a generator and a list look the same. But let's say your file was going to contain lots of floating point numbers, and your iteration through the file computed an overall average of those numbers. You could use the "# do something" code to calculate the overall sum and number of numbers, and then compute the average. But now let's say you wanted to iterate again, this time to find the differences from the average of each value. You'd think you'd just add another for loop:

for linedata in lines_as_tuples:
    # do another thing
    # BUT - this loop never does anything because lines_as_tuples has been consumed!
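Here is a minimal, runnable sketch of that two-pass scenario (the sample numbers and temp-file setup are mine). The first pass works; the second pass comes back empty because the generator is spent:

```python
import os
import tempfile

def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        for line in f:
            yield tuple(line.split(delim))

with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write('1.0\n2.0\n3.0\n')
    path = tmp.name

lines_as_tuples = readfile(path, ',')

# First pass: compute the overall average.
total = count = 0
for linedata in lines_as_tuples:
    total += float(linedata[0])
    count += 1
print(total / count)  # 2.0

# Second pass: silently produces nothing, the generator is exhausted.
diffs = [float(t[0]) - total / count for t in lines_as_tuples]
print(diffs)  # []
os.remove(path)
```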

BAM! This is a big difference between generators and lists. At this point in the code now, the generator has been completely consumed - but there is no special exception raised, the for loop simply does nothing and continues on, silently!

In many cases, the list that you would get back is only iterated over once, in which case a conversion of readfile to a generator would be fine. But if what you want is a more persistent list, which you will access multiple times, then just using a generator will give you problems, since you can only iterate over a generator once.

My suggestion? Make readfile a generator, so that in its own little view of the world, it just yields each incremental bit of the file, nice and memory-efficient. Put the burden of retention of the data onto the caller - if the caller needs to refer to the returned data multiple times, then the caller can simply build its own list from the generator - easily done in Python using list(readfile('file.dat', ',')).
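A sketch of that caller-side pattern (the sample data and temp-file scaffolding are made up for the demo): materialize the generator once with `list()`, then iterate over the list as many times as needed:

```python
import os
import tempfile

def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        for line in f:
            yield tuple(line.split(delim))

with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write('1.0\n2.0\n3.0\n')
    path = tmp.name

# Caller materializes the list once; the generator stays simple.
data = list(readfile(path, ','))

avg = sum(float(t[0]) for t in data) / len(data)
diffs = [float(t[0]) - avg for t in data]  # a second pass works fine
print(avg, diffs)  # 2.0 [-1.0, 0.0, 1.0]
os.remove(path)
```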
