The "for line in file object" method to read files

Question

I'm trying to find the best way to read and process lines from a very large file. Here I just try

for line in f:

Part of my script is below:

o=gzip.open(file2,'w')
LIST=[]
f=gzip.open(file1,'r')
for i,line in enumerate(f):
   if i%4!=3:
      LIST.append(line)

   else:
      LIST.append(line)
      b1=[ord(x) for x in line]
      ave1=(sum(b1)-10)/float(len(line)-1)
      if (ave1 < 84):
         del LIST[-4:]
output1=o.writelines(LIST)

My file1 is around 10GB, and when I run the script, the memory usage just keeps increasing to around 15GB without any output. That means the computer is still trying to read the whole file into memory first, right? This is really no different from using readlines().

However, in the post "Different ways to read large data in python", Srika told me: "The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files."

But obviously I still need to worry about large files. I'm really confused. Thanks.

Edit: Every 4 lines form a group in my data. The purpose is to do some calculations on every 4th line and, based on that calculation, decide whether to keep those 4 lines. So writing lines is my purpose.

Recommended answer

It looks like your loop keeps every line it has read in memory, and only at the end of the function writes them all to a file at once. The memory growth comes from the ever-growing list, not from iterating over the file. Maybe you can try this process:

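  1. Read the lines you need into memory (the first 3 lines).
  2. On the 4th line, append the line and perform your calculation.
  3. If the calculation gives the result you're looking for, flush the values in your collection to the file.
  4. Regardless of the outcome, create a new collection instance.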

I haven't tried this out, but it could look something like this:

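import gzip

# file1 and file2 are the input and output paths from the question's script.
# Assumes Python 3: text mode ('rt'/'wt') yields str lines, so ord() works.
with gzip.open(file1, 'rt') as f, gzip.open(file2, 'wt') as o:
    LIST = []
    for i, line in enumerate(f):
        LIST.append(line)
        if i % 4 == 3:
            # Fourth line of the group: average character code, excluding
            # the trailing newline (ord('\n') == 10).
            b1 = [ord(x) for x in line]
            ave1 = (sum(b1) - 10) / float(len(line) - 1)

            # If we've found what we want, save the group to the file
            if ave1 >= 84:
                o.writelines(LIST)

            # Release the values in the list by starting a clean list to work with
            LIST = []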

One caveat: since your file is so large, this may not be the best technique because of the number of lines you would still have to write to the output file, but it may be worth investigating regardless.
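The four steps above can also be wrapped in a small self-contained function. This is only a sketch under a few assumptions, not a tested drop-in for the original script: it assumes Python 3, that each line ends with a newline, and that an incomplete group at the end of the file should be dropped. The name filter_groups and the min_avg parameter are hypothetical, and itertools.islice is swapped in for the enumerate counter to pull four lines at a time.

```python
import gzip
from itertools import islice


def filter_groups(src_path, dst_path, min_avg=84):
    """Copy 4-line groups whose 4th line has a mean character code >= min_avg.

    Returns the number of groups written. Only one group is held in memory
    at a time, so memory use does not grow with the size of the input.
    """
    kept = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        while True:
            group = list(islice(src, 4))  # read at most four lines
            if len(group) < 4:
                break  # end of file; an incomplete trailing group is dropped
            scores = group[3].rstrip("\n")
            if not scores:
                continue  # empty 4th line: skip to avoid dividing by zero
            # Same value as (sum(ord) - 10) / (len(line) - 1) in the code
            # above when the line ends with a newline.
            avg = sum(ord(ch) for ch in scores) / len(scores)
            if avg >= min_avg:
                dst.writelines(group)
                kept += 1
    return kept
```

It could be called as filter_groups("in.gz", "out.gz"), where both file names are placeholders. Because each group is written as soon as it passes the check, output appears while the input is still being read.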
