Different ways to read large data in Python


Question

I'm dealing with large data, so finding a good way for reading data is really important. I'm just a little bit confused about different reading methods.

# (assuming: import gzip, fileinput; from itertools import islice)

1. f = gzip.open(file, 'r')
   for line in f:
       process(line)
   # how can I process the nth line? can I?

2. f = gzip.open(file, 'r').readlines()
   # f is a list
   f[10000]
   # we can process the nth line

3. f = gzip.open(file, 'r')
   while True:
       linelist = list(islice(f, 4))

4. for line in fileinput.input():
       process(line)

What's the difference between 2 and 3? I find their memory usage is the same. Does islice() also need to load the whole file into memory first (and then take it bit by bit)? I hear the 4th method is the least memory-consuming; is it really processing the file bit by bit? For a 10GB-scale file, which file-reading method would you recommend? Any thought/information is welcome.

Edit: I think one of my problems is that I sometimes need to pick out specific lines. Say:

from itertools import islice

f1 = open(inputfile1, 'r')
while True:
    line_group1 = list(islice(f1, 3))
    if not line_group1:
        break
    # then process specific lines, say, the second line
    processed_2nd_line = process(line_group1[1])
    if ( ....):
        LIST1.append(line_group1[0])
        LIST1.append(processed_2nd_line)
        LIST1.append(line_group1[2])

然后…….像

with open(file, 'r') as f:
    for line in f:
        process(line)

may not work, am I correct?

Answer

You forgot:

with open(...) as f:
    for line in f:
        <do something with line>

The with statement handles opening and closing the file, including when an exception is raised in the inner block. for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management, so you don't have to worry about large files.
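A minimal sketch of that pattern applied to the "how can I process the nth line?" question (the file path and sample contents here are made up for illustration): enumerate attaches an index to each line while the file is still read one line at a time.

```python
import gzip
import os
import tempfile

# Build a tiny sample file so the sketch is self-contained
# (a stand-in for the real multi-GB gzip file).
path = os.path.join(tempfile.gettempdir(), "data.txt.gz")
with gzip.open(path, "wt") as f:
    f.write("line0\nline1\nline2\nline3\n")

target = 2          # 0-based index of the wanted line
nth_line = None

# "rt" decodes the compressed bytes to text lines.
with gzip.open(path, "rt") as f:
    for i, line in enumerate(f):
        if i == target:
            nth_line = line.rstrip("\n")
            break   # stop here; the rest of the file is never read

print(nth_line)  # → line2
```

Note that this still scans the file from the start up to line n: a gzip stream cannot be seeked by line, so if you need many random lines, you would typically build an index of byte offsets (or work from an uncompressed copy) first.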

Approaches 2 and 3 are not advised for large files, as they read and load the entire file contents into memory before processing starts. To read large files you need a way to avoid reading the whole file in a single go.
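By contrast, pattern 3 from the question can be made fully lazy: islice only pulls lines from the file iterator on demand, so only one small group is in memory at a time. A sketch using an in-memory stand-in for the large file (a gzip.open(..., "rt") handle iterates the same way):

```python
import io
from itertools import islice

# Stand-in line iterator; in practice this would be
# gzip.open(path, "rt") or a plain open(path).
f = io.StringIO("a\nb\nc\nd\ne\nf\ng\n")

groups = []
while True:
    group = list(islice(f, 3))  # reads at most 3 lines, lazily
    if not group:
        break                   # iterator exhausted
    groups.append([s.rstrip("\n") for s in group])

print(groups)  # → [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```

Each islice call consumes at most the group size from the iterator, so memory use is proportional to the group size, not the file size.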

There should be one -- and preferably only one -- obvious way to do it.
