Moving to an arbitrary position in a file in Python


Problem description


Let's say that I routinely have to work with files with an unknown, but large, number of lines. Each line contains a set of integers (space, comma, semicolon, or some non-numeric character is the delimiter) in the closed interval [0, R], where R can be arbitrarily large. The number of integers on each line can be variable. Often times I get the same number of integers on each line, but occasionally I have lines with unequal sets of numbers.

Suppose I want to go to Nth line in the file and retrieve the Kth number on that line (and assume that the inputs N and K are valid --- that is, I am not worried about bad inputs). How do I go about doing this efficiently in Python 3.1.2 for Windows?

I do not want to traverse the file line by line.

I tried using mmap, but while poking around here on SO, I learned that that's probably not the best solution on a 32-bit build because of the 4GB limit. And in truth, I couldn't really figure out how to simply move N lines away from my current position. If I can at least just "jump" to the Nth line then I can use .split() and grab the Kth integer that way.

The nuance here is that I don't just need to grab one line from the file. I will need to grab several lines: they are not necessarily all near each other, the order in which I get them matters, and the order is not always based on some deterministic function.

Any ideas? I hope this is enough information.

Thanks!

Solution

Python's seek goes to a byte offset in a file, not to a line offset, simply because that's the way modern operating systems and their filesystems work -- the OS/FS just don't record or remember "line offsets" in any way whatsoever, and there's no way for Python (or any other language) to just magically guess them. Any operation purporting to "go to a line" will inevitably need to "walk through the file" (under the covers) to make the association between line numbers and byte offsets.
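A minimal sketch of the byte-offset behaviour described above (the throwaway file and its contents are made up purely for the demo):

```python
import os
import tempfile

# Create a small throwaway file to demonstrate that seek() counts bytes, not lines.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'alpha\nbeta\ngamma\n')

with open(path, 'rb') as f:
    f.seek(6)          # byte offset 6 is the 'b' of 'beta' -- not "line 6"
    print(f.read(4))   # b'beta'

os.remove(path)
```

There is no `seek`-like call that takes a line number, which is exactly why an index mapping line numbers to byte offsets (as built below) is needed.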

If you're OK with that and just want it hidden from your sight, then the solution is the standard library module linecache -- but performance won't be any better than that of code you could write yourself.
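For completeness, here is what the `linecache` route looks like (the temporary file is made up for the demo; `linecache.getline` is the real stdlib call, 1-based, returning `''` rather than raising on a bad line number):

```python
import linecache
import os
import tempfile

# Throwaway file just for the demo.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('first\nsecond\nthird\n')

# getline is 1-based and keeps the trailing newline.
print(repr(linecache.getline(path, 2)))   # 'second\n'

linecache.clearcache()
os.remove(path)
```

Under the covers `linecache` reads and splits the whole file, which is why it offers no performance win over hand-rolled code.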

If you need to read from the same large file multiple times, a large optimization would be to run once on that large file a script that builds and saves to disk the line number - to - byte offset correspondence (technically an "index" auxiliary file); then, all your successive runs (until the large file changes) could very speedily use the index file to navigate with very high performance through the large file. Is this your use case...?

Edit: since apparently this may apply -- here's the general idea (net of careful testing, error checking, or optimization;-). To make the index, use makeindex.py, as follows:

import array
import sys

BLOCKSIZE = 1024 * 1024

def reader(f):
  # Scan the file in large blocks and yield the byte offset of every
  # b'\n' found -- i.e., the offset of the END of each line.
  blockstart = 0
  while True:
    block = f.read(BLOCKSIZE)
    if not block: break
    inblock = 0
    while True:
      nextnl = block.find(b'\n', inblock)
      if nextnl < 0:
        # No more newlines in this block: account for its bytes and move on.
        # (A final line with no trailing newline gets no offset.)
        blockstart += len(block)
        break
      yield nextnl + blockstart
      inblock = nextnl + 1

def doindex(fn):
  with open(fn, 'rb') as f:
    # result format: x[0] is tot # of lines,
    # x[N] is byte offset of END of line N (1+)
    result = array.array('L', [0])
    result.extend(reader(f))
    result[0] = len(result) - 1
    return result

def main():
  for fn in sys.argv[1:]:
    index = doindex(fn)
    with open(fn + '.indx', 'wb') as p:
      print('File', fn, 'has', index[0], 'lines')
      index.tofile(p)

main()

and then to use it, for example, the following useindex.py:

import array
import sys

def readline(n, f, findex):
  # n is 0-based: findex[n] holds the offset of the newline that ends the
  # previous line (or the -1 sentinel before line 0), so findex[n] + 1 is
  # the first byte of the requested line, and the read below covers exactly
  # that line, newline included.
  f.seek(findex[n] + 1)
  data = f.read(findex[n+1] - findex[n])
  return data.decode('utf8')

def main():
  fn = sys.argv[1]
  with open(fn + '.indx', 'rb') as f:
    # 'l' (signed) rather than the 'L' used when writing: same item size,
    # but signed lets us store the -1 sentinel in slot 0 below.
    findex = array.array('l')
    findex.fromfile(f, 1)            # slot 0 holds the line count
    findex.fromfile(f, findex[0])    # then one end-of-line offset per line
    findex[0] = -1                   # sentinel so findex[0] + 1 == 0
  with open(fn, 'rb') as f:
    for n in sys.argv[2:]:
      print(n, repr(readline(int(n), f, findex)))

main()

Here's an example (on my slow laptop):

$ time py3 makeindex.py kjv10.txt 
File kjv10.txt has 100117 lines

real    0m0.235s
user    0m0.184s
sys 0m0.035s
$ time py3 useindex.py kjv10.txt 12345 98765 33448
12345 '\r\n'
98765 '2:6 But this thou hast, that thou hatest the deeds of the\r\n'
33448 'the priest appointed officers over the house of the LORD.\r\n'

real    0m0.049s
user    0m0.028s
sys 0m0.020s
$ 

The sample file is a plain-text copy of the King James Bible:

$ wc kjv10.txt
100117  823156 4445260 kjv10.txt

100K lines, 4.4 MB, as you can see; this takes about a quarter second to index and 50 milliseconds to read and print out three arbitrary lines (no doubt this can be vastly accelerated with more careful optimization and a better machine). The index in memory (and on disk too) takes 4 bytes per line of the textfile being indexed, and performance should scale in a perfectly linear way, so if you had about 100 million lines, 4.4 GB, I would expect about 4-5 minutes to build the index and a minute to extract and print out three arbitrary lines (and the 400 MB of RAM taken by the index should not inconvenience even a small machine -- even my tiny slow laptop has 2GB, after all ;-).

You can also see that (for speed and convenience) I treat the file as binary and assume utf8 encoding -- which of course also works with any subset such as ASCII (e.g., that KJ text file is ASCII) -- and I don't bother collapsing \r\n into a single character if that's what the file uses as a line terminator (it's pretty trivial to do that after reading each line if you want).
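If you do want each returned line without its terminator, the post-processing step mentioned above is a one-liner (sketched here on a literal taken from the sample output, not wired into the code above):

```python
# Strip a trailing '\r\n' or '\n' after reading a line, as suggested above.
line = '2:6 But this thou hast, that thou hatest the deeds of the\r\n'
print(repr(line.rstrip('\r\n')))
```

`str.rstrip('\r\n')` removes any run of `'\r'` and `'\n'` characters from the end, so it handles both Unix and Windows line endings uniformly.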
