How to read tokens without reading whole line or file


Problem description

Is there a well-hidden way to read tokens from a file or file-like object without reading entire lines? The application I immediately have (someone else's problem, not mine) is transposing a large matrix with a few very long rows, essentially performing an itertools.izip() on iterators that each pick out the elements of a single row. The idea is not to have the entire file in memory during iteration.

The rows are space-delimited ASCII decimal numbers.

The problem would be simple with Java's Scanner class, but I don't see anything in the Python Standard Library that appears to tokenize without having the whole input in a string.

For the record, I know how to write this on my own. I'm just wondering if there's a standard tool that I missed. Something FOSS/libre that can be EasyInstalled is good, too, but I don't see anything on PYPI either.

The full problem was to take the sample input:

"123 3 234234 -35434 112312 54 -439 99 0 42\n" +
"13 456 -78 910 333 -44 5555 6 8"

...and produce the output (as a generator, without reading all of the very long rows into memory at once):

[123, 13], [3, 456], [234234, -78], ...etc

As I said, it's essentially itertools.izip(iterator1, iterator2), pointing iterator1 at the start of the file, and iterator2 just past the newline to read the second row.

Solution

To read tokens from a file one at a time, you could use the re module to generate tokens from a memory-mapped file:

#!/usr/bin/env python3
import re
import sys
from mmap import ACCESS_READ, mmap

def generate_tokens(filename, pattern):
    # Map the file read-only; the regex then scans the mapping, not a string copy.
    with open(filename, "rb") as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        yield from re.finditer(pattern, mm)

# sum all integers in a file specified at the command-line
# (the optional leading "-" lets negative numbers keep their sign)
print(sum(int(m.group()) for m in generate_tokens(sys.argv[1], br'-?\d+')))

It works even if the file doesn't fit in memory, because the operating system pages the mapped file in on demand instead of loading it all at once.
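
The same memory-mapping idea can be stretched to the transposition described in the question. What follows is only a rough sketch, not a standard-library facility: the helper names `row_starts`, `row_tokens` and `transpose` are made up for illustration, the token syntax (optionally signed decimal integers) is an assumption taken from the sample input, and the built-in `zip` stands in for Python 2's `itertools.izip`. It makes one pass over the mapping to find where each row begins, then walks one regex scanner per row in lockstep.

```python
import re
from mmap import ACCESS_READ, mmap

# Assumed token syntax: optionally signed decimal integers.
TOKEN = re.compile(rb"-?\d+")


def row_starts(mm):
    """Yield the byte offset at which each row of the mapping begins."""
    pos = 0
    while pos < len(mm):
        yield pos
        newline = mm.find(b"\n", pos)
        if newline == -1:
            return
        pos = newline + 1


def row_tokens(mm, start):
    """Lazily yield the integers of the row that begins at offset `start`."""
    end = mm.find(b"\n", start)
    if end == -1:
        end = len(mm)
    for m in TOKEN.finditer(mm, start, end):
        yield int(m.group())


def transpose(filename):
    """Yield one column at a time as a list, stopping at the shortest row."""
    with open(filename, "rb") as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        rows = [row_tokens(mm, start) for start in row_starts(mm)]
        try:
            for column in zip(*rows):
                yield list(column)
        finally:
            # Close unfinished row generators so that no regex scanner is
            # still holding the mapping when the with-block closes it.
            for row in rows:
                row.close()
```

For the sample input from the question, `list(transpose(path))` starts with `[123, 13], [3, 456], [234234, -78]` and stops after nine columns, because the second row has only nine numbers. Note that `mmap` needs a real, non-empty file, so this does not cover pipes or other file-like objects.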
