How to read tokens without reading whole line or file

Problem description

Is there a well-hidden way to read tokens from a file or file-like object without reading entire lines? The application I immediately have (someone else's problem, not mine) is transposing a large matrix with a few very long rows, essentially performing an itertools.izip() on iterators that pick out the elements of a single column. The idea is not to have the entire file in memory during iteration.

The rows are space-delimited ASCII decimal numbers.

The problem would be simple with Java's Scanner class, but I don't see anything in the Python Standard Library that appears to tokenize without having the whole input in a string.

For the record, I know how to write this on my own. I'm just wondering if there's a standard tool that I missed. Something FOSS/libre that can be EasyInstalled is good, too, but I don't see anything on PYPI either.

The full problem was to take the sample input:

"123 3 234234 -35434 112312 54 -439 99 0 42\n" +
"13 456 -78 910 333 -44 5555 6 8"

...and produce the output (as a generator, without reading all of the very long rows into memory at once):

[123, 13], [3, 456], [234234, -78], ...etc

As I said, it's essentially itertools.izip(iterator1, iterator2), pointing iterator1 at the start of the file, and iterator2 just past the newline to read the second row.

Recommended answer

To read tokens from a file one by one, you could use the re module to generate tokens from a memory-mapped file:

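#!/usr/bin/env python3
import re
import sys
from mmap import ACCESS_READ, mmap

def generate_tokens(filename, pattern):
    # Map the file read-only; the OS pages data in on demand as the regex scans it.
    with open(filename, 'rb') as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        # pattern must be a bytes pattern because mm is a bytes-like object
        yield from re.finditer(pattern, mm)

# sum all integers in a file specified at the command-line
# (\d+ matches runs of digits only, so a leading minus sign is ignored)
print(sum(int(m.group()) for m in generate_tokens(sys.argv[1], br'\d+')))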

It works even if the file doesn't fit in memory.
