How to read tokens without reading whole line or file
Question
Is there a well-hidden way to read tokens from a file or file-like object without reading entire lines? The application I immediately have (someone else's problem, not mine) is transposing a large matrix with a few very long rows, essentially performing an itertools.izip() on iterators that pick out the elements of a single column. The idea is to not have the entire file in memory during iteration. The rows are space-delimited ASCII decimal numbers.

The problem would be simple with Java's Scanner class, but I don't see anything in the Python Standard Library that appears to tokenize without having the whole input in a string.

For the record, I know how to write this on my own. I'm just wondering if there's a standard tool that I missed. Something FOSS/libre that can be EasyInstalled is good, too, but I don't see anything on PyPI either.

The full problem was to take the sample input:

"123 3 234234 -35434 112312 54 -439 99 0 42\n" +
"13 456 -78 910 333 -44 5555 6 8"

...and produce the output (as a generator, without reading all of the very long rows into memory at once):

[123, 13], [3, 456], [234234, -78], ...etc

As I said, it's essentially itertools.izip(iterator1, iterator2), pointing iterator1 at the start of the file, and iterator2 just past the newline to read the second row.
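Absent a standard tool, the behavior the question asks for can be sketched by hand: a generator that reads fixed-size chunks and yields whitespace-delimited tokens, so neither a whole line nor the whole file is ever held in memory at once. This is a minimal illustrative sketch, not a library API; the name `read_tokens` and the chunk size are my own choices:

```python
def read_tokens(fileobj, chunk_size=8192):
    """Yield whitespace-delimited tokens from fileobj.

    Reads fixed-size chunks, so neither a whole line nor the
    whole file is ever materialized at once.
    """
    buf = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split()
        # If the chunk didn't end on whitespace, the last piece may be
        # an incomplete token: keep it in the buffer for the next round.
        if buf and not buf[-1].isspace():
            buf = parts.pop() if parts else buf
        else:
            buf = ""
        for tok in parts:
            yield tok
    if buf:
        yield buf
```

Wrapping the yielded strings in int() then gives exactly the per-row integer iterators the question describes.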
Answer

To read tokens from a file one by one, you could use the re module to generate tokens from a memory-mapped file:

#!/usr/bin/env python3
import re
import sys
from mmap import ACCESS_READ, mmap

def generate_tokens(filename, pattern):
    with open(filename) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        yield from re.finditer(pattern, mm)
# sum all integers in a file specified at the command line
print(sum(int(m.group()) for m in generate_tokens(sys.argv[1], br'\d+')))

It works even if the file doesn't fit in memory.