Lazily filtering a file before reading
Question
Suppose I have a BIG file with some lines I wish to ignore, and a function (file_function) which takes a file object. Can I return a new file object whose lines satisfy some condition, without reading the entire file first? The laziness is the important part.

Note: I could just save a temporary file with those lines ignored, but this is not ideal.
For example, suppose I had a csv file (with a bad line):

    1,2
    ooops
    3,4
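As an aside, consumers that accept any iterable of lines, such as csv.reader, can already be fed a lazy generator expression directly; the question's difficulty is specifically consumers that insist on a real file object. A minimal sketch (the StringIO stands in for the real file):

```python
import csv
import io

f = io.StringIO("1,2\nooops\n3,4\n")              # stands in for the real file
good = (line for line in f if line != "ooops\n")  # lazy: nothing is read yet
rows = list(csv.reader(good))                      # lines are pulled on demand
print(rows)  # -> [['1', '2'], ['3', '4']]
```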
A first attempt was to create a new file object (with the same methods as file) and override readline:

    class FileWithoutCondition(file):
        def __init__(self, f, condition):
            self.f = f
            self.condition = condition

        def readline(self):
            while True:
                x = self.f.readline()
                if self.condition(x):
                    return x
This works if file_function only uses readline... but not if it requires some other functionality.

    with open('file_name', 'r') as f:
        f1 = FileWithoutCondition(f, lambda x: x != 'ooops\n')
        result = file_function(f1)
A solution using StringIO may work, but I can't seem to get it to work.
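For reference, here is the obvious StringIO route as a minimal sketch. It does produce a working file object, but it materializes the whole filtered file in memory first, which is exactly the eagerness the question wants to avoid:

```python
import io

src = io.StringIO("1,2\nooops\n3,4\n")  # stands in for the real file
# join() drains the entire source before any consumer reads a byte
buf = io.StringIO("".join(line for line in src if line != "ooops\n"))
print(buf.read())  # -> "1,2\n3,4\n", but the source was fully consumed up front
```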
Ideally we should assume that file_function is a black-box function; specifically, I can't just tweak it to accept a generator (but maybe I can tweak a generator to be file-like?).
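On that parenthetical: a generator can in fact be tweaked to be file-like. A minimal Python 3 sketch (the class name FilteredReader is my own; the question's code is Python 2): subclass io.RawIOBase, implement readinto over a line-filtering generator, and wrap it so consumers that want read(), readline(), or iteration get them for free:

```python
import io

class FilteredReader(io.RawIOBase):
    """Read-only file-like view of `f` that drops lines failing `condition`."""
    def __init__(self, f, condition):
        self.lines = (line for line in f if condition(line))  # lazy filter
        self.leftover = b""

    def readable(self):
        return True

    def readinto(self, b):
        try:
            chunk = self.leftover or next(self.lines).encode()
        except StopIteration:
            return 0                      # EOF
        n = min(len(b), len(chunk))
        b[:n] = chunk[:n]
        self.leftover = chunk[n:]         # stash whatever didn't fit
        return n

# Wrapping restores the full text-file interface.
raw = FilteredReader(io.StringIO("1,2\nooops\n3,4\n"), lambda x: x != "ooops\n")
f1 = io.TextIOWrapper(io.BufferedReader(raw))
print(f1.read())  # -> "1,2\n3,4\n"
```

Because FilteredReader pulls one line at a time from the underlying generator, the source is never read ahead of the consumer; under these assumptions a black-box consumer such as pd.read_csv(f1) should also work.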
Is there a standard way to do this kind of lazy (skim-)reading of a generic file?

Note: the motivating example for this question is this pandas question, where just having readline is not enough to get pd.read_csv working...

Answer

Use a map-reduce approach with existing Python facilities. In this example I'm using a regular expression to match lines that start with the string GET /index, but you can use whatever condition fits your bill:

    import re
    from collections import defaultdict

    pattern = re.compile(r'GET /index\((.*)\).html')

    # define FILE appropriately.

    # map
    # the condition here serves to filter lines that can not match.
    matches = (pattern.search(line) for line in file(FILE, "rb") if 'GET' in line)
    mapp = (match.group(1) for match in matches if match)

    # now reduce, lazily:
    count = defaultdict(int)
    for request in mapp:
        count[request] += 1
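The answer's snippet is Python 2 (the file() builtin is gone in Python 3). A Python 3 translation of the same map-reduce pipeline, under the same assumptions (the log path and the GET /index pattern are illustrative):

```python
import re
from collections import defaultdict

def count_requests(path):
    """Lazily count 'GET /index(...).html' hits: one line in memory at a time."""
    pattern = re.compile(r'GET /index\((.*)\)\.html')
    count = defaultdict(int)
    with open(path, "r", errors="replace") as f:
        # map: cheap substring test first, regex only on candidate lines
        matches = (pattern.search(line) for line in f if "GET" in line)
        mapp = (m.group(1) for m in matches if m)
        # reduce, lazily:
        for request in mapp:
            count[request] += 1
    return count
```

count_requests("access.log") then behaves like the original, returning a defaultdict of per-request counts without ever holding the file in memory.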
This scans a >6GB file in a few seconds on my laptop. You can further split the file into chunks and feed them to threads or processes. I do not recommend using mmap unless you have the memory to map the entire file (it doesn't support windowing).
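A sketch of that chunk-splitting idea: compute newline-aligned byte ranges and hand each one to a worker process (the helper names and the line-counting worker are my own illustration, not part of the answer):

```python
import os
from multiprocessing import Pool

def chunk_ranges(path, n):
    """Split [0, filesize) into up to n ranges whose edges land just after a newline."""
    size = os.path.getsize(path)
    edges = [0]
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(size * i // n)
            f.readline()                      # skip forward to the next line boundary
            edges.append(min(f.tell(), size))
    edges.append(size)
    return [(s, e) for s, e in zip(edges, edges[1:]) if s < e]

def count_lines(args):
    """Worker: count the lines that start inside [start, end)."""
    path, start, end = args
    total = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            f.readline()
            total += 1
    return total
```

With these in place, sum(Pool().map(count_lines, [(path, s, e) for s, e in chunk_ranges(path, 4)])) fans the ranges out to four processes; because the edges are snapped to newlines, no line straddles two workers.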