Lazily filtering a file before reading


Problem Description


Suppose I have a BIG file with some lines I wish to ignore, and a function (file_function) which takes a file object. Can I return a new file object whose lines satisfy some condition, without reading the entire file first? This laziness is the important part.

Note: I could just save a temporary file with these lines ignored, but this is not ideal.

For example, suppose I had a csv file (with a bad line):

1,2
ooops
3,4

A first attempt was to create a new file object (with the same methods as file) and override readline:

class FileWithoutCondition(file):  # Python 2: subclass the built-in file type
    def __init__(self, f, condition):
        self.f = f
        self.condition = condition

    def readline(self):
        # Skip lines until one satisfies the condition. At EOF readline
        # returns '' -- without the `not x` guard we would loop forever.
        while True:
            x = self.f.readline()
            if not x or self.condition(x):
                return x

This works if file_function only uses readline... but not if it requires some other functionality.

with open('file_name', 'r') as f:
    f1 = FileWithoutCondition(f, lambda x: x != 'ooops\n')
    result = file_function(f1)
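
To make the limitation concrete: a consumer that only ever calls readline sees the filtered stream, but something like pd.read_csv also reads via other methods (read, iteration), bypassing the override. A minimal demonstration; sum_first_column is a made-up stand-in for a readline-only black box, not from the original:

def sum_first_column(f):
    # A "blackbox" that happens to use only readline -- this one works.
    total = 0
    while True:
        line = f.readline()
        if not line:
            break
        total += int(line.split(',')[0])
    return total

with open('file_name', 'r') as f:
    f1 = FileWithoutCondition(f, lambda x: x != 'ooops\n')
    print(sum_first_column(f1))  # prints 4: the 'ooops' line was skipped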

A solution using StringIO may work, but I can't seem to get it to work.
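
For reference, StringIO does work if you give up laziness and filter eagerly; it is essentially the temporary-file workaround done in memory. A sketch, reusing the condition from above:

import io

with open('file_name', 'r') as f:
    # Materialises the whole filtered file in RAM -- not lazy.
    filtered = io.StringIO(''.join(line for line in f if line != 'ooops\n'))

result = file_function(filtered)

For a BIG file this only moves the problem from disk to memory, which is why it is not ideal here.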

Ideally we should assume that file_function is a black-box function; specifically, I can't just tweak it to accept a generator (but maybe I can tweak a generator to be file-like? See the sketch after the note below).
Is there a standard way to do this kind of lazy (skim-)reading of a generic file?

Note: the motivating example for this question is this pandas question, where just having readline is not enough to get pd.read_csv working...
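
On the generator-to-file-like idea floated above: wrapping a filtered line iterator in an io.RawIOBase subclass yields an object with a real read method, which is enough for consumers such as pd.read_csv. A minimal sketch, assuming Python 3; FilteredReader and the 'file_name' path are illustrative names, not from the original:

import io
import pandas as pd

class FilteredReader(io.RawIOBase):
    # Presents an iterator of bytes lines as a read-only binary stream;
    # io.RawIOBase supplies read()/readall() on top of readinto().
    def __init__(self, lines):
        self.lines = iter(lines)
        self.leftover = b''

    def readable(self):
        return True

    def readinto(self, buf):
        # Refill the internal buffer from the iterator; 0 signals EOF.
        while not self.leftover:
            try:
                self.leftover = next(self.lines)
            except StopIteration:
                return 0
        n = min(len(buf), len(self.leftover))
        buf[:n] = self.leftover[:n]
        self.leftover = self.leftover[n:]
        return n

with open('file_name', 'rb') as f:
    good = (line for line in f if line != b'ooops\n')
    result = pd.read_csv(FilteredReader(good))

The filtering stays lazy: lines are pulled from the underlying file only as the consumer asks for more bytes.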

Solution

Use a map-reduce approach with existing Python facilities. In this example I'm using a regular expression to match lines that start with the string GET /index, but you can use whatever condition fits the bill:

import re
from collections import defaultdict

pattern = re.compile(r'GET /index\((.*)\)\.html')  # group 1 captures the request

# define FILE appropriately.
# map
# the condition here serves to cheaply pre-filter lines that cannot match.
matches = (pattern.search(line) for line in open(FILE) if 'GET' in line)
mapp    = (match.group(1) for match in matches if match)

# now reduce, lazy:
count = defaultdict(int)
for request in mapp:
    count[request] += 1

This scans a >6GB file in a few seconds on my laptop. You can further split the file into chunks and feed them to threads or processes. I don't recommend mmap unless you have the memory to map the entire file (it doesn't support windowing).
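
One possible shape for the chunking, as a sketch only: each worker re-runs the same map step over its own slice of lines, and the partial counts are merged in the parent. The chunk size and helper names are arbitrary choices, not part of the original answer:

from collections import Counter
from itertools import islice
from multiprocessing import Pool
import re

def count_requests(lines):
    # One map-reduce pass over a single chunk of lines.
    pattern = re.compile(r'GET /index\((.*)\)\.html')
    counts = Counter()
    for line in lines:
        if 'GET' in line:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

def chunks(f, size=100000):
    # Lazily yield the file as lists of `size` lines each.
    while True:
        block = list(islice(f, size))
        if not block:
            return
        yield block

if __name__ == '__main__':
    # define FILE appropriately, as above.
    total = Counter()
    with open(FILE) as f, Pool() as pool:
        for partial in pool.imap_unordered(count_requests, chunks(f)):
            total.update(partial)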
