Lazily filtering a file before reading


Problem Description


Suppose I have a BIG file with some lines I wish to ignore, and a function (file_function) which takes a file object. Can I return a new file object whose lines satisfy some condition, without reading the entire file first? This laziness is the important part.

Note: I could just save a temporary file with these lines ignored, but this is not ideal.

For example, suppose I had a csv file (with a bad line):

1,2
ooops
3,4

A first attempt was to create a new file object (with the same methods as file) and override readline:

class FileWithoutCondition(file):  # Python 2: subclass the built-in file type
    def __init__(self, f, condition):
        self.f = f
        self.condition = condition

    def readline(self):
        # Skip lines until one satisfies the condition. At EOF readline
        # returns '' -- without the `not x` guard we would loop forever.
        while True:
            x = self.f.readline()
            if not x or self.condition(x):
                return x

This works if file_function only uses readline... but not if it requires some other functionality.

with open('file_name', 'r') as f:
    f1 = FileWithoutCondition(f, lambda x: x != 'ooops\n')
    result = file_function(f1)
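
To make the limitation concrete: a consumer that only ever calls readline sees the filtered stream, but something like pd.read_csv also reads via other methods (read, iteration), bypassing the override. A minimal demonstration; sum_first_column is a made-up stand-in for a readline-only black box, not from the original:

def sum_first_column(f):
    # A "blackbox" that happens to use only readline -- this one works.
    total = 0
    while True:
        line = f.readline()
        if not line:
            break
        total += int(line.split(',')[0])
    return total

with open('file_name', 'r') as f:
    f1 = FileWithoutCondition(f, lambda x: x != 'ooops\n')
    print(sum_first_column(f1))  # prints 4: the 'ooops' line was skipped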

A solution using StringIO may work, but I can't seem to get it to work.
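
For reference, StringIO does work if you give up laziness and filter eagerly; it is essentially the temporary-file workaround done in memory. A sketch, reusing the condition from above:

import io

with open('file_name', 'r') as f:
    # Materialises the whole filtered file in RAM -- not lazy.
    filtered = io.StringIO(''.join(line for line in f if line != 'ooops\n'))

result = file_function(filtered)

For a BIG file this only moves the problem from disk to memory, which is why it is not ideal here.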

Ideally we should assume that file_function is a black-box function; specifically, I can't just tweak it to accept a generator (but maybe I can tweak a generator to be file-like? See the sketch after the note below).
Is there a standard way to do this kind of lazy (skim-)reading of a generic file?

Note: the motivating example for this question is this pandas question, where just having readline is not enough to get pd.read_csv working...
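
On the generator-to-file-like idea floated above: wrapping a filtered line iterator in an io.RawIOBase subclass yields an object with a real read method, which is enough for consumers such as pd.read_csv. A minimal sketch, assuming Python 3; FilteredReader and the 'file_name' path are illustrative names, not from the original:

import io
import pandas as pd

class FilteredReader(io.RawIOBase):
    # Presents an iterator of bytes lines as a read-only binary stream;
    # io.RawIOBase supplies read()/readall() on top of readinto().
    def __init__(self, lines):
        self.lines = iter(lines)
        self.leftover = b''

    def readable(self):
        return True

    def readinto(self, buf):
        # Refill the internal buffer from the iterator; 0 signals EOF.
        while not self.leftover:
            try:
                self.leftover = next(self.lines)
            except StopIteration:
                return 0
        n = min(len(buf), len(self.leftover))
        buf[:n] = self.leftover[:n]
        self.leftover = self.leftover[n:]
        return n

with open('file_name', 'rb') as f:
    good = (line for line in f if line != b'ooops\n')
    result = pd.read_csv(FilteredReader(good))

The filtering stays lazy: lines are pulled from the underlying file only as the consumer asks for more bytes.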

Solution

Use a map-reduce approach with existing Python facilities. In this example I'm using a regular expression to match lines that start with the string GET /index, but you can use whatever condition fits the bill:

import re
from collections import defaultdict

pattern = re.compile(r'GET /index\((.*)\)\.html')  # group 1 captures the request

# define FILE appropriately.
# map
# the condition here serves to cheaply pre-filter lines that cannot match.
matches = (pattern.search(line) for line in open(FILE) if 'GET' in line)
mapp    = (match.group(1) for match in matches if match)

# now reduce, lazy:
count = defaultdict(int)
for request in mapp:
    count[request] += 1

This scans a >6GB file in a few seconds on my laptop. You can further split the file into chunks and feed them to threads or processes. I don't recommend mmap unless you have the memory to map the entire file (it doesn't support windowing).
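
One possible shape for the chunking, as a sketch only: each worker re-runs the same map step over its own slice of lines, and the partial counts are merged in the parent. The chunk size and helper names are arbitrary choices, not part of the original answer:

from collections import Counter
from itertools import islice
from multiprocessing import Pool
import re

def count_requests(lines):
    # One map-reduce pass over a single chunk of lines.
    pattern = re.compile(r'GET /index\((.*)\)\.html')
    counts = Counter()
    for line in lines:
        if 'GET' in line:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

def chunks(f, size=100000):
    # Lazily yield the file as lists of `size` lines each.
    while True:
        block = list(islice(f, size))
        if not block:
            return
        yield block

if __name__ == '__main__':
    # define FILE appropriately, as above.
    total = Counter()
    with open(FILE) as f, Pool() as pool:
        for partial in pool.imap_unordered(count_requests, chunks(f)):
            total.update(partial)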
