python extract keywords from multiple .gz files


Problem description


Question: How to search for keywords in multiple files in Python (including compressed .gz files and uncompressed files)? I have multiple archived logs in a folder; the latest file is "messages", and the older logs are automatically compressed into .gz files.

-rw------- 1 root root 21262610 Nov 4 11:20 messages

-rw------- 1 root root 3047453 Nov 2 15:49 messages-20191102-1572680982.gz

-rw------- 1 root root 3018032 Nov 3 04:43 messages-20191103-1572727394.gz

-rw------- 1 root root 3026617 Nov 3 17:32 messages-20191103-1572773536.gz

-rw------- 1 root root 3044692 Nov 4 06:17 messages-20191104-1572819469.gz

I wrote a function to:

  1. Store all the file names in a list. (works)
  2. Open each file in the list; if it is a .gz file, use gzip.open().
  3. Search for the keywords.


But I think this way is not very smart, because the message log is actually very big and it is split across multiple .gz files, and I have lots of keywords stored in a keywords file.


So is there a better solution that concatenates all the files into a single I/O stream and then extracts the keywords from that stream?

import os
import gzip

def open_all_message_files(path):
    # Collect every log file whose name starts with "messages"
    files_list = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.startswith("messages"):
                files_list.append(os.path.join(root, file))

    # Scan each file; compressed logs yield bytes, plain logs yield str
    for x in files_list:
        if x.endswith('gz'):
            with gzip.open(x, "r") as f:
                for line in f:
                    if b'keywords_1' in line:
                        print(line)
                    if b'keywords_2' in line:
                        print(line)
        else:
            with open(x, "r") as f:
                for line in f:
                    if 'keywords_1' in line:
                        print(line)
                    if 'keywords_2' in line:
                        print(line)

Recommended answer


This is my first answer on stackoverflow, so please bear with me. I had a very similar problem where I needed to analyze several logs, some of which were too huge to fit entirely into memory. A solution to this problem is to create a data processing pipeline, similar to a unix/linux pipeline. The idea behind it is to break each task into its own individual function and use generators to achieve a more memory-efficient approach.

import os
import gzip
import re
import fnmatch

def find_files(pattern, path):
    """
    Find all the filenames that match a specific pattern
    using shell wildcard patterns; that way you avoid hardcoding
    the file pattern, i.e. 'messages'.
    """
    for root, dirs, files in os.walk(path):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)

def file_opener(filenames):
    """
    Open a sequence of filenames one at a time
    and make sure to close the file once we are done 
    scanning its content.
    """
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()

def chain_generators(iterators):
    """
    Chain a sequence of iterators together
    """
    for it in iterators:
        # Look up yield from if you're unsure what it does
        yield from it

def grep(pattern, lines):
    """
    Look for a pattern in a line
    """
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

# A simple way to use these functions together

logs = find_files('messages*', 'One/two/three')
files = file_opener(logs)
lines = chain_generators(files)
each_line = grep('keywords_1', lines)
for match in each_line:
    print(match)
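
The question also mentions that the keywords live in a separate file. One way to avoid scanning the stream once per keyword is to combine them all into a single regular expression and pass that to grep(). This is a minimal sketch, assuming a hypothetical keywords.txt with one keyword per line; the file name and path are placeholders:

# Build one combined pattern from a keywords file (hypothetical name),
# so every line is matched against all keywords in a single pass.
with open('keywords.txt') as kw_file:
    keywords = [line.strip() for line in kw_file if line.strip()]

combined_pattern = '|'.join(re.escape(kw) for kw in keywords)

logs = find_files('messages*', 'One/two/three')
files = file_opener(logs)
lines = chain_generators(files)
for match in grep(combined_pattern, lines):
    print(match)

Compiling a single alternation keeps the pipeline at one pass over the concatenated stream, which is what the question asked for.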


Let me know if you have any questions regarding my answer.
