Parallel file matching, Python

Problem description

I am trying to improve on a script which scans files for malicious code. We have a list of regex patterns in a file, one pattern on each line. These regexes are written for grep, as our current implementation is basically a bash find\grep combo. The bash script takes 358 seconds on my benchmark directory. I was able to write a Python script that does this in 72 seconds, but I want to improve it further. First I will post the base code and then the tweaks I have tried:

import os, sys, Queue, threading, re

fileList = []
rootDir = sys.argv[1]

class Recurser(threading.Thread):

    def __init__(self, queue, dir):
        self.queue = queue
        self.dir = dir
        threading.Thread.__init__(self)

    def run(self):
        self.addToQueue(self.dir)

    ## HELPER FUNCTION FOR INTERNAL USE ONLY
    def addToQueue(self, rootDir):
        for root, subFolders, files in os.walk(rootDir):
            for file in files:
                self.queue.put(os.path.join(root, file))
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)

class Scanner(threading.Thread):

    def __init__(self, queue, patterns):
        self.queue = queue
        self.patterns = patterns
        threading.Thread.__init__(self)

    def run(self):
        nextFile = self.queue.get()
        while nextFile is not -1:
            #print "Trying " + nextFile
            self.scanFile(nextFile)
            nextFile = self.queue.get()


    #HELPER FUNCTION FOR INTERNAL USE ONLY
    def scanFile(self, file):
        fp = open(file)
        contents = fp.read()
        i = 0
        #for patt in self.patterns:
        if self.patterns.search(contents):
            print "Match " + str(i) + " found in " + file

############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################


fileQueue = Queue.Queue()

#Get the shell scanner patterns
patterns = []
fPatt = open('/root/patterns')
giantRE = '('
for line in fPatt:
   #patterns.append(re.compile(line.rstrip(), re.IGNORECASE))
   giantRE = giantRE + line.rstrip() + '|'

giantRE = giantRE[:-1] + ')'
giantRE = re.compile(giantRE, re.IGNORECASE)

#start recursing the directories
recurser = Recurser(fileQueue,rootDir)
recurser.start()

print "starting scanner"
#start checking the files
for scanner in xrange(0,8):
   scanner = Scanner(fileQueue, giantRE)
   scanner.start()

This is obviously debugging/ugly code; do not mind the million queue.put(-1) calls, I will clean this up later. Some indentation is not showing up properly, particularly in scanFile.

Anyway, some things I've noticed: using 1, 4, or even 8 threads (for scanner in xrange(0,???):) does not make a difference. I still get ~72 seconds regardless. I assume this is due to Python's GIL.

As opposed to making one giant regex, I tried placing each line (pattern) in a list as a compiled RE and iterating through this list in my scanFile function. This resulted in longer execution time.
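
For illustration, here is a minimal sketch of the two approaches being compared; the patterns and sample text below are made up, not our real signatures.

import re

# Hypothetical patterns and sample text, purely for illustration
patterns = [r'eval\(base64_decode', r'system\(\$_GET', r'passthru\(']
contents = 'echo eval(base64_decode("..."));'

# Approach 1: one giant alternation, compiled once, single pass over the text
giant = re.compile('(?:' + '|'.join(patterns) + ')', re.IGNORECASE)
print(giant.search(contents) is not None)         # True

# Approach 2: a list of individually compiled REs, one pass per pattern
compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
print(any(p.search(contents) for p in compiled))  # True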

In an effort to avoid Python's GIL, I tried having each thread fork out to grep, as in:

#HELPER FUNCTION FOR INTERNAL USE ONLY
def scanFile(self, file):
    # (requires "import subprocess" at the top of the script)
    s = subprocess.Popen(("grep", "-El", "--file=/root/patterns", file), stdout=subprocess.PIPE)
    output = s.communicate()[0]
    if output != '':
        print 'Match found in ' + file

This resulted in longer execution time.

Any suggestions on improving performance?

:::::::::::::: EDIT ::::::::::::::

I cannot post answers to my own question yet; however, here are the answers to several points raised:

@David Nehme - Just to let people know, I am aware of the fact that I have a million queue.put(-1)'s.

@Blender - To mark the bottom of the queue. My scanner threads keep dequeuing until they hit the -1 at the bottom (while nextFile is not -1:). There are 8 processor cores; however, due to the GIL, using 1 thread, 4 threads, or 8 threads does NOT make a difference. Spawning 8 subprocesses resulted in significantly slower code (142 seconds vs. 72).
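
To make the sentinel idea concrete, here is a stripped-down sketch of that queueing pattern (not the script above), assuming one -1 sentinel per worker thread:

import Queue, threading

NUM_WORKERS = 8
q = Queue.Queue()

def produce(paths):
    for p in paths:
        q.put(p)
    # One -1 sentinel per worker, so every scanner thread eventually sees a stop marker.
    for _ in range(NUM_WORKERS):
        q.put(-1)

def consume():
    while True:
        item = q.get()
        if item == -1:   # hit a sentinel: this worker is done
            return
        # ... scan the file named by `item` here ...

threads = [threading.Thread(target=consume) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
produce(['/tmp/a', '/tmp/b'])   # hypothetical file paths
for t in threads:
    t.join()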

@ed - Yes, and it's just as slow as the find\grep combo; actually slower, because it indiscriminately greps files that aren't needed.

@Ron - Can't upgrade; this must be universal. Do you think this will get it faster than 72 seconds? The bash grepper takes 358 seconds. My Python giant-RE method does 72 seconds with 1-8 threads. The popen method with 8 threads (8 subprocesses) ran at 142 seconds. So far the Python-only giant-RE method is the clear winner by far.

@intuited

Here's the meat of our current find\grep combo (not my script). It's pretty simple. There are some additional things in there, like ls, but nothing that should result in a 5x slowdown. Even if grep -r is slightly more efficient, 5x is a HUGE slowdown.

 find "${TARGET}" -type f -size "${SZLIMIT}" -exec grep -Eaq --file="${HOME}/patterns" "{}" \; -and -ls | tee -a "${HOME}/found.txt"

The Python code is more efficient; I don't know why, but I tested it experimentally. I prefer to do this in Python. I have already achieved a 5x speedup with Python, and I would like to speed it up further.

:::::::::::::: WINNER WINNER WINNER ::::::::::::::

Looks like we have a winner.

intuited's shell script comes in 2nd place with 34 seconds; however, @steveha's came in first with 24 seconds. Because a lot of our boxes do not have Python 2.6, I had to cx_freeze it. I can write a shell script wrapper to wget a tar and unpack it. I do like intuited's for its simplicity, however.

Thank you for all your help, guys; I now have an efficient tool for sysadmining.

Answer

I think that, rather than using the threading module, you should be using the multiprocessing module for your Python solution. Python threads can run afoul of the GIL; the GIL is not a problem if you simply have multiple Python processes going.

I think that for what you are doing, a pool of worker processes is just what you want. By default, the pool creates one process for each core in your system's processor. Just call the .map() method with a list of filenames to check and the function that does the checking.

http://docs.python.org/library/multiprocessing.html
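
As a minimal sketch of that pattern (the checker function and file list below are hypothetical; the full, working program follows further down):

import multiprocessing as mp

def check_file(path):
    # Hypothetical checker: return the path if its contents match, else None.
    with open(path) as f:
        return path if 'suspicious-string' in f.read() else None

if __name__ == '__main__':
    files = ['/tmp/a.txt', '/tmp/b.txt']   # hypothetical file list
    pool = mp.Pool()                       # defaults to one worker process per core
    for hit in pool.map(check_file, files):
        if hit is not None:
            print(hit)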

If this is not faster than your threading implementation, then I don't think the GIL is your problem.

Okay, I'm adding a working Python program. This uses a pool of worker processes to open each file and search it for the pattern. When a worker finds a filename that matches, it simply prints it (to standard output), so you can redirect the output of this script into a file and you have your list of files.

I think this version is slightly easier to read and easier to understand.

I timed this, searching through the files in /usr/include on my computer. It completes the search in about half a second. Using find piped through xargs to run as few grep processes as possible, it takes about 0.05 seconds, about a 10x speedup. But I hate the baroque weird language you must use to get find to work properly, and I like the Python version. And perhaps on really big directories the disparity would be smaller, as part of the half-second for Python must have been startup time. And maybe half a second is fast enough for most purposes!

import multiprocessing as mp
import os
import re
import sys

from stat import S_ISREG


# uncomment these if you really want a hard-coded $HOME/patterns file
#home = os.environ.get('HOME')
#patterns_file = os.path.join(home, 'patterns')

target = sys.argv[1]
size_limit = int(sys.argv[2])
assert size_limit >= 0
patterns_file = sys.argv[3]


# build s_pat as string like:  (?:foo|bar|baz)
# This will match any of the sub-patterns foo, bar, or baz
# but the '?:' means Python won't bother to build a "match group".
with open(patterns_file) as f:
    s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))

# pre-compile pattern for speed
pat = re.compile(s_pat)


def walk_files(topdir):
    """yield up full pathname for each file in tree under topdir"""
    for dirpath, dirnames, filenames in os.walk(topdir):
        for fname in filenames:
            pathname = os.path.join(dirpath, fname)
            yield pathname

def files_to_search(topdir):
    """yield up full pathname for only files we want to search"""
    for fname in walk_files(topdir):
        try:
            # if it is a regular file and big enough, we want to search it
            sr = os.stat(fname)
            if S_ISREG(sr.st_mode) and sr.st_size >= size_limit:
                yield fname
        except OSError:
            pass

def worker_search_fn(fname):
    with open(fname, 'rt') as f:
        # read one line at a time from file
        for line in f:
            if re.search(pat, line):
                # found a match! print filename to stdout
                print(fname)
                # stop reading file; just return
                return

mp.Pool().map(worker_search_fn, files_to_search(target))
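
For example, assuming the script above is saved as scan.py (a name chosen here just for illustration), it could be run as python scan.py /target/dir 0 /root/patterns > found.txt: the first argument is the directory to scan, the second is the minimum file size in bytes (0 scans every regular file), the third is the patterns file, and the redirected output is your list of matching files.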
