Best way to perform multiprocessing on a large file Python


Problem description

I have a Python script that traverses a list (>1000 elements), finds each variable in a large file, and then outputs the result. I am reading the entire file more than 1000 times. I tried using multiprocessing, but it was not of much help. Here's what I am trying to do:

import gzip
from multiprocessing.pool import ThreadPool as Pool

def getForwardIP(clientIP, requestID):
    # "rt" opens the gzip file in text mode so split() works on str
    with gzip.open("xyz.log", "rt") as infile:
        for lines in infile:
            line = lines.split(" ")
            myRequestID = line[0]
            forwardIP = line[1]
            if myRequestID == requestID:
                print(forwardIP)

if __name__ == "__main__":
    pool_size = 8
    pool = Pool(pool_size)
    request_id_list = list()
    # request_id_list contains >1000 elements
    for id in request_id_list:
        pool.apply_async(getForwardIP, ("1.2.3.4.", id))
    pool.close()
    pool.join()

Is there a faster way? Any help will be appreciated. Thanks!

EDIT

(I AM WRITING MY ENTIRE CODE HERE) Thanks everyone for the suggestions. Now I am writing the file into a list rather than reading it 1000 times. I tried to multi-process the for loop, but it didn't work. Below is the code:

import gzip
import datetime
from multiprocessing.pool import ThreadPool as Pool

def getRequestID(r_line_filename, clientIP):
    # r_line_filename is a file with request_id and client_ip
    requestIDList = list()
    with gzip.open(r_line_filename, "rt") as infile:
        for lines in infile:
            line = lines.split(" ")
            requestID = line[1].strip("\n")
            myclientIP = line[0]
            if myclientIP == clientIP:
                requestIDList.append(requestID)
    print("R line list ready!")
    return requestIDList

def getFLineList(fFilename):
    # fFilename is a file with format request_id, forward_ip, epoch time
    fLineList = list()
    with gzip.open(fFilename, "rt") as infile:
        for lines in infile:
            fLineList.append(lines.split())
    print("F line list ready!")
    return fLineList

def forwardIP(lines, requestID):
    myrequestID = lines[0]
    forwardIP = lines[1]
    epoch = int(lines[2].split(".")[0])
    timex = datetime.datetime.fromtimestamp(epoch).strftime('%Y-%m-%d %H:%M:%S')
    if myrequestID == requestID:
        print("%s %s %s" % (clientIP, timex, forwardIP))

if __name__ == "__main__":
    pool = Pool()
    clientIP = "x.y.z.a"
    rLineList = getRequestID("rLine_subset.log.gz", clientIP)
    fLineList = getFLineList("fLine_subset.log.gz")
    for RID in rLineList:
        for lines in fLineList:
            pool.apply_async(forwardIP, (lines, RID))
    # close/join must come after ALL tasks are submitted, not inside the loop
    pool.close()
    pool.join()

The multi-processing part is not working. Actually, this one is much slower. If I don't do multi-processing and simply traverse the list, it is faster. Thanks in advance for your help!

Recommended answer

There is indeed a faster way. Don't read and parse the file 1000 times. Instead, read it once, parse it once, and store the result. File I/O is one of the slowest things you can do (in any language). In-memory processing is much faster!

Something like this (obviously untested, since I don't have "xyz.log" accessible to me; and for the hawks: obviously I didn't profile it either, but I have a sneaking suspicion that reading a file once is faster than reading it 1000 times):

import gzip

def readFile():
    my_lines = []
    with gzip.open("xyz.log", "rt") as infile:
        for lines in infile:
            line = lines.split(" ")
            my_lines.append(line)
    return my_lines

def getForwardIp(lines, requestID):  # Doesn't look like you need client IP (yet), so I nuked it
    for line in lines:
        myRequestID = line[0]
        forwardIP = line[1]
        if myRequestID == requestID:
            print(forwardIP)

if __name__ == "__main__":
    parsed_lines = readFile()
    request_id_list = list()
    # request_id_list contains >1000 elements
    for id in request_id_list:
        getForwardIp(parsed_lines, id)
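One further refinement, not in the original answer: with >1000 request IDs, each `getForwardIp` call still scans every parsed line. Indexing the lines once in a dict keyed by request ID makes each subsequent lookup O(1). A sketch, assuming the same two-column request-ID/forward-IP layout (the `buildIndex` helper and sample data are illustrative, not from the thread):

```python
from collections import defaultdict

def buildIndex(parsed_lines):
    # Map each request ID to the list of forward IPs seen for it
    index = defaultdict(list)
    for line in parsed_lines:
        requestID, forwardIP = line[0], line[1]
        index[requestID].append(forwardIP)
    return index

# Usage: one dict lookup per request ID instead of a full scan
parsed_lines = [["req1", "10.0.0.1"], ["req2", "10.0.0.2"], ["req1", "10.0.0.3"]]
index = buildIndex(parsed_lines)
print(index["req1"])  # ['10.0.0.1', '10.0.0.3']
```

Building the index costs one pass over the file's lines, after which the >1000 lookups are effectively free.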
