Python multiprocessing Pool vs Process


Question

I'm new to Python multiprocessing. I don't quite understand the difference between Pool and Process. Can someone suggest which one I should use for my needs?

I have thousands of HTTP GET requests to send. After sending each one and getting the response, I want to store the response (a simple int) in a (shared) dict. My final goal is to write all the data in the dict to a file.

This is not CPU intensive at all. My whole goal is to speed up sending the HTTP GET requests because there are so many of them. The requests are all isolated and do not depend on each other.

Shall I use Pool or Process in this case?

Thanks!

---- The code below was added on 8/28 ----

I implemented this with multiprocessing. The key challenges I'm facing are:

1) A GET request can sometimes fail. I have to allow 3 retries to minimize the need to rerun my code/all requests. I only want to retry the failed ones. Can I achieve this with async HTTP requests without using Pool?

2) I want to check the response value of every request and have exception handling.

The code below is simplified from my actual code. It is working fine, but I wonder if it's the most efficient way of doing things. Can anyone give any suggestions? Thanks a lot!

import time

import requests
from multiprocessing import Pool

def get_data(endpoint, get_params):
    response = requests.get(endpoint, params=get_params)
    if response.status_code != 200:
        raise Exception("bad response for " + str(get_params))
    return response.json()

def get_currency_data(endpoint, currency, date):
    get_params = {'currency': currency,
                  'date': date
                  }
    for attempt in range(3):
        try:
            output = get_data(endpoint, get_params)
            # additional return value check
            # ......
            return output['value']
        except Exception:
            time.sleep(1)  # I found that sleeping for 1s almost always makes the retry succeed
    return 'error'

def get_all_data(currencies, dates):
    # I have many dates, but not too many currencies
    # 'endpoint' is defined elsewhere in my actual code
    for currency in currencies:
        results = []
        pool = Pool(processes=20)
        for date in dates:
            results.append(pool.apply_async(get_currency_data, args=(endpoint, currency, date)))
        output = [p.get() for p in results]
        pool.close()
        pool.join()
        time.sleep(10)  # Unfortunately I have to give the server some time to rest. I found it helps to reduce failures. I didn't write the server; this is not something I can control.

Answer

Neither. Use asynchronous programming. Consider the code below, pulled directly from an article on aiohttp by Paweł Miech (credit goes to him):

#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def run(r):
    url = "http://localhost:8080/{}"
    tasks = []

    # Fetch all responses within one Client session,
    # keep connection alive for all requests.
    async with ClientSession() as session:
        for i in range(r):
            task = asyncio.ensure_future(fetch(url.format(i), session))
            tasks.append(task)

        responses = await asyncio.gather(*tasks)
        # you now have all response bodies in this variable
        print(responses)

def print_responses(result):
    print(result)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(4))
loop.run_until_complete(future)

You could just create an array of URLs and, instead of the given code, loop over that array and hand each one to fetch.
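
To also cover the retry and response-value checks from the question with this approach, here is a minimal sketch of how the aiohttp example could be adapted; ENDPOINT, fetch_value, and the sample currency/date values are illustrative placeholders, not part of the original answer.

import asyncio
from aiohttp import ClientSession

ENDPOINT = 'http://example.com/api'  # placeholder for the real endpoint

async def fetch_value(session, currency, date, retries=3):
    # Mirrors the question's get_currency_data: up to 3 attempts with a short pause.
    params = {'currency': currency, 'date': date}
    for attempt in range(retries):
        try:
            async with session.get(ENDPOINT, params=params) as response:
                if response.status != 200:
                    raise Exception('bad response for ' + str(params))
                payload = await response.json()
                return payload['value']
        except Exception:
            await asyncio.sleep(1)
    return 'error'

async def run(currencies, dates):
    # One task per (currency, date) pair, all sharing a single session.
    async with ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch_value(session, c, d))
                 for c in currencies for d in dates]
        return await asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
print(loop.run_until_complete(run(['USD'], ['2018-08-01', '2018-08-02'])))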

As per @roganjosh's comment below, requests_futures is a super-easy way to accomplish this:

from requests_futures.sessions import FuturesSession
sess = FuturesSession()
urls = ['http://google.com', 'https://stackoverflow.com']
responses = {url: sess.get(url) for url in urls}
contents = {url: future.result().content 
            for url, future in responses.items()
            if future.result().status_code == 200}
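
Since FuturesSession subclasses requests.Session, you should also be able to get the question's retry behaviour for free by mounting an HTTPAdapter with a urllib3 Retry policy onto the session; a sketch, with the retry count chosen to match the question:

from requests.adapters import HTTPAdapter
from requests_futures.sessions import FuturesSession
from urllib3.util.retry import Retry

# Retry connection errors and 5xx responses up to 3 times with a short back-off.
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
sess = FuturesSession()
sess.mount('http://', HTTPAdapter(max_retries=retry))
sess.mount('https://', HTTPAdapter(max_retries=retry))

urls = ['http://google.com', 'https://stackoverflow.com']
futures = {url: sess.get(url) for url in urls}
statuses = {url: f.result().status_code for url, f in futures.items()}
print(statuses)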



Using grequests (supports Python 2.7)

You can also use grequests, which supports Python 2.7, to perform asynchronous URL calls.

import grequests
urls = ['http://google.com', 'http://stackoverflow.com']
responses = grequests.map(grequests.get(u) for u in urls)
print([len(r.content) for r in responses])
# [10475, 250785]
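
If the question's error handling matters here as well, grequests.map also accepts a size argument to cap concurrency and an exception_handler callback for requests that raise; a brief sketch (the on_error handler name is mine):

import grequests

def on_error(request, exception):
    # Called for requests that raise (e.g. connection errors); the result list
    # will contain None in that request's position.
    print('request failed:', request.url, exception)

urls = ['http://google.com', 'http://stackoverflow.com']
responses = grequests.map((grequests.get(u) for u in urls),
                          size=20, exception_handler=on_error)
print([r.status_code if r is not None else 'error' for r in responses])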



Using multiprocessing

If you want to do this using multiprocessing, you can. Disclaimer: You're going to have a ton of overhead by doing this, and it won't be anywhere near as efficient as async programming... but it is possible.

It's actually pretty straightforward: you map the URLs through the HTTP GET function:

import requests
urls = ['http://google.com', 'http://stackoverflow.com']
from multiprocessing import Pool
pool = Pool(8)
responses = pool.map(requests.get, urls)

The size of the pool will be the number of simultaneously issued GET requests. Sizing it up should increase your network efficiency, but it will add overhead on the local machine for communication and forking.

Again, I don't recommend this, but it certainly is possible, and if you have enough cores it's probably faster than doing the calls synchronously.
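
If you do go the multiprocessing route, the extra endpoint/currency arguments from the question can be bound with functools.partial so that only the date varies per task. A minimal, self-contained sketch with a placeholder endpoint and values (fetch_value simply re-implements the question's retry loop):

import time
from functools import partial
from multiprocessing import Pool

import requests

def fetch_value(endpoint, currency, date, retries=3):
    # Same retry pattern as the question's get_currency_data.
    for attempt in range(retries):
        try:
            r = requests.get(endpoint, params={'currency': currency, 'date': date})
            if r.status_code != 200:
                raise Exception('bad response for %s %s' % (currency, date))
            return r.json()['value']
        except Exception:
            time.sleep(1)
    return 'error'

if __name__ == '__main__':
    endpoint = 'http://example.com/api'           # placeholder URL
    dates = ['2018-08-01', '2018-08-02']          # placeholder dates
    with Pool(processes=20) as pool:
        worker = partial(fetch_value, endpoint, 'USD')  # fix endpoint and currency
        print(pool.map(worker, dates))                  # one call per date, up to 20 at once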
