Improve Speed of Python Script: Multithreading or Multiple Instances?

Problem Description

I have a Python script that I'd like to run every day, and I'd prefer that it only take 1-2 hours to run. It's currently set up to hit 4 different APIs for a given URL, capture the results, and then save the data into a PostgreSQL database. The problem is that I have over 160,000 URLs to go through, and the script ends up taking a really long time -- I ran some preliminary tests, and in its current format it would take over 36 hours to work through every URL. So, my question boils down to: should I optimize my script to run multiple threads at the same time, or should I scale out the number of servers I'm using? Obviously the second approach will be more costly, so I'd prefer to have multiple threads running on the same instance.

I'm using a library I created (SocialAnalytics) which provides methods to hit the different API endpoints and parse the results. Here's how I have my script configured:

import psycopg2
from socialanalytics import pinterest
from socialanalytics import facebook
from socialanalytics import twitter
from socialanalytics import google_plus
from time import strftime, sleep

conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
cur = conn.cursor()

# Select all URLs
cur.execute("SELECT * FROM urls;")
urls = cur.fetchall()

for url in urls:

    # Pinterest
    try:
        p = pinterest.getPins(url[2])
    except Exception:
        p = { 'pin_count': 0 }
    # Facebook
    try:
        f = facebook.getObject(url[2])
    except Exception:
        f = { 'comment_count': 0, 'like_count': 0, 'share_count': 0 }
    # Twitter
    try:
        t = twitter.getShares(url[2])
    except Exception:
        t = { 'share_count': 0 }
    # Google
    try:
        g = google_plus.getPlusOnes(url[2])
    except Exception:
        g = { 'plus_count': 0 }

    # Save results
    try:
        now = strftime("%Y-%m-%d %H:%M:%S")
        cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES(%s, %s, %s, %s, %s, %s, %s, %s);", (now, p['pin_count'], f['like_count'], f['share_count'], f['comment_count'], t['share_count'], g['plus_count']))
        conn.commit()
    except Exception:
        conn.rollback()

You can see that each call to the API is using the Requests library, which is a synchronous, blocking affair. After some preliminary research I discovered Treq, which is an API on top of Twisted. The asynchronous, non-blocking nature of Twisted seems like a good candidate for improving my approach, but I've never worked with it and I'm not sure how exactly (and if) it'll help me achieve my goal.
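
For reference, here is a minimal sketch of what the Treq approach might look like. This is only my assumption about the usage (fetch_one and fetch_all are hypothetical helpers), with a DeferredSemaphore capping the number of in-flight requests:

import treq
from twisted.internet import defer, reactor

sem = defer.DeferredSemaphore(20)  # cap the number of concurrent requests

@defer.inlineCallbacks
def fetch_one(url):
    # Issue a non-blocking GET and read the body without blocking the reactor
    response = yield treq.get(url)
    body = yield treq.text_content(response)
    defer.returnValue((url, body))

def fetch_all(urls):
    deferreds = [sem.run(fetch_one, url) for url in urls]
    d = defer.gatherResults(deferreds, consumeErrors=True)
    d.addBoth(lambda _: reactor.stop())
    return d

# fetch_all(list_of_urls); reactor.run()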

Any guidance is much appreciated!

Recommended Answer

First, you should measure how much time your script spends on each step. You may discover something interesting :)
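
For example, a quick-and-dirty way to see where the time goes (a sketch using time.perf_counter around one of the calls):

from time import perf_counter

start = perf_counter()
p = pinterest.getPins(url[2])
print('pinterest.getPins took %.2fs' % (perf_counter() - start))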

Second, you can split your URLs into chunks:

chunk_size = len(urls) // cpu_core_count  # don't forget about the remainder of the division
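
For example (assuming cpu_core_count comes from multiprocessing.cpu_count()), slicing handles the remainder naturally:

import multiprocessing as mp

cpu_core_count = mp.cpu_count()
chunk_size = max(1, len(urls) // cpu_core_count)
# Any remainder simply ends up in one extra, shorter chunk
chunks = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]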

After these steps you can use multiprocessing to process each chunk in parallel. Here is an example:

import multiprocessing as mp

p = mp.Pool(5)  # number of worker processes

# First solution: process one chunk at a time, saving between chunks
# (get_social_stat and save_to_db are sketched below)
for urls_chunk in chunks:  # chunks = [(url1...url6), (url7...url12), ...]
    res = p.map(get_social_stat, urls_chunk)
    for record in res:
        save_to_db(record)

# Or, more simply, map over the whole URL list at once
res = p.map(get_social_stat, urls)

for record in res:
    save_to_db(record)
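
Here, get_social_stat is assumed to wrap the four API calls from the question and return one record per URL, while save_to_db performs the INSERT from the original script; a rough sketch:

def get_social_stat(url):
    # Hypothetical worker: same lookups as the original script,
    # falling back to zero counts when an API call fails. url[2] holds the URL.
    try:
        p = pinterest.getPins(url[2])
    except Exception:
        p = {'pin_count': 0}
    try:
        f = facebook.getObject(url[2])
    except Exception:
        f = {'comment_count': 0, 'like_count': 0, 'share_count': 0}
    try:
        t = twitter.getShares(url[2])
    except Exception:
        t = {'share_count': 0}
    try:
        g = google_plus.getPlusOnes(url[2])
    except Exception:
        g = {'plus_count': 0}
    return (p, f, t, g)

Note that each worker runs in a separate process, so the database connection should live in the parent process (inside save_to_db) rather than in the workers.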

Also, gevent can help, since it can cut down the time spent working through a sequence of synchronous, blocking requests.
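
A minimal sketch of that approach (assuming the same hypothetical get_social_stat; monkey-patching makes the blocking Requests calls cooperative):

from gevent import monkey
monkey.patch_all()  # must run before other imports so socket calls cooperate

from gevent.pool import Pool

pool = Pool(20)  # up to 20 concurrent greenlets
for record in pool.imap_unordered(get_social_stat, urls):
    save_to_db(record)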
