What is a good practice/design to thread multiple SQL queries in Python


Problem description

I am using MySQLdb and BeautifulSoup.

The website is organized into about 15 different cities, and each city has anywhere from 10 to 150 pages. There are about 500 pages in total.

For each page per city, I open the site using BeautifulSoup, extract all the necessary information, then perform an INSERT INTO or UPDATE SQL query.

Currently I am not using threads, and it takes a few minutes to go through all 500 pages, because the Python program...

  1. Opens a page.
  2. Extracts the information.
  3. Performs the SQL query.
  4. Opens the next page...

Ideally I would want to balance the load by having, say, 10 concurrent threads that each open about 50 pages. But I think that may be too complicated to code.

So instead I am thinking of having one thread per city. How would I accomplish this?

Currently my code looks something like this:

# import threading
from BeautifulSoup import BeautifulSoup
import urllib2
import MySQLdb

con = MySQLdb.connect( ... )

def open_page( url ):
    cur = con.cursor()
    # do SQL query

# List of city URLs

cities = [
    'http://example.com/atlanta/',
    'http://example.com/los-angeles/',
    ...
    'http://example.com/new-york/'
]

for city_url in cities:
    soup = BeautifulSoup( urllib2.urlopen( city_url ) )

    # find every page per city
    pages = soup.findAll( 'div', { 'class' : 'page' } )

    for page in pages:
        page_url = page.find( 'a' )[ 'href' ]
        open_page( page_url )

Recommended answer

Your initial idea is absolutely feasible. Just start 10 worker threads that all wait for input on one and the same queue. Then your main process puts the URLs into this queue, and the load balancing will happen automatically.
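
Below is a minimal sketch of that pattern, in the same Python 2 style as the question (where the standard-library queue module is spelled Queue); the worker count, queue name, and all_page_urls list are illustrative, not from the original post:

import Queue       # Python 2 name of the standard-library queue module
import threading
import urllib2
from BeautifulSoup import BeautifulSoup

NUM_WORKERS = 10          # illustrative; tune to your bandwidth
url_queue = Queue.Queue() # the main process feeds page URLs in here

def worker():
    while True:
        page_url = url_queue.get()
        try:
            soup = BeautifulSoup( urllib2.urlopen( page_url ) )
            # ... extract the information and do the SQL work here ...
        finally:
            url_queue.task_done()  # always mark the URL as handled

for _ in range( NUM_WORKERS ):
    t = threading.Thread( target = worker )
    t.daemon = True  # daemon threads die when the main thread exits
    t.start()

# Collect every page URL first (the per-city loop from the question),
# then let the pool drain the queue. Idle workers simply pick up the
# next item, which is where the automatic load balancing comes from.
all_page_urls = []  # e.g. filled from the per-city 'div.page' links
for page_url in all_page_urls:
    url_queue.put( page_url )

url_queue.join()  # block until every queued page has been processed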

If your SQL bindings are thread-safe, you can do the INSERT or UPDATE work in the worker threads. Otherwise, I'd add one more thread for the SQL work, waiting for input on a different queue. Then your worker threads would put each query into this queue, and the SQL thread would execute it.
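
Under the same assumptions, a sketch of that single-writer variant might look like the following; the sql_queue name, the STOP sentinel, the connection parameters, and the pages table in the example query are all made up for illustration:

import Queue
import threading
import MySQLdb

sql_queue = Queue.Queue()  # workers put (query, params) tuples in here
STOP = object()            # sentinel telling the SQL thread to finish

def sql_writer():
    # The connection lives entirely in this one thread, so the
    # bindings never have to be shared between threads.
    con = MySQLdb.connect( host='localhost', user='user',
                           passwd='password', db='mydb' )  # placeholders
    cur = con.cursor()
    while True:
        item = sql_queue.get()
        if item is STOP:
            break
        query, params = item
        cur.execute( query, params )
        con.commit()  # commit per item; batch these if it proves slow
    con.close()

writer = threading.Thread( target = sql_writer )
writer.start()

# A worker thread hands its result off instead of touching the DB;
# the table and columns here are hypothetical:
sql_queue.put( ( "INSERT INTO pages (url, body) VALUES (%s, %s)",
                 ( "http://example.com/atlanta/page-1", "scraped text" ) ) )

# Once all worker threads have finished:
sql_queue.put( STOP )
writer.join()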

If you google for "python worker threads queue" you'll find a few examples.
