Python Package For Multi-Threaded Spider w/ Proxy Support?


Problem description

Instead of just using urllib, does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through HTTP proxies? I know of a few such as Twisted, Scrapy, libcurl etc., but I don't know enough about them to make a decision, or even whether they can use proxies. Anyone know of the best one for my purposes? Thanks!

Recommended answer

It's simple to implement this in Python.

The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy, ftp_proxy or gopher_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter.
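
As a concrete illustration of the quoted behaviour, here is a minimal sketch (Python 2, the same urllib the spider below uses). The proxy address 127.0.0.1:8080 is only a placeholder; substitute your own.

# -*- coding: utf-8 -*-
import os
from urllib import urlopen

# Option 1: let urlopen() pick the proxy up from the environment, as the
# docs describe (normally exported in the shell before Python starts;
# setting it early in the script, before the first urlopen() call, also works).
os.environ['http_proxy'] = 'http://127.0.0.1:8080'   # placeholder proxy
print urlopen('http://www.example.com/').read()[:100]

# Option 2: pass the proxy mapping explicitly to urlopen().
print urlopen('http://www.example.com/',
              proxies={'http': 'http://127.0.0.1:8080'}).read()[:100]

With the proxy handled that way, the spider itself only needs urllib, a Queue and a handful of worker threads: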

# -*- coding: utf-8 -*-
# Simple multi-threaded spider (Python 2). Proxy support comes for free via
# urlopen(), which honours the http_proxy environment variable (see above).

import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
from Queue import Queue, Empty
from threading import Thread

visited = set()   # URLs already seen (shared by all worker threads)
queue = Queue()   # URLs waiting to be fetched

def get_parser(host, root, charset):

    def parse():
        try:
            while True:
                # Grab the next URL; Empty is raised when the queue runs dry,
                # which is what eventually stops each worker thread.
                url = queue.get_nowait()
                try:
                    content = urlopen(url).read().decode(charset)
                except UnicodeDecodeError:
                    continue
                # Extract every <a href="..."> and keep only links that stay
                # under http://<host><root>. Note: visited is checked and
                # updated without a lock, so a URL can occasionally be queued twice.
                for link in BeautifulSoup(content).findAll('a'):
                    try:
                        href = link['href']
                    except KeyError:
                        continue
                    if not href.startswith('http://'):
                        href = 'http://%s%s' % (host, href)
                    if not href.startswith('http://%s%s' % (host, root)):
                        continue
                    if href not in visited:
                        visited.add(href)
                        queue.put(href)
                        print href
        except Empty:
            pass

    return parse

if __name__ == '__main__':
    # Usage: python spider.py <host> <root-path> <charset>
    host, root, charset = sys.argv[1:]
    parser = get_parser(host, root, charset)
    queue.put('http://%s%s' % (host, root))
    workers = []
    for i in range(5):   # five worker threads
        worker = Thread(target=parser)
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
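
The script takes the host, the root path and the page charset as positional arguments and prints every in-site link it discovers. For example (assuming it is saved as spider.py, with the site name purely as a placeholder):

python spider.py www.example.com / utf-8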

