Python Package For Multi-Threaded Spider w/ Proxy Support?


Problem Description

Instead of just using urllib, does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through HTTP proxies? I know of a few, such as Twisted, Scrapy, and libcurl, but I don't know enough about them to make a decision, or even whether they can use proxies. Anyone know of the best one for my purposes? Thanks!

Recommended Answer

It's simple to implement this in Python.

The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy, ftp_proxy or gopher_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter.
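
For example, a minimal sketch of the proxy setup in Python 2 (the proxy address is hypothetical and should be replaced with a real HTTP proxy):

# Either export http_proxy=http://127.0.0.1:8080 in the shell before starting
# Python, or pass an explicit proxy mapping to urlopen():
from urllib import urlopen

content = urlopen('http://www.python.org/',
                  proxies={'http': 'http://127.0.0.1:8080'}).read()
print content[:200]

With the proxy handled that way, the crawler itself is just worker threads plus a shared queue: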

# -*- coding: utf-8 -*-

import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
from Queue import Queue, Empty
from threading import Thread

# URLs that have already been queued (shared by all worker threads)
visited = set()
# work queue of URLs waiting to be downloaded
queue = Queue()

def get_parser(host, root, charset):

    def parse():
        try:
            while True:
                url = queue.get_nowait()
                try:
                    # urlopen() picks up the http_proxy environment variable,
                    # so the download goes through the proxy transparently
                    content = urlopen(url).read().decode(charset)
                except UnicodeDecodeError:
                    continue
                for link in BeautifulSoup(content).findAll('a'):
                    try:
                        href = link['href']
                    except KeyError:
                        continue
                    if not href.startswith('http://'):
                        # turn site-relative links into absolute URLs
                        href = 'http://%s%s' % (host, href)
                    if not href.startswith('http://%s%s' % (host, root)):
                        # skip links that leave the crawl root
                        continue
                    if href not in visited:
                        visited.add(href)
                        queue.put(href)
                        print href
        except Empty:
            # queue drained: this worker thread exits; note that workers started
            # before the first page has been parsed may see an empty queue and
            # exit immediately
            pass

    return parse

if __name__ == '__main__':
    host, root, charset = sys.argv[1:]
    parser = get_parser(host, root, charset)
    queue.put('http://%s%s' % (host, root))
    # start five worker threads, all running the same parse() closure
    workers = []
    for i in range(5):
        worker = Thread(target=parser)
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
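
As a usage note, assuming the script above is saved as spider.py, it takes three positional arguments: host, root path, and charset. A hypothetical run through a proxy might be export http_proxy=http://127.0.0.1:8080 followed by python spider.py www.python.org / utf-8; each discovered URL under the root is printed as it is queued.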

This concludes the article on "Python Package For Multi-Threaded Spider w/ Proxy Support?". We hope the recommended answer is helpful, and thank you for supporting IT屋!
