Python Package For Multi-Threaded Spider w/ Proxy Support?


Problem description

Instead of just using urllib, does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through HTTP proxies? I know of a few such as Twisted, Scrapy, libcurl etc., but I don't know enough about them to make a decision, or even whether they can use proxies. Anyone know of the best one for my purposes? Thanks!

Recommended answer

It's simple to implement this in Python.

The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy, ftp_proxy or gopher_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter.
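
As a concrete illustration of the quoted behaviour, here is a minimal sketch (Python 2, the same urllib the spider below uses). The proxy address 127.0.0.1:8080 is only a placeholder; substitute your own.

# -*- coding: utf-8 -*-
import os
from urllib import urlopen

# Option 1: let urlopen() pick the proxy up from the environment, as the
# docs describe (normally exported in the shell before Python starts;
# setting it early in the script, before the first urlopen() call, also works).
os.environ['http_proxy'] = 'http://127.0.0.1:8080'   # placeholder proxy
print urlopen('http://www.example.com/').read()[:100]

# Option 2: pass the proxy mapping explicitly to urlopen().
print urlopen('http://www.example.com/',
              proxies={'http': 'http://127.0.0.1:8080'}).read()[:100]

With the proxy handled that way, the spider itself only needs urllib, a Queue and a handful of worker threads: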

# -*- coding: utf-8 -*-
# Simple multi-threaded spider (Python 2). Proxy support comes for free via
# urlopen(), which honours the http_proxy environment variable (see above).

import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
from Queue import Queue, Empty
from threading import Thread

visited = set()   # URLs already seen (shared by all worker threads)
queue = Queue()   # URLs waiting to be fetched

def get_parser(host, root, charset):

    def parse():
        try:
            while True:
                # Grab the next URL; Empty is raised when the queue runs dry,
                # which is what eventually stops each worker thread.
                url = queue.get_nowait()
                try:
                    content = urlopen(url).read().decode(charset)
                except UnicodeDecodeError:
                    continue
                # Extract every <a href="..."> and keep only links that stay
                # under http://<host><root>. Note: visited is checked and
                # updated without a lock, so a URL can occasionally be queued twice.
                for link in BeautifulSoup(content).findAll('a'):
                    try:
                        href = link['href']
                    except KeyError:
                        continue
                    if not href.startswith('http://'):
                        href = 'http://%s%s' % (host, href)
                    if not href.startswith('http://%s%s' % (host, root)):
                        continue
                    if href not in visited:
                        visited.add(href)
                        queue.put(href)
                        print href
        except Empty:
            pass

    return parse

if __name__ == '__main__':
    # Usage: python spider.py <host> <root-path> <charset>
    host, root, charset = sys.argv[1:]
    parser = get_parser(host, root, charset)
    queue.put('http://%s%s' % (host, root))
    workers = []
    for i in range(5):   # five worker threads
        worker = Thread(target=parser)
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
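
The script takes the host, the root path and the page charset as positional arguments and prints every in-site link it discovers. For example (assuming it is saved as spider.py, with the site name purely as a placeholder):

python spider.py www.example.com / utf-8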

