How to use python-requests and event hooks to write a web crawler with a callback function?


Problem description

I've recently taken a look at the python-requests module and I'd like to write a simple web crawler with it. Given a collection of start urls, I want to write a Python function that searches the webpage content of the start urls for other urls, then calls the same function again as a callback with the new urls as input, and so on. At first I thought that event hooks would be the right tool for this purpose, but that part of the documentation is quite sparse. On another page I read that functions used as event hooks have to return the same object that was passed to them, so event hooks are apparently not feasible for this kind of task. Or I simply didn't get it right...
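To make concrete what I mean, here is a minimal sketch (my own illustration, assuming a current version of requests) of its 'response' event hook; the hook receives the finished response and can inspect or replace it, but it has no way to schedule further requests:

import requests

def log_response(response, *args, **kwargs):
    # The "response" hook fires once per completed response; it can
    # observe or replace the response, but cannot enqueue new requests.
    print(response.url)

requests.get("https://example.com", hooks={"response": log_response})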

Here is some pseudocode of what I want to do (borrowed from a pseudo Scrapy spider):

import lxml.html

def parse(response):
    for url in lxml.html.parse(response.url).xpath('//@href'):
        yield Request(url=url, callback=parse)

Can someone give me some insight into how to do this with python-requests? Are event hooks the right tool for that, or do I need something different? (Note: Scrapy is not an option for me for various reasons.) Thanks a lot!

Solution

Here is how I would do it:

import grequests
from bs4 import BeautifulSoup


def get_urls_from_response(r):
    # Parse the response body and collect the href of every <a> tag.
    soup = BeautifulSoup(r.text, "html.parser")
    urls = [link.get('href') for link in soup.find_all('a')]
    return urls


def print_url(response, *args, **kwargs):
    # requests "response" event hook: called once per completed response.
    print(response.url)


def recursive_urls(urls):
    """
    Given a list of starting urls, recursively finds all descendant urls
    """
    if len(urls) == 0:
        return
    rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
    # grequests.map returns None for requests that failed, so filter those out.
    responses = [r for r in grequests.map(rs) if r is not None]
    url_lists = [get_urls_from_response(response) for response in responses]
    urls = sum(url_lists, [])  # flatten list of lists into a list
    recursive_urls(urls)

I haven't tested the code but the general idea is there.

Note that I am using grequests instead of requests for a performance boost. grequests is basically gevent + requests, and in my experience it is much faster for this sort of task, because the links are retrieved asynchronously with gevent.
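One caveat, since the code is untested: nothing tracks which pages have already been fetched, so pages that link back to one another will be re-crawled forever. A minimal guard, my addition rather than part of the original answer, with a hypothetical seen parameter:

def recursive_urls(urls, seen=None):
    """Crawl urls, skipping any url that has already been fetched."""
    if seen is None:
        seen = set()
    # Drop empty hrefs and anything we have already visited.
    urls = [u for u in urls if u and u not in seen]
    if not urls:
        return
    seen.update(urls)
    rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
    responses = [r for r in grequests.map(rs) if r is not None]
    url_lists = [get_urls_from_response(r) for r in responses]
    recursive_urls(sum(url_lists, []), seen)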


Edit: here is the same algorithm without using recursion:

import grequests
from bs4 import BeautifulSoup


def get_urls_from_response(r):
    # Parse the response body and collect the href of every <a> tag.
    soup = BeautifulSoup(r.text, "html.parser")
    urls = [link.get('href') for link in soup.find_all('a')]
    return urls


def print_url(response, *args, **kwargs):
    # requests "response" event hook: called once per completed response.
    print(response.url)


def recursive_urls(urls):
    """
    Given a list of starting urls, iteratively finds all descendant urls
    """
    while urls:
        rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
        # grequests.map returns None for requests that failed, so filter those out.
        responses = [r for r in grequests.map(rs) if r is not None]
        url_lists = [get_urls_from_response(response) for response in responses]
        urls = sum(url_lists, [])  # flatten list of lists into a list

if __name__ == "__main__":
    recursive_urls(["INITIAL_URLS"])
