Python Scrapy: parse extracted link with another function


Question

I am new to Scrapy and I am scraping Yellow Pages for learning purposes. Everything works fine, but I also want the email address; to get it, I need to visit the links extracted inside parse and parse each of them with another parse_email function, but it does not work.

I tested the parse_email function on its own and it works, but it does not work when called from inside the main parse function. I want parse_email to fetch the source of the link, so I call it as the callback of a Request, but the email field only contains the request itself, like <GET https://www.yellowpages.com/los-angeles-ca/mip/palm-tree-la-7254813?lid=7254813>, where it should contain the email address. For some reason parse_email is not being called and the page is never opened.

Here is my code, with comments on the relevant parts:

import scrapy
import requests
from urlparse import urljoin

scrapy.optional_features.remove('boto')

class YellowSpider(scrapy.Spider):
    name = 'yellow spider'
    start_urls = ['https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Los+Angeles%2C+CA']

    def parse(self, response):
        SET_SELECTOR = '.info'
        for brickset in response.css(SET_SELECTOR):

            NAME_SELECTOR = 'h3 a ::text'
            ADDRESS_SELECTOR = '.adr ::text'
            PHONE = '.phone.primary ::text'
            WEBSITE = '.links a ::attr(href)'


            # Getting the link of the page that has the email using this selector
            EMAIL_SELECTOR = 'h3 a ::attr(href)'

            #extracting the link
            email = brickset.css(EMAIL_SELECTOR).extract_first()

            # joining with the base URL to make a complete URL
            url = urljoin(response.url, brickset.css('h3 a ::attr(href)').extract_first())



            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
                'phone': brickset.css(PHONE).extract_first(),
                'website': brickset.css(WEBSITE).extract_first(),

                #ONLY Returning Link of the page not calling the function

                'email': scrapy.Request(url, callback=self.parse_email),
            }

        NEXT_PAGE_SELECTOR = '.pagination ul a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract()[-1]
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_email(self, response):

        #xpath for the email address in the nested page

        EMAIL_SELECTOR = '//a[@class="email-business"]/@href'

        # returning the extracted email - the XPath works (I checked), but the function is not being called for some reason
        yield {
            'email': response.xpath(EMAIL_SELECTOR).extract_first().replace('mailto:', '')
        }

I don't know what I am doing wrong.

Answer

You are yielding a dict with a Request inside of it; Scrapy won't dispatch that Request because it doesn't know it's there (Requests are not dispatched automatically after being created). You need to yield the actual Request.

In the parse_email function, in order to "remember" which item each email belongs to, you will need to pass the rest of the item data alongside the request. You can do this with the meta argument.
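To see why the dict-wrapped Request is never followed, here is a toy model of the dispatch rule in plain Python (assumed names, not the real Scrapy API; callbacks receive the request itself instead of a response, for brevity):

```python
# Toy model: the engine only schedules objects that ARE Requests.
# A Request buried inside a yielded dict is stored as item data, never followed.

class Request:
    def __init__(self, url, callback=None, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

def run_engine(start_requests):
    items, queue = [], list(start_requests)
    while queue:
        request = queue.pop(0)
        for result in request.callback(request):
            if isinstance(result, Request):
                queue.append(result)      # scheduled and followed
            else:
                items.append(result)      # treated as a scraped item
    return items

def parse_wrong(request):
    # Bug: the Request is buried inside the item dict.
    yield {'email': Request('detail-page', callback=parse_email)}

def parse_right(request):
    # Fix: yield the Request itself, carrying the item via meta.
    yield Request('detail-page', callback=parse_email,
                  meta={'item': {'name': 'Palm Tree LA'}})

def parse_email(request):
    item = request.meta['item']
    item['email'] = 'contact@example.com'
    yield item

print(run_engine([Request('listing', callback=parse_wrong)]))
# the dict still holds an unvisited Request object, exactly like the output in the question
print(run_engine([Request('listing', callback=parse_right)]))
# [{'name': 'Palm Tree LA', 'email': 'contact@example.com'}]
```

The same rule explains the `<GET https://...>` strings in the question's output: the Request object's repr is being serialized as the value of the email field.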

Example:

In parse:

yield scrapy.Request(url, callback=self.parse_email, meta={'item': {
    'name': brickset.css(NAME_SELECTOR).extract_first(),
    'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
    'phone': brickset.css(PHONE).extract_first(),
    'website': brickset.css(WEBSITE).extract_first(),
}})

In parse_email:

item = response.meta['item']  # The item this email belongs to
item['email'] = response.xpath(EMAIL_SELECTOR).extract_first().replace('mailto:', '')
return item
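One caveat with the snippet above: extract_first() returns None when the selector matches nothing, so .replace('mailto:', '') raises AttributeError on listings without an email link. A minimal guard, sketched with a hypothetical helper name:

```python
# clean_email is a hypothetical helper; the 'mailto:' stripping mirrors the answer.
def clean_email(href):
    """Return the address from an extracted 'mailto:' href, or None if absent."""
    return href.replace('mailto:', '') if href else None

print(clean_email('mailto:info@example.com'))  # info@example.com
print(clean_email(None))                       # None
```

In the spider this becomes item['email'] = clean_email(response.xpath(EMAIL_SELECTOR).extract_first()). On recent Scrapy versions, .get() is the preferred spelling of .extract_first().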
