Make scrapy follow links in order


Problem description


I wrote a script that uses Scrapy to find links in a first phase, then follows those links and extracts something from each destination page in a second phase. Scrapy DOES this, BUT it follows the links in an unordered manner, i.e. I expect output like this:

link1 | data_extracted_from_link1_destination_page
link2 | data_extracted_from_link2_destination_page
link3 | data_extracted_from_link3_destination_page
.
.
.

But I get:

link1 | data_extracted_from_link2_destination_page
link2 | data_extracted_from_link3_destination_page
link3 | data_extracted_from_link1_destination_page
.
.
.

Here is my code:

import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'

        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())

            yield {"LinkText": LinkText}
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents)

    def parse_contents(self, response):

        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        yield {"LinkContent": sContent}

What is wrong with my code?

Answer


yield is not synchronous: requests are scheduled and downloaded concurrently, so responses do not come back in the order the requests were yielded. You should pass the link text along in the request's meta so each item stays paired with its link. Doc: https://doc.scrapy.org/en/latest/topics/request-response.html
Code:

import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'

        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())
            # Carry the link text along with the request so the callback
            # can emit it together with the scraped content.
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents,
                                 meta={"LinkText": LinkText})

    def parse_contents(self, response):
        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = "".join(lContent).replace("\n", "").replace("\t", "")
        linkText = response.meta['LinkText']
        yield {"LinkContent": sContent, "LinkText": linkText}
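The idea can be sketched without Scrapy at all. Below is a minimal plain-Python illustration (the URLs and field names are made up for this sketch): even if responses come back in a shuffled order, each one carries its own context, so the link and its data stay paired.

```python
import random

# Fake "requests", each carrying its link text as context (like Scrapy's meta).
requests = [{"url": f"link{i}", "meta": {"LinkText": f"link{i}"}}
            for i in range(1, 4)]

# Scrapy gives no ordering guarantee for responses; simulate that by shuffling.
responses = list(requests)
random.shuffle(responses)

# Build items from the (out-of-order) responses using the attached context.
items = [{"LinkText": r["meta"]["LinkText"],
          "LinkContent": f"data_extracted_from_{r['url']}_destination_page"}
         for r in responses]

# Regardless of arrival order, every item pairs the right link with its data.
for item in items:
    assert item["LinkContent"] == (
        f"data_extracted_from_{item['LinkText']}_destination_page")
```

Note that in newer Scrapy versions (1.7+), `cb_kwargs` is the recommended way to pass user data to a callback, with `meta` reserved mainly for components like middlewares; the `meta` approach in the answer still works.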

