Scrapy: Extracting data from source and its links


Question

Edited question to link to original:

Scraping data from links in a table

From the link https://www.tdcj.state.tx.us/death_row/dr_info/trottiewillielast.html

I am trying to get info from the main table as well as the data within the other 2 links within the table. I managed to pull from one link, but the question is how to follow the other link and append its data to the same item, in one line.

from urlparse import urljoin  # Python 2; on Python 3 use: from urllib.parse import urljoin

import scrapy
from scrapy.item import Item, Field


class DeathItem(Item):
    firstName = Field()
    lastName = Field()
    Age = Field()
    Date = Field()
    Race = Field()
    County = Field()
    Message = Field()
    Passage = Field()

class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()

            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()

            url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
            url2 = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
            if url.endswith("html"):
                request = scrapy.Request(url, meta={"item": item,"url2" : url2}, callback=self.parse_details)
                yield request
            else:
                yield item
    def parse_details(self, response):
        item = response.meta["item"]
        url2 = response.meta["url2"]
        item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
        request = scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
        return request

    def parse_details2(self, response):
        item = response.meta["item"]
        item['Passage'] = response.xpath("//p/text()").extract_first()
        return item

I understand how we pass arguments to a request through meta. But the flow is still unclear to me, and at this point I am unsure whether this is possible or not. I have viewed several examples, including the ones below:

Extracting data inside links using scrapy

How can I use multiple requests and pass items between them in scrapy python

Technically the data will reflect the main table, just with both links contributing the data from within their pages.

Any help or direction is appreciated.

Answer

The problem in this case is in this piece of code

if url.endswith("html"):
    yield scrapy.Request(url, meta={"item": item}, callback=self.parse_details)
else:
    yield item

if url2.endswith("html"):
    yield scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
else:
    yield item

By requesting a link you are creating a new "thread" that will take its own course of life, so the function parse_details won't be able to see what is being done in parse_details2. The way I would do it is call one within the other, this way:

url = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
url2 = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())

if url.endswith("html"):
    request = scrapy.Request(url, callback=self.parse_details)
    request.meta['item'] = item
    request.meta['url2'] = url2
    yield request
elif url2.endswith("html"):
    request = scrapy.Request(url2, callback=self.parse_details2)
    request.meta['item'] = item
    yield request
else:
    yield item


def parse_details(self, response):
    item = response.meta["item"]
    url2 = response.meta["url2"]
    item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
    if url2:
        request = scrapy.Request(url2, callback=self.parse_details2)
        request.meta['item'] = item
        yield request
    else:
        yield item
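To see why this chaining works, here is a library-free sketch of the flow. It does not use Scrapy: `FakeRequest`, `FakeResponse`, and the `run` driver loop are hypothetical stand-ins that mimic how the engine threads one item dict through two chained callbacks via meta before it is finally yielded:

```python
# Minimal stand-ins for scrapy.Request / Response and the engine loop,
# to illustrate how one item travels through chained callbacks via meta.

class FakeRequest(object):
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class FakeResponse(object):
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta  # Scrapy copies request.meta onto the response

def parse(response):
    item = {"lastName": "Trottie"}  # scraped from the main table row
    yield FakeRequest("trottiewillielast.html", parse_details,
                      meta={"item": item, "url2": "trottiewillie.html"})

def parse_details(response):
    item = response.meta["item"]
    item["Message"] = "last statement text"  # scraped from the first link
    # Chain the second request instead of yielding the item now.
    yield FakeRequest(response.meta["url2"], parse_details2,
                      meta={"item": item})

def parse_details2(response):
    item = response.meta["item"]
    item["Passage"] = "offender info text"  # scraped from the second link
    yield item  # the complete item, all fields in one line

def run(start_callback):
    """Tiny engine loop: follow requests until items come out."""
    queue = list(start_callback(None))
    items = []
    while queue:
        result = queue.pop(0)
        if isinstance(result, FakeRequest):
            queue.extend(result.callback(FakeResponse(result)))
        else:
            items.append(result)
    return items

print(run(parse))
```

The key point the sketch shows: the item is never yielded until the last callback in the chain, so every field scraped along the way ends up on the same dict.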

This code hasn't been tested thoroughly so comment as you test
