Scrapy: Extracting data from source and its links
Question
Edited question to link to original:
From the link https://www.tdcj.state.tx.us/death_row/dr_info/trottiewillielast.html
I am trying to get info from the main table as well as the data within the other two links in the table. I managed to pull data from one link, but my question is how to go to the other link and append its data to the same row.
from urlparse import urljoin  # Python 2; on Python 3 this is urllib.parse

import scrapy
from scrapy.item import Item, Field  # needed for the inline item definition below
# (the original imported DeathItem from texasdeath.items, but the class is
# redefined right here, so that import is redundant)


class DeathItem(Item):
    firstName = Field()
    lastName = Field()
    Age = Field()
    Date = Field()
    Race = Field()
    County = Field()
    Message = Field()
    Passage = Field()


class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()
            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()
            url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
            url2 = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
            if url.endswith("html"):
                request = scrapy.Request(url, meta={"item": item, "url2": url2}, callback=self.parse_details)
                yield request
            else:
                yield item

    def parse_details(self, response):
        item = response.meta["item"]
        url2 = response.meta["url2"]
        item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
        request = scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
        return request

    def parse_details2(self, response):
        item = response.meta["item"]
        item['Passage'] = response.xpath("//p/text()").extract_first()
        return item
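For reference, the `urljoin` calls above resolve the relative hrefs from the table cells against the listing-page URL, and the `.endswith("html")` check decides whether a link is worth following. A quick stand-alone illustration (the relative paths here are made-up examples, not scraped values):

```python
from urllib.parse import urljoin  # Python 3 equivalent of Python 2's urlparse.urljoin

base = "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"

# hypothetical hrefs as they might appear in the table cells
detail = urljoin(base, "dr_info/trottiewillielast.html")
photo = urljoin(base, "dr_info/trottiewillie.jpg")

print(detail)                   # resolved against the directory of `base`
print(detail.endswith("html"))  # True  -> follow the link
print(photo.endswith("html"))   # False -> yield the item as-is
```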
I understand how we pass arguments to a request through meta. But the flow is still unclear to me, and at this point I am unsure whether this is even possible. I have viewed several examples, including this one:
How can I use multiple requests and pass items between them in scrapy python
Technically, the output will mirror the main table, with each row also containing the data from within both of its links.
Any help or direction is appreciated.
Answer
The problem in this case is in this piece of code:
if url.endswith("html"):
    yield scrapy.Request(url, meta={"item": item}, callback=self.parse_details)
else:
    yield item
if url2.endswith("html"):
    yield scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
else:
    yield item
By requesting a link you are creating a new "thread" that takes its own course of life, so the function parse_details won't be able to see what is being done in parse_details2. The way I would do it is to call one within the other, like this:
url = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
url2 = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
if url.endswith("html"):
    request = scrapy.Request(url, callback=self.parse_details)
    request.meta['item'] = item
    request.meta['url2'] = url2
    yield request
elif url2.endswith("html"):
    request = scrapy.Request(url2, callback=self.parse_details2)
    request.meta['item'] = item
    yield request
else:
    yield item

def parse_details(self, response):
    item = response.meta["item"]
    url2 = response.meta["url2"]
    # field name matches the item definition ('Message' rather than an
    # undeclared key, which would raise a KeyError on a scrapy Item)
    item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
    if url2:
        request = scrapy.Request(url2, callback=self.parse_details2)
        request.meta['item'] = item
        yield request
    else:
        yield item
This code hasn't been tested thoroughly, so comment as you test.
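To see why chaining succeeds where the two parallel requests failed, the control flow can be simulated in plain Python. This is a toy model, not Scrapy: a driver loop stands in for the downloader, pages are canned strings, and all names and values are illustrative.

```python
# Toy simulation of Scrapy's request/callback chaining. A "request" is a
# (url, callback, meta) tuple; the driver "downloads" canned page text and
# feeds it to the callback, one request at a time.
PAGES = {
    "last.html": "Last statement text",
    "info.html": "Offender info text",
}

def parse(row):
    item = {"firstName": row["first"]}
    # hand url2 forward in meta so parse_details can continue the chain
    yield ("request", row["url"], parse_details, {"item": item, "url2": row["url2"]})

def parse_details(body, meta):
    item = meta["item"]
    item["Message"] = body
    # chain: request the second page instead of yielding the item now
    yield ("request", meta["url2"], parse_details2, {"item": item})

def parse_details2(body, meta):
    item = meta["item"]
    item["Passage"] = body
    yield ("item", item)

def crawl(row):
    """Drive the callback chain to completion and collect finished items."""
    pending = list(parse(row))
    items = []
    while pending:
        kind, *rest = pending.pop()
        if kind == "item":
            items.append(rest[0])
        else:
            url, callback, meta = rest
            pending.extend(callback(PAGES[url], meta))
    return items
```

Each callback either schedules the next request (carrying the partial item forward in meta) or emits the finished item, so nothing is yielded until the last page in the chain has been parsed — for example, `crawl({"first": "John", "url": "last.html", "url2": "info.html"})` returns a single item with all three fields filled in.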