Using Scrapy to extract data inside links
Question
I have been trying to extract data from consumercomplaints.in: the titles, and the data inside those title links. I wrote the following code, but I am unable to follow the links and extract the data, and I am also unable to extract all the related links. Please guide me.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin

from comp.items import CompItem


class criticspider(CrawlSpider):
    name = "comp"
    allowed_domains = ["consumercomplaints.in"]
    #start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/?search=delhivery&page=3","http://www.consumercomplaints.in/?search=delhivery&page=4","http://www.consumercomplaints.in/?search=delhivery&page=5","http://www.consumercomplaints.in/?search=delhivery&page=6","http://www.consumercomplaints.in/?search=delhivery&page=7","http://www.consumercomplaints.in/?search=delhivery&page=8","http://www.consumercomplaints.in/?search=delhivery&page=9","http://www.consumercomplaints.in/?search=delhivery&page=10","http://www.consumercomplaints.in/?search=delhivery&page=11"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("search=delhivery&page=1/+",)), callback="parse", follow=True),
        #Rule(SgmlLinkExtractor(allow=("startrow=\d",)), callback="parse_health", follow=True),
    )

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.select('//table[@width="100%"]')
        items = []
        for site in sites:
            item = CompItem()
            item['title'] = site.select('.//td[@class="complaint"]/a/span/text()').extract()
            item['link'] = site.select('.//td[@class="complaint"]/a/@href').extract()
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield Request(item['link'],
                              meta={'item': item},
                              callback=self.anchor_page)
            # item['intro'] = site.select('.//td[@class="small"]//a[2]/text()').extract()
            # item['heading'] = site.select('.//td[@class="compl-text"]/div/b[1]/text()').extract()
            # item['date'] = site.select('.//td[@class="small"]/text()[2]').extract()
            # item['complaint'] = site.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)

    def anchor_page(self, response):
        hxs = Selector(response)
        old_item = response.request.meta['item']  # receiving the item from parse() that was passed in the Request meta
        # parse some more values and place them in old_item, e.g.:
        old_item['data'] = hxs.select('.//td[@class="compl-text"]/div/text()').extract()
        yield old_item
Answer
Are you using an old version of Scrapy?

In the latest stable version you don't need to do hxs = Selector(response) or use the hxs.select() method; you can do the same thing with response.xpath().
I think the problem in your code is that the result of select() (or response.xpath()) is actually a Python list, so you need to do:
link = site.select('.//td[@class="complaint"]/a/@href').extract()
if link:
    item['link'] = link[0]
You probably want to do a similar thing for the title too.
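Since extract() always returns a (possibly empty) list, the same guard can be factored into a small helper. A sketch in plain Python; first_or_default is a hypothetical name, not part of Scrapy:

```python
def first_or_default(values, default=None):
    """Return the first element of an extract()-style list, or a default when it is empty."""
    return values[0] if values else default

# extract() returns a list of matched strings, which may be empty
titles = ['Courier not delivered']
links = []

title = first_or_default(titles)     # 'Courier not delivered'
link = first_or_default(links, '')   # '' when nothing matched
```

Indexing with extract()[0] also works, but raises IndexError on a page where the XPath matches nothing, so the guarded form is safer for a crawl.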
I made some changes:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()


class criticspider(CrawlSpider):
    name = "comp"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        Rule(
            SgmlLinkExtractor(allow=("search=delhivery&page=1/+",)),
            callback="parse",
            follow=True),
    )

    def parse(self, response):
        sites = response.xpath('//table[@width="100%"]')
        items = []
        for site in sites:
            item = CompItem()
            item['title'] = site.xpath('.//td[@class="complaint"]/a/span/text()').extract()[0]
            item['link'] = site.xpath('.//td[@class="complaint"]/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)
            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']
        old_item['data'] = response.xpath('.//td[@class="compl-text"]/div/text()').extract()
        yield old_item
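One detail worth noting: the standard library's urljoin already leaves absolute URLs untouched, so the explicit "'http://' not in item['link']" check is redundant. A quick illustration, shown with the Python 3 import path (the answer's Python 2 code gets the same function from urlparse); the complaint path below is invented for illustration:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = 'http://www.consumercomplaints.in/?search=delhivery'

# a relative href is resolved against the page URL...
relative = urljoin(base, '/complaints/delhivery-example-c1.html')
print(relative)

# ...while an absolute href passes through unchanged
absolute = urljoin(base, 'http://www.consumercomplaints.in/complaints/other.html')
print(absolute)
```

This means the body of parse() could simply call urljoin(response.url, link) unconditionally.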