Scrapy Modify Link to include Domain Name


Question

I have an item, item['link'], of this form:

item['link'] = site.select('div[2]/div/h3/a/@href').extract()

The links it extracts are of this form:

'link': [u'/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'],

I want them like this:

'link': [u'http://www.youtube.com/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'],

Is it possible to do this directly in Scrapy, instead of re-editing the list afterwards?
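For reference, the desired transformation is just resolving each extracted href against the page URL. A minimal sketch using the stdlib `urljoin` (`urlparse.urljoin` on Python 2), with stand-in values for `response.url` and the XPath result from the question:

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

# Stand-ins for response.url and site.select(...).extract():
response_url = 'http://www.youtube.com/playlist?list=SP3DB54B154E6D121D'
extracted = ['/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189']

# Join each relative href against the page URL as it is extracted,
# instead of post-processing the item afterwards:
item_link = [urljoin(response_url, href) for href in extracted]
print(item_link)
# → ['http://www.youtube.com/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189']
```

Because each href starts with "/", `urljoin` keeps only the scheme and host from the base URL and replaces the rest.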

Answer

Yeah, every time I'm grabbing a link I have to use the urlparse.urljoin method.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # only grab urls with "content" in the href
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
    for i in urls:
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

I imagine you're trying to grab the entire URL to parse it, right? If that's the case, a simple two-method system works on a BaseSpider: the parse method finds the link and sends it to the parse_url method, which outputs what you're extracting to the pipeline.

import urlparse  # urllib.parse in Python 3

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector


def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # only grab urls with "content" in the href
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
    for i in urls:
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()
    # this grabs it
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()
    return item
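As a side note on the `urljoin(response.url, i[1:])` call above: slicing off the leading "/" changes how the href is resolved. A short sketch with a hypothetical page URL showing both cases:

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

base = 'http://www.youtube.com/user/example'  # hypothetical page URL

# A root-relative href (leading "/") is resolved against the site root:
print(urljoin(base, '/watch?v=1PTw-uy6LA0'))
# → http://www.youtube.com/watch?v=1PTw-uy6LA0

# A bare relative href (what i[1:] produces) is resolved against the
# directory of the current page instead:
print(urljoin(base, 'watch?v=1PTw-uy6LA0'))
# → http://www.youtube.com/user/watch?v=1PTw-uy6LA0
```

So for links that already start with "/", passing `i` unchanged to `urljoin` is usually what you want; the `i[1:]` variant only makes sense when crawling from the site root.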

