Avoid bad requests due to relative URLs
Problem description
I am trying to crawl a website using Scrapy, and the URLs of every page I want to scrape are written as relative paths of this kind:
<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
<a href="../../en/item-to-scrap.html">Link</a>
In my browser these links work, and they lead to URLs like https://www.domain-name.com/en/item-to-scrap.html (even though the relative path goes up two levels in the hierarchy instead of one).
But my CrawlSpider does not manage to translate these URLs into "correct" ones, and all I get is errors of this kind:
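For reference, the browser's resolution rule can be reproduced outside Scrapy. The following is a minimal sketch, assuming Python 3.5 or later, where `urllib.parse.urljoin` follows RFC 3986 and discards `..` segments that would climb above the site root. The variable names are illustrative only.

```python
from urllib.parse import urljoin

# Page on which the link appears, and the href exactly as written in the HTML.
page_url = "https://www.domain-name.com/en/somelist.html"
href = "../../en/item-to-scrap.html"

# RFC 3986 resolution: the first ".." leaves /en/, the second would go above
# the root and is simply dropped, which is what browsers do as well.
resolved = urljoin(page_url, href)
print(resolved)  # https://www.domain-name.com/en/item-to-scrap.html
```

Older URL-joining code that keeps the surplus `..` would produce a path starting with `/../`, which matches the URL seen in the 400 Bad Request log.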
2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request
Is there a way to fix this, or am I missing something?
Here is my spider's code, which is fairly basic (item URLs match "/en/item-*-scrap.html"):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product
Recommended answer
I finally found a solution thanks to this answer. I used process_links as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]
    # Both rules pass their extracted links through process_links before requesting them.
    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), process_links='process_links', callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), process_links='process_links', follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        # Keep the first non-empty <title> text as the product name.
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product

    def process_links(self, links):
        # Strip every "../" from the extracted URLs so none points above the site root.
        for link in links:
            link.url = link.url.replace("../", "")
        return links
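Note that replace("../", "") removes every "../" wherever it occurs, so a URL such as /a/b/../c.html would become /a/b/c.html rather than /a/c.html. A more conservative alternative is to normalize the path the way RFC 3986 describes. The sketch below is only an illustration: `remove_dot_segments` and `normalize_url` are hypothetical helper names, not part of Scrapy, and the code assumes Python 3 and absolute URLs (paths that start with "/").

```python
from urllib.parse import urlsplit, urlunsplit

def remove_dot_segments(path):
    """Collapse "." and ".." segments of an absolute path (hypothetical helper)."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == "..":
            # Go up one level, but never above the root (the leading "").
            if len(output) > 1:
                output.pop()
        elif segment != ".":
            output.append(segment)
    # A trailing "." or ".." refers to a directory, so keep a trailing slash.
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)

def normalize_url(url):
    """Return url with its path normalized; query and fragment are untouched."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))

print(normalize_url("https://www.domain-name.com/../en/item-to-scrap.html"))
```

Inside process_links, the assignment could then read `link.url = normalize_url(link.url)` instead of the blanket string replacement.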