Scrapy LinkExtractor ignores parameters behind the # sign and thus will not follow the link


Question

I am trying to crawl a website with Scrapy where the pagination sits behind the "#" sign. This somehow makes Scrapy ignore everything after that character, so it only ever sees the first page.

For example:

http://www.rolex.de/de/watches/find-rolex.html#g=1&p=2

If you replace the # with a question mark manually, the site still loads page 1:

http://www.rolex.de/de/watches/find-rolex.html?p=2

The Scrapy log output tells me it fetched only the first page:

DEBUG: Crawled (200) <GET http://www.rolex.de/de/watches/datejust/m126334-0014.html> (referer: http://www.rolex.de/de/watches/find-rolex.html)

My crawler looks like this:

start_urls = [
    'http://www.rolex.de/de/watches/find-rolex.html#g=1',
    'http://www.rolex.de/de/watches/find-rolex.html#g=0&p=2',
    'http://www.rolex.de/de/watches/find-rolex.html#g=0&p=3',
]

rules = (
    # Product detail pages: hand each one to parse_item.
    Rule(
        LinkExtractor(allow=[r'.*/de/watches/.*/m\d{3,}.*.\.html']),
        callback='parse_item'
    ),
    # Listing pages: the pattern expects the "#g=1&p=N" pagination fragment.
    Rule(
        LinkExtractor(allow=[r'.*/de/watches/find-rolex(/.*)?\.html#g=1(&p=\d*)?$']),
        follow=True
    ),
)

How can I make Scrapy ignore the # inside the URL and visit the given URL?

Recommended answer

Scrapy performs HTTP requests. The data after '#' in a URL (the fragment) is not part of an HTTP request; it stays on the client side, where this site's JavaScript uses it.
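The split between what goes into the request and what stays in the browser can be checked with Python's standard library. The snippet below is a minimal sketch using the URLs from the question; it relies only on `urldefrag` from `urllib.parse` and is not specific to Scrapy.

```python
from urllib.parse import urldefrag

# URLs from the question: the pagination lives in the fragment.
start_urls = [
    "http://www.rolex.de/de/watches/find-rolex.html#g=1",
    "http://www.rolex.de/de/watches/find-rolex.html#g=0&p=2",
    "http://www.rolex.de/de/watches/find-rolex.html#g=0&p=3",
]

# urldefrag separates the part that goes into the HTTP request
# from the fragment, which only the client (JavaScript) sees.
request_url, fragment = urldefrag(start_urls[1])
print(request_url)  # http://www.rolex.de/de/watches/find-rolex.html
print(fragment)     # g=0&p=2

# Without their fragments, all three start URLs are the same resource.
unique_requests = {urldefrag(url).url for url in start_urls}
print(len(unique_requests))  # 1
```

All three start URLs therefore map to the same HTTP request, which is consistent with the log showing only the first page.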

As suggested in the comments, the site loads the data using AJAX.

Moreover, the AJAX call itself is not paginated: the site downloads the whole list of watches as JSON in a single request, and the pagination is then done in JavaScript.

So, you can use the Network tab of your web browser's developer tools to find the request that obtains the JSON data, and perform a similar request instead of requesting the HTML page.

Note, however, that you cannot use LinkExtractor for JSON data. You should simply parse the response with Python's json module and iterate over the URLs there.
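Here is a minimal sketch of that last step, under clearly stated assumptions: the answer does not give the real JSON endpoint or its structure, so the `watches` key, the `url` field, the helper name `extract_watch_urls`, and the sample payload below are all hypothetical placeholders. Inspect the actual response in the Network tab and adjust the keys accordingly.

```python
import json
from urllib.parse import urljoin

# Page from the question, used only to resolve relative links.
BASE_URL = "http://www.rolex.de/de/watches/find-rolex.html"


def extract_watch_urls(json_text, base_url=BASE_URL):
    """Return absolute URLs for every entry in the JSON payload.

    Assumes a hypothetical structure: {"watches": [{"url": "..."}, ...]}.
    """
    data = json.loads(json_text)
    urls = []
    for entry in data.get("watches", []):
        url = entry.get("url")
        if url:
            # Resolve relative paths against the listing page.
            urls.append(urljoin(base_url, url))
    return urls


# Hypothetical payload, for illustration only.
sample = '{"watches": [{"url": "/de/watches/datejust/m126334-0014.html"}]}'
print(extract_watch_urls(sample))
```

Inside a Scrapy callback, the same function could be fed `response.text`, yielding a `scrapy.Request(url, callback=self.parse_item)` for each returned URL.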
