用Scrapy和Splash抓取JS呈现页面的问题 [英] Issue with scraping JS rendered page with Scrapy and Splash

查看:53
本文介绍了用Scrapy和Splash抓取JS呈现页面的问题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正尝试抓取

,但它呈现顶级主页

解决方案

在这种情况下,良好的起点是查看常见问题解答部分.事实证明,根据您的情况,您需要this page which includes following html according to chrome

<p class="title">

            Orange Paired

        </p>

this is my spider:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "splash"
    allowed_domains = ["phillips.com"]
    start_urls = ["https://www.phillips.com/detail/BRIDGET-RILEY/UK010417/19"]
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='render.json',
                args={'har': 1, 'html': 1}
            )
    def parse(self, response):
        print("1. PARSED", response.real_url, response.url)
        print("2. ",response.css("title").extract())
        print("3. ",response.data["har"]["log"]["pages"])
        print("4. ",response.headers.get('Content-Type'))
        print("5. ",response.xpath('//p[@class="title"]/text()').extract())

This is the output of scrapy runspider spiders/splash_spider.py

2017-08-31 09:48:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
1. PARSED http://localhost:8050/render.json https://www.phillips.com/detail/BRIDGET-RILEY/UK010417/19
2.  ['<title>PHILLIPS : Bridget Riley, Orange Paired</title>', '<title>Page 1</title>']
3.  [{'title': 'PHILLIPS : Bridget Riley, Orange Paired', 'pageTimings': {'onContentLoad': 3832, '_onStarted': 1, '_onIframesRendered': 4667, 'onLoad': 4664, '_onPrepareStart': 4664}, 'id': '1', 'startedDateTime': '2017-08-31T07:48:18.986240Z'}]
4.  b'text/html; charset=utf-8'
5.  []
2017-08-31 09:48:23 [scrapy.core.engine] INFO: Closing spider (finished)

Why am I getting an empty output for 5?

What I also don't understand is that Splash doesn't seem to render the page linked above

but it renders the top level homepage

解决方案

Good starting point in such cases is to look at FAQ section of Splash documentation. It turns out that in your case you need to disable Private mode for Splash, either via --disable-private-mode startup option for Docker, or by setting splash.private_mode_enabled = false in your LUA script.

Once you disable Private mode, the page renders correctly.

这篇关于用Scrapy和Splash抓取JS呈现页面的问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆