Issue with scraping JS rendered page with Scrapy and Splash
I'm trying to scrape this page, which, according to Chrome, includes the following HTML:
<p class="title">
Orange Paired
</p>
This is my spider:
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "splash"
    allowed_domains = ["phillips.com"]
    start_urls = ["https://www.phillips.com/detail/BRIDGET-RILEY/UK010417/19"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='render.json',
                args={'har': 1, 'html': 1}
            )

    def parse(self, response):
        print("1. PARSED", response.real_url, response.url)
        print("2. ", response.css("title").extract())
        print("3. ", response.data["har"]["log"]["pages"])
        print("4. ", response.headers.get('Content-Type'))
        print("5. ", response.xpath('//p[@class="title"]/text()').extract())
This is the output of scrapy runspider spiders/splash_spider.py:
2017-08-31 09:48:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
1. PARSED http://localhost:8050/render.json https://www.phillips.com/detail/BRIDGET-RILEY/UK010417/19
2. ['<title>PHILLIPS : Bridget Riley, Orange Paired</title>', '<title>Page 1</title>']
3. [{'title': 'PHILLIPS : Bridget Riley, Orange Paired', 'pageTimings': {'onContentLoad': 3832, '_onStarted': 1, '_onIframesRendered': 4667, 'onLoad': 4664, '_onPrepareStart': 4664}, 'id': '1', 'startedDateTime': '2017-08-31T07:48:18.986240Z'}]
4. b'text/html; charset=utf-8'
5. []
2017-08-31 09:48:23 [scrapy.core.engine] INFO: Closing spider (finished)
Why am I getting an empty output for 5?
What I also don't understand is that Splash doesn't seem to render the page linked above, yet it does render the top-level homepage.
A good starting point in such cases is the FAQ section of the Splash documentation. It turns out that in your case you need to disable Private mode for Splash, either via the --disable-private-mode startup option for Docker, or by setting splash.private_mode_enabled = false in your Lua script.
Once you disable Private mode, the page renders correctly.
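As a rough illustration of the Lua-script route, the sketch below builds the arguments for Splash's execute endpoint with Private mode turned off. The helper name build_execute_args and the wait argument are made up for this example and are not part of the original question. It assumes scrapy-splash fills in the url argument itself, and that returning an html key from the script lets response.xpath() work on the rendered page.

```python
import json

# Lua script for Splash's "execute" endpoint. It disables Private mode
# before loading the page, then returns the rendered HTML.
LUA_SCRIPT = """
function main(splash)
  splash.private_mode_enabled = false
  assert(splash:go(splash.args.url))
  assert(splash:wait(splash.args.wait))
  return {html = splash:html()}
end
"""


def build_execute_args(wait=1.0):
    """Hypothetical helper: arguments to pass along with the request.

    `wait` is an illustrative extra argument read by the Lua script as
    splash.args.wait; the `url` argument is expected to be added by
    scrapy-splash.
    """
    return {"lua_source": LUA_SCRIPT, "wait": wait}


# In the spider from the question, start_requests could then yield
# (requires scrapy-splash and a running Splash instance):
#
#     yield SplashRequest(url, self.parse, endpoint='execute',
#                         args=build_execute_args())

if __name__ == "__main__":
    # Show the payload that would be sent to Splash.
    print(json.dumps(build_execute_args(), indent=2))
```

Note that with the execute endpoint the script above returns only the HTML, so the response.data["har"] lookup in the original parse method would have to be dropped, or the script extended to return HAR data as well.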