Scrapy Splash Screenshots?
Question
I'm trying to scrape a site whilst taking a screenshot of every page. So far, I have managed to piece together the following code:
    import json
    import base64

    import scrapy
    from scrapy_splash import SplashRequest


    class ExtractSpider(scrapy.Spider):
        name = 'extract'

        def start_requests(self):
            url = 'https://stackoverflow.com/'
            splash_args = {
                'html': 1,
                'png': 1
            }
            yield SplashRequest(url, self.parse_result, endpoint='render.json', args=splash_args)

        def parse_result(self, response):
            png_bytes = base64.b64decode(response.data['png'])
            imgdata = base64.b64decode(png_bytes)
            filename = 'some_image.png'
            with open(filename, 'wb') as f:
                f.write(imgdata)
It reaches the site fine (for example, Stack Overflow) and returns data for png_bytes, but the file it writes is a broken image that will not load.
Is there a way to fix this, or alternatively find a more efficient solution? I have read that Splash Lua Scripts can do this, but have been unable to find a way to implement this. Thanks.
Recommended answer
You are decoding from base64 twice:
    png_bytes = base64.b64decode(response.data['png'])
    imgdata = base64.b64decode(png_bytes)
Decoding once is enough:
    def parse_result(self, response):
        imgdata = base64.b64decode(response.data['png'])
        filename = 'some_image.png'
        with open(filename, 'wb') as f:
            f.write(imgdata)
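If you want to catch this kind of mistake early, one option is to check the decoded bytes before writing them. The sketch below is not part of scrapy-splash. `decode_png` and `filename_for` are hypothetical helper names introduced here for illustration, and the hash-based naming scheme is only an assumption about how you might avoid overwriting `some_image.png` when saving one screenshot per page.

```python
import base64
import hashlib

# The first 8 bytes of every valid PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png(b64_png):
    """Decode a base64 PNG payload exactly once and sanity-check the result.

    Hypothetical helper: raises ValueError if the decoded bytes do not
    start with the PNG signature.
    """
    data = base64.b64decode(b64_png)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return data


def filename_for(url):
    """Hypothetical naming scheme: one file per URL, keyed by a short hash."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return digest[:12] + ".png"
```

Inside `parse_result` this could be used as `imgdata = decode_png(response.data['png'])`, followed by `filename = filename_for(response.url)`, assuming `response.url` holds the address of the page that was rendered.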