Scrapy Splash 不会执行 lua 脚本 [英] Scrapy Splash won't execute lua script

查看:36
本文介绍了Scrapy Splash 不会执行 lua 脚本的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我遇到了 Lua 脚本拒绝执行的问题.从 ScrapyRequest 调用返回的响应似乎是一个 HTML 正文,而我期待一个文档标题.我假设 Lua 脚本永远不会被调用,因为它似乎对响应没有明显影响.我已经通过文档挖掘了很多,似乎无法弄清楚这里缺少什么.有人有什么建议吗?

I have ran across an issue in which my Lua script refuses to execute. The returned response from the ScrapyRequest call seems to be an HTML body, while i'm expecting a document title. I am assuming that the Lua script is never being called as it seems to have no apparent effect on the response. I have dug a lot through the documentation and can't quite seem to figure out what is missing here. Does anyone have any suggestions?

from urlparse import urljoin

import scrapy
from scrapy_splash import SplashRequest


GOOGLE_BASE_URL = 'https://www.google.com/'
GOOGLE_QUERY_PARAMETERS = '#q={query}'
GOOGLE_SEARCH_URL = urljoin(GOOGLE_BASE_URL, GOOGLE_QUERY_PARAMETERS)

GOOGLE_SEARCH_QUERY = 'example search query'


LUA_SCRIPT = """
function main(splash)
    assert(splash:go(splash.args.url))
    return splash:evaljs("document.title")
end
"""

SCRAPY_CRAWLER_NAME = 'google_crawler'
SCRAPY_SPLASH_ENDPOINT = 'render.html'
SCRAPY_ARGS = {
    'lua_source': LUA_SCRIPT
}


def get_search_url(query):
    return GOOGLE_SEARCH_URL.format(query=query)


class GoogleCrawler(scrapy.Spider):
    name=SCRAPY_CRAWLER_NAME
    search_url = get_search_url(GOOGLE_SEARCH_QUERY)

    def start_requests(self):

        response = SplashRequest(self.search_url,
            self.parse, endpoint=SPLASH_ENDPOINT, args=SCRAPY_ARGS)

        yield response


    def parse(self, response):
        doc_title = response.body_as_unicode()
        print doc_title

推荐答案

SplashRequest的'endpoint'参数必须是'execute'才能执行Lua脚本;在示例中它是render.html".

'endpoint' argument of SplashRequest must be 'execute' in order to execute a Lua script; it is 'render.html' in the example.

这篇关于Scrapy Splash 不会执行 lua 脚本的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆