scrapy-splash usage for rendering javascript


Question


This is a follow-up to my previous question.

I installed splash and scrapy-splash.

And also followed the instructions for scrapy-splash.
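For reference, the scrapy-splash setup instructions amount to adding settings like these to the project's settings.py (a sketch based on the scrapy-splash README; the Splash URL depends on where your Splash instance is running):

```python
# settings.py -- scrapy-splash wiring (port 8050 is the Splash default; adjust as needed)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make duplicate filtering aware of Splash arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```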

I edited my code as follows:

import scrapy
from scrapy_splash import SplashRequest

class CityDataSpider(scrapy.Spider):
    name = "citydata"

    def start_requests(self):
        urls = [
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0',
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1',
            ]
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'citydata-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

But I still get the same output: only one HTML file is generated, and the result is only for http://www.city-data.com/advanced/search.php.

Is there anything wrong with the code, or does anyone have other suggestions?

Solution

First off, I want to clear up some possible points of confusion from your last question, which @paul trmbrth answered:

The URL fragment (i.e. everything including and after #body) is not sent to the server and only http://www.city-data.com/advanced/search.php is fetched

So for Scrapy, the requests to [...] and [...] are the same resource, so it's only fetched once. They differ only in their URL fragments.

URI standards dictate that the number sign (#) be used to indicate the start of the fragment, which is the last part of the URL. In most if not all browsers, nothing beyond the "#" is transmitted. However, it's fairly common for AJAX sites to use JavaScript's window.location.hash to grab the URL fragment and use it to execute additional AJAX calls. I bring this up because city-data.com does exactly this, which may be confusing you, since in a browser those two URLs do in fact bring back two different pages.

By default, Scrapy drops the URL fragment, so it reports both URLs as just "http://www.city-data.com/advanced/search.php" and filters out the second one as a duplicate.
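The fragment-stripping behaviour is easy to demonstrate with Python's standard library (a sketch, with the query strings shortened for readability):

```python
from urllib.parse import urldefrag

# Both "different" URLs collapse to the same resource once the fragment is dropped
urls = [
    'http://www.city-data.com/advanced/search.php#body?fips=0&p=0',
    'http://www.city-data.com/advanced/search.php#body?fips=0&p=1',
]
for url in urls:
    base, fragment = urldefrag(url)
    print(base)  # http://www.city-data.com/advanced/search.php (both times)
```

Since everything after "#" lives in the fragment, the entire query string here never reaches the server, which is why only one page is ever fetched.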


With all of that out of the way, there will still be a problem after you remove "#body" from the URLs, caused by the combination of page = response.url.split("/")[-2] and filename = 'citydata-%s.html' % page. Neither of your URLs redirects, so the URL you provide is what populates the response.url string.

Isolating that, we get the following:

>>> urls = [
...     'http://www.city-data.com/advanced/search.php?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0',
...     'http://www.city-data.com/advanced/search.php?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1',
... ]
>>> for url in urls:
...     print(url.split("/")[-2])
...
advanced
advanced

So, for both URLs, you're extracting the same piece of information, which means when you use filename = 'citydata-%s.html' % page you get the same filename, presumably 'citydata-advanced.html'. The second time parse() is called, you overwrite the first file.

Depending on what you're doing with the data, you could either change this to append to the file, or modify your filename variable to something unique such as:

from urllib.parse import urlparse, parse_qs  # Python 2: from urlparse import urlparse, parse_qs

import scrapy
from scrapy_splash import SplashRequest

class CityDataSpider(scrapy.Spider):

    [...]

    def parse(self, response):
        # parse_qs maps each parameter to a *list* of values, so take the first one
        page = parse_qs(urlparse(response.url).query).get('p', [''])[0]
        filename = 'citydata-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
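As a quick sanity check outside Scrapy (with the query strings shortened for readability), looking up the "p" parameter yields a distinct filename per page:

```python
from urllib.parse import urlparse, parse_qs

urls = [
    'http://www.city-data.com/advanced/search.php?fips=0&ps=20&p=0',
    'http://www.city-data.com/advanced/search.php?fips=0&ps=20&p=1',
]
for url in urls:
    # parse_qs returns {'fips': ['0'], 'ps': ['20'], 'p': ['0']} for the first URL,
    # so index [0] pulls the scalar value out of the list
    page = parse_qs(urlparse(url).query).get('p', [''])[0]
    print('citydata-%s.html' % page)
# citydata-0.html
# citydata-1.html
```

With the two pagination values producing different filenames, the second request no longer clobbers the first file.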
