Unable to make my script process locally created server response in the right way


Problem description

I've used a script to run Selenium locally so that I can make use of the response (derived from Selenium) within my spider.

This is the web service where Selenium runs locally:

from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)

class Selenium(Resource):
    # A single WebDriver instance shared by every incoming request.
    _driver = None

    @staticmethod
    def getDriver():
        # Lazily create the headless Chrome driver on first use.
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")

            Selenium._driver = webdriver.Chrome(options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.getDriver()

    def get(self):
        # Fetch the requested page with Selenium and return its rendered HTML.
        url = str(request.args['url'])

        self.driver.get(url)

        return make_response(self.driver.page_source)

api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)
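
As a quick sanity check (a sketch, assuming the service is running on Flask's default port 5000), you can query it directly:

import requests

# A sketch: query the local Selenium service above and inspect the
# rendered HTML it returns (assumes the default Flask port, 5000).
resp = requests.get('http://127.0.0.1:5000/',
                    params={'url': 'https://stackoverflow.com'})
print(resp.status_code)  # 200 once the driver has rendered the page
print(len(resp.text))    # size of the rendered HTML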

This is my Scrapy spider, which uses that response to parse the title from a webpage.

import scrapy
from urllib.parse import quote
from scrapy.crawler import CrawlerProcess

class StackSpider(scrapy.Spider):
    name = 'stackoverflow'
    url = 'https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&pageSize=50'
    base = 'https://stackoverflow.com'

    def start_requests(self):
        # Route the listing page through the local Selenium service.
        link = 'http://127.0.0.1:5000/?url={}'.format(quote(self.url))
        yield scrapy.Request(link, callback=self.parse)

    def parse(self, response):
        # Collect every question link and fetch it through the service too.
        for item in response.css(".summary .question-hyperlink::attr(href)").getall():
            nlink = self.base + item
            link = 'http://127.0.0.1:5000/?url={}'.format(quote(nlink))
            yield scrapy.Request(link, callback=self.parse_info, dont_filter=True)

    def parse_info(self, response):
        # Extract the question title from the detail page.
        item = response.css('h1[itemprop="name"] > a::text').get()
        yield {"title": item}

if __name__ == '__main__':
    c = CrawlerProcess()
    c.crawl(StackSpider)
    c.start()

The problem is that the above script gives me the same title multiple times, then another title, and so on.

What change should I make so that my script works the right way?
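
One likely cause (an assumption, not stated in the answer below) is that every Scrapy request funnels into the one shared WebDriver in the Flask service, so overlapping requests can navigate the browser before an earlier request has read page_source. A minimal sketch of a workaround under that assumption is to serialize the spider's requests:

class StackSpider(scrapy.Spider):
    name = 'stackoverflow'
    # Serialize requests so the single shared Selenium driver never
    # handles two overlapping navigations (assumed diagnosis).
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
    }

Alternatively, the Flask service could wrap driver.get() and driver.page_source in a threading.Lock so concurrent workers cannot interleave.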

Recommended answer

I ran both scripts, and they run as intended. So my findings:

  1. downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError: there is no way to get past this error without the server's permission, here i.e. eBay.
  2. Logs from Scrapy:

2019-05-25 07:28:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 72,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 64,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8,
 'downloader/request_bytes': 55523,
 'downloader/request_count': 81,
 'downloader/request_method_count/GET': 81,
 'downloader/response_bytes': 2448476,
 'downloader/response_count': 9,
 'downloader/response_status_count/200': 9,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2019, 5, 25, 1, 58, 41, 234183),
 'item_scraped_count': 8,
 'log_count/DEBUG': 90,
 'log_count/INFO': 9,
 'request_depth_max': 1,
 'response_received_count': 9,
 'retry/count': 72,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 64,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 8,
 'scheduler/dequeued': 81,
 'scheduler/dequeued/memory': 81,
 'scheduler/enqueued': 131,
 'scheduler/enqueued/memory': 131,
 'start_time': datetime.datetime(2019, 5, 25, 1, 56, 57, 751009)}
2019-05-25 07:28:41 [scrapy.core.engine] INFO: Spider closed (shutdown)

You can see only 8 items were scraped; these are just the logos and other unrestricted things.

  3. Server logs (truncated; a Content-Security-Policy error reported while rendering the page):

s://.ebaystatic.com http://.ebay.com https://*.ebay.com". Either the 'unsafe-inline' keyword, a hash ('sha256-40GZDfucnPVwbvI/Q1ivGUuJtX8krq8jy3tWNrA/n58='), or a nonce ('nonce-...') is required to enable inline execution. ", source: https://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=323815597324&t=0&tid=10&category=169291&seller=wardrobe-ltd&excSoj=1&excTrk=1&lsite=0&ittenable=false&domain=ebay.com&descgauge=1&cspheader=1&oneClk=1&secureDesc=1 (1)

eBay does not allow you to scrape it.

So how can you complete your task?

  1. Every time before scraping, check /robots.txt for the site. For eBay it is http://www.ebay.com/robots.txt, and you can see almost everything is disallowed (a programmatic check is sketched after the excerpt below).

User-agent: *
Disallow: /*rt=nc
Disallow: /b/LH_
Disallow: /brw/
Disallow: /clp/
Disallow: /clt/store/
Disallow: /csc/
Disallow: /ctg/
Disallow: /ctm/
Disallow: /dsc/
Disallow: /edc/
Disallow: /feed/
Disallow: /gsr/
Disallow: /gwc/
Disallow: /hcp/
Disallow: /itc/
Disallow: /lit/
Disallow: /lst/ng/
Disallow: /lvx/
Disallow: /mbf/
Disallow: /mla/
Disallow: /mlt/
Disallow: /myb/
Disallow: /mys/
Disallow: /prp/
Disallow: /rcm/
Disallow: /sch/%7C
Disallow: /sch/*LH_
Disallow: /sch/aop/
Disallow: /sch/ctg/
Disallow: /sl/node
Disallow: /sme/
Disallow: /soc/
Disallow: /talk/
Disallow: /tickets/
Disallow: /today/
Disallow: /trylater/
Disallow: /urw/write-review/
Disallow: /vsp/
Disallow: /ws/
Disallow: /sch/*modules=SEARCH_REFINEMENTS_MODEL_V2
Disallow: /b/modules=SEARCH_REFINEMENTS_MODEL_V2
Disallow: /itm/_nkw
Disallow: /itm/?fits
Disallow: /itm/&fits
Disallow: /cta/
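
As referenced above, the check can also be done programmatically with Python's standard library; a minimal sketch (the tested paths are just examples):

from urllib.robotparser import RobotFileParser

# Ask eBay's robots.txt whether a generic crawler may fetch a given path.
rp = RobotFileParser()
rp.set_url('http://www.ebay.com/robots.txt')
rp.read()

print(rp.can_fetch('*', 'https://www.ebay.com/feed/'))  # False: /feed/ is disallowed above
print(rp.can_fetch('*', 'https://www.ebay.com/'))       # True: '/' itself is not disallowed

Note that Scrapy can enforce this automatically with ROBOTSTXT_OBEY = True in its settings.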

Therefore, go to https://developer.ebay.com/api-docs/developer/static/developer-landing.html and check their docs; there is easier example code on their site to get the items you need without scraping.
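
For illustration only, a search against eBay's Browse API might look roughly like the sketch below; the endpoint path, parameters, and the OAUTH_TOKEN placeholder are assumptions to verify against eBay's developer docs:

import requests

# Hypothetical sketch of an eBay Browse API search; verify the endpoint
# and headers against eBay's docs before relying on this.
OAUTH_TOKEN = '...'  # an application access token from eBay's OAuth flow

resp = requests.get(
    'https://api.ebay.com/buy/browse/v1/item_summary/search',
    params={'q': 'drone', 'limit': 3},
    headers={'Authorization': 'Bearer {}'.format(OAUTH_TOKEN)},
)
for item in resp.json().get('itemSummaries', []):
    print(item.get('title'))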
