Twisted Python Failure - Scrapy Issues

Problem Description

I am trying to use Scrapy to scrape this website's search results for any search query: http://www.bewakoof.com.

The website uses AJAX (in the form of XHR) to display the search results. I managed to track the XHR request, and you will notice it in my code below (inside the for loop, where I store the URL in temp and increment 'i' in the loop):

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re

query='shirt'
query1=query.replace(" ", "+")  


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.bewakoof.com"]

    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp=( "http://www.bewakoof.com/search/searchload/search_text/" + query + "/page_num/" + str(i) )
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        print 'hi'
        return [ Request(url = start_url) for start_url in start_urls ]
        print 'hi'

    def parse(self, response):
        print 'hi'
        print response
        items = []
        for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
            item = DmozItem()
            item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
            item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
            item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]

            item['mrp'] = item['current_price']

            item['offer'] = str('No additional offer available')

            item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
            item['outofstock_status'] = str('In Stock')
            items.append(item)


spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

Now, when I execute this, I get unexpected errors:

2015-07-09 11:46:01 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-09 11:46:01 [scrapy] INFO: Optional features available: ssl, http11
2015-07-09 11:46:01 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-09 11:46:02 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-09 11:46:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-09 11:46:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-09 11:46:02 [scrapy] INFO: Enabled item pipelines: 
hi
2015-07-09 11:46:02 [scrapy] INFO: Spider opened
2015-07-09 11:46:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-09 11:46:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-09 11:46:03 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:09 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] INFO: Closing spider (finished)
2015-07-09 11:46:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 9, 6, 16, 13, 793446),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 9, 6, 16, 2, 890066)}
2015-07-09 11:46:13 [scrapy] INFO: Spider closed (finished)

If you look at my code carefully, you will see that I have also set DOWNLOAD_DELAY=5, but it still gives the same errors as when I didn't set it. I also increased DOWNLOAD_DELAY to 10, and it still gives the same errors. I have read many questions related to this on Stack Overflow and on GitHub, but none of them seem to help.

I read in one of the answers that TOR with Polipo can help. However, I am a bit hesitant to use it, because I don't know whether it is legal to use the combination of TOR and Polipo to scrape websites with Scrapy (I don't want to run into any legal issues). That is the reason I preferred not to use it. So, if it is legal, please provide the code for my SPECIFIC CASE, using TOR and Polipo.

Or rather, if that is illegal, help me resolve it without using them.

Please help me resolve these errors!

EDIT:

This is my updated code:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re

query='shirt'
query1=query.replace(" ", "+")  


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()




class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.bewakoof.com"]

    def _monkey_patching_HTTPClientParser_statusReceived(self):

        from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError
        old_sr = HTTPClientParser.statusReceived
        def statusReceived(self, status):
            try:
                return old_sr(self, status)
            except ParseError, e:
                if e.args[0] == 'wrong number of parts':
                    return old_sr(self, status + ' OK')
                raise
        statusReceived.__doc__ == old_sr.__doc__
        HTTPClientParser.statusReceived = statusReceived




    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp = "http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1"
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        print 'hi'
        self._monkey_patching_HTTPClientParser_statusReceived()
        return [ Request(url = start_url) for start_url in start_urls ]
        print 'hi'

    def parse(self, response):
        print 'hi'
        print response
        items = []
        for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
            item = DmozItem()
            item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
            item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
            item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]

            item['mrp'] = item['current_price']

            item['offer'] = str('No additional offer available')

            item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
            item['outofstock_status'] = str('In Stock')
            items.append(item)

        print (items)

spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

And this is my updated output, as displayed on the terminal:

2015-07-10 13:06:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-10 13:06:00 [scrapy] INFO: Optional features available: ssl, http11
2015-07-10 13:06:00 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-10 13:06:01 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-10 13:06:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-10 13:06:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-10 13:06:01 [scrapy] INFO: Enabled item pipelines: 
hi
2015-07-10 13:06:01 [scrapy] INFO: Spider opened
2015-07-10 13:06:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-10 13:06:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-10 13:06:02 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:08 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:13 [scrapy] INFO: Closing spider (finished)
2015-07-10 13:06:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 10, 7, 36, 13, 11023),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 10, 7, 36, 1, 114912)}
2015-07-10 13:06:13 [scrapy] INFO: Spider closed (finished)

So, as you can see, the errors are still the same! :( Please help me resolve this!

UPDATE:

This is the output when I try to catch the exception, as @JoeLinux suggested:

>>> try:
...     fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
... except Exception as e:
...     e
... 
2015-07-10 17:51:13 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:14 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:15 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
ResponseFailed([<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>],)
>>> print e.reasons[0].getTraceback()
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
    why = selectable.doRead()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 214, in doRead
    return self._dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 220, in _dataReceived
    rval = self.protocol.dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 114, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 1523, in dataReceived
    self._parser.dataReceived(bytes)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 382, in dataReceived
    HTTPParser.dataReceived(self, data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 571, in dataReceived
    why = self.lineReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 271, in lineReceived
    self.statusReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 409, in statusReceived
    raise ParseError("wrong number of parts", status)
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')
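
Looking at the last frame of this traceback, the server appears to return a bare status line, 'HTTP/1.1 500', with no reason phrase, and Twisted's HTTP client rejects any status line that does not split into exactly three parts. A minimal illustration of that check (simplified from the statusReceived frame shown above, not the actual Twisted source) is:

from twisted.web._newclient import ParseError

status = 'HTTP/1.1 500'        # bare status line sent back by the server
parts = status.split(' ', 2)   # -> ['HTTP/1.1', '500'], only two parts
try:
    if len(parts) != 3:        # the parser expects version, code and reason phrase
        raise ParseError('wrong number of parts', status)
except ParseError as e:
    print(e)                   # same 'wrong number of parts' error as in the traceback

# A well-formed line such as 'HTTP/1.1 500 Internal Server Error' would
# split into three parts and parse without error.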

Solution

I got the same error

[<twisted.python.failure.Failure twisted.web._newclient.ParseError: (u'wrong number of parts', 'HTTP/1.1 302')>]

and it now works after applying the changes below.

I think you could try this:

  • in the method _monkey_patching_HTTPClientParser_statusReceived, change "from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError" to "from twisted.web._newclient import HTTPClientParser, ParseError";

  • in the method start_requests, call _monkey_patching_HTTPClientParser_statusReceived before yielding each request in start_urls (a consolidated sketch combining both changes follows this list), for example:

    def start_requests(self):
        for url in self.start_urls:
            self._monkey_patching_HTTPClientParser_statusReceived()
            yield Request(url, dont_filter=True)
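
Putting both changes together, the patched method and start_requests would look roughly like this (a sketch based on the spider in the question, with start_urls assumed to be built the same way as before):

import scrapy
from scrapy.http import Request


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    start_urls = [
        "http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1",
    ]

    def _monkey_patching_HTTPClientParser_statusReceived(self):
        # change 1: import from twisted.web._newclient, not scrapy.xlib.tx._newclient
        from twisted.web._newclient import HTTPClientParser, ParseError
        old_sr = HTTPClientParser.statusReceived

        def statusReceived(self, status):
            try:
                return old_sr(self, status)
            except ParseError as e:
                if e.args[0] == 'wrong number of parts':
                    # pad the bare status line (e.g. 'HTTP/1.1 500') with a reason phrase
                    return old_sr(self, status + ' OK')
                raise

        statusReceived.__doc__ = old_sr.__doc__
        HTTPClientParser.statusReceived = statusReceived

    def start_requests(self):
        # change 2: apply the patch before each request is issued
        for url in self.start_urls:
            self._monkey_patching_HTTPClientParser_statusReceived()
            yield Request(url, dont_filter=True)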

Hope it helps.
