Long chain of exceptions in scrapy splash application


Problem description

My scrapy application is outputting this long chain of exceptions, and I am failing to see what the issue is; the last one has me especially confused.

Before I explain why, here is the chain:

2020-11-04 17:38:58,394:ERROR:Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 1347, in getresponse
    response.begin()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\util\retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\packages\six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 1347, in getresponse
    response.begin()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 35, in _update
    r = requests.get(url=self.URL)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\scrapy_splash\middleware.py", line 167, in process_start_requests
    for req in start_requests:
  File "C:\Users\lguarro\Documents\Work\SearchEngine_Pure\SearchEngine_Pure\spiders\SearchEngine.py", line 36, in start_requests
    user_agent = self.ua.random_nomobile
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 120, in random_nomobile
    return self.pickrandom(exclude_mobile=True)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 83, in pickrandom
    self.update()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 59, in update
    self._update()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 38, in _update
    self.logger.error(r.content.decode('utf-8'))
UnboundLocalError: local variable 'r' referenced before assignment

Now the last exception is complaining about:

UnboundLocalError: local variable 'r' referenced before assignment

The only code of mine in that trace is the SearchEngine.py file, which doesn't even have a variable 'r', so I am very confused. Here is the implementation of start_requests, from which the error occurs:

def start_requests(self):
    user_agent = self.ua.random_nomobile  # Exception raised here
    rec = self.mh.FindIdleOneWithNoURLs()
    if rec:
        self.logger.info("Starting url scrape for company, %s using user agent: %s",
                         rec["Company"], user_agent)
        script = self.template.substitute(useragent=user_agent, searchquery=rec["Company"])
        yield SplashRequest(
            url=self.url, callback=self.parse, endpoint="execute",
            args={'lua_source': script},
            meta={'RecID': rec["_id"], 'Company': rec["Company"]},
            errback=self.logerror,
        )

It is complaining about the first line in that function, for which I see no problem.

In case it is relevant, I will also add that my script seemed to be running fine just yesterday, but today I had to reset my Docker configuration (which the Splash container runs in), and since then I haven't been able to run my script smoothly.

Answer

I found out what was causing the issue! In fact, there was no error on my part; instead, it is a bug inside the shadow-useragent library.

The library periodically makes an API request to fetch a list of the most-used user agents. The server behind this API is down, and the authors of shadow-useragent did not handle the resulting exception properly.
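
You can see the bug in the traceback frames above (shadow_useragent/core.py, lines 35 and 38). Reconstructed from those frames, the failing method follows this pattern; this is a sketch of what the traceback shows, not the library's exact source:

def _update(self):
    try:
        r = requests.get(url=self.URL)  # the API server is down, so this raises
                                        # a ConnectionError and 'r' is never assigned
        ...                             # normal processing of the response
    except Exception:
        # Bug: this handler assumes 'r' was assigned, but the assignment above
        # never completed, hence the UnboundLocalError on 'r'
        self.logger.error(r.content.decode('utf-8'))

Since start_requests calls random_nomobile, which triggers this update, the UnboundLocalError surfaces in the spider even though the spider itself never defines 'r'.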

Fortunately, shadow-useragent does cache the most recent list of user agents it was able to receive. So my solution was to edit the shadow-useragent code to bypass the update function entirely and use the cached list (inside the data.pk file) beyond its scheduled update, as sketched below. If anyone else runs into this issue, this is a temporary workaround you can use until that server is up and running again... hopefully soon!
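
Here is a minimal sketch of that workaround, monkey-patching from your own code rather than editing the installed package. ShadowUserAgent and _update are the names visible in the traceback, so verify them against your installed version; it also assumes the library has already loaded its cached data.pk list by the time it is instantiated:

from shadow_useragent import ShadowUserAgent

# Turn the remote refresh into a no-op so the library keeps serving the
# user agents it cached in data.pk the last time the API was reachable.
ShadowUserAgent._update = lambda self: None

ua = ShadowUserAgent()
user_agent = ua.random_nomobile  # now served from the cached list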
