Tweepy Connection broken: IncompleteRead - best way to handle exception? or, can threading help avoid?


Question

I am using tweepy to handle a large Twitter stream (following 4,000+ accounts). The more accounts I add to the stream, the more likely I am to get this error:

    Traceback (most recent call last):
      File "myscript.py", line 2103, in <module>
        main()
      File "myscript.py", line 2091, in main
        twitter_stream.filter(follow=USERS_TO_FOLLOW_STRING_LIST, stall_warnings=True)
      File "C:\Python27\lib\site-packages\tweepy\streaming.py", line 445, in filter
        self._start(async)
      File "C:\Python27\lib\site-packages\tweepy\streaming.py", line 361, in _start
        self._run()
      File "C:\Python27\lib\site-packages\tweepy\streaming.py", line 294, in _run
        raise exception
    requests.packages.urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read, 2000 more expected)', IncompleteRead(0 bytes read, 2000 more expected))

Obviously that is a thick firehose; empirically, it is too thick to handle. Based on researching this error on Stack Overflow, as well as the empirical trend that the more accounts I follow, the faster this exception occurs, my hypothesis is that this is my fault: my processing of each tweet takes too long and/or my firehose is too thick. I get that.

But notwithstanding that setup, I still have two questions that I can't seem to find solid answers for.

1. Is there a way to simply handle this exception, accept that I will miss some tweets, but keep the script running? I figure maybe it misses a tweet (or many tweets), but if I can live without 100% of the tweets I want, then the script/stream can still go on, ready to catch the next tweet whenever it can.

I've tried this exception handling, which was recommended in a similar question on Stack Overflow:

    from urllib3.exceptions import ProtocolError

    while True:
        try:
            twitter_stream.filter(follow=USERS_TO_FOLLOW_STRING_LIST, stall_warnings=True)

        except ProtocolError:
            continue

But unfortunately for me (perhaps I implemented it incorrectly, but I don't think I did), that did not work. I get exactly the same error as before, with or without that recommended exception handling code in place.
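One possible reason the `except` clause never fires, offered as an assumption rather than a diagnosis: the traceback names `requests.packages.urllib3.exceptions.ProtocolError`, and in older `requests` releases that bundle their own copy of urllib3, that class can be a different object from `urllib3.exceptions.ProtocolError` in a separately installed urllib3. The sketch below builds the list of exceptions from whichever locations are importable and reconnects after a short wait. `resolve_retry_exceptions`, `run_with_reconnect`, `start_stream`, and the retry and delay values are hypothetical names chosen for illustration; `start_stream` stands in for a call such as `twitter_stream.filter(...)`.

```python
import importlib
import time


def resolve_retry_exceptions():
    """Collect exception classes that may signal a broken stream."""
    found = []
    # Look in both the copy bundled with older requests and standalone urllib3.
    for name in ("requests.packages.urllib3.exceptions", "urllib3.exceptions"):
        try:
            found.append(importlib.import_module(name).ProtocolError)
        except (ImportError, AttributeError):
            pass  # that package layout is not present in this environment
    try:
        from http.client import IncompleteRead  # Python 3
    except ImportError:
        from httplib import IncompleteRead      # Python 2
    found.append(IncompleteRead)
    return tuple(set(found))


def run_with_reconnect(start_stream, retry_on, max_retries=5, delay=1.0):
    """Call start_stream(); on a listed exception, wait and call it again."""
    retries = 0
    while True:
        try:
            return start_stream()
        except retry_on:
            retries += 1
            if retries > max_retries:
                raise  # give up and surface the last error
            time.sleep(delay * retries)  # wait a little longer each time


# Usage, assuming twitter_stream and USERS_TO_FOLLOW_STRING_LIST from the question:
# run_with_reconnect(
#     lambda: twitter_stream.filter(follow=USERS_TO_FOLLOW_STRING_LIST,
#                                   stall_warnings=True),
#     resolve_retry_exceptions())
```

Tweets sent while the connection is down are still missed; the loop only keeps the script alive, and the pause between attempts avoids reconnecting in a tight loop.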


2. I have never implemented queues and/or threading in my Python code. Would this be a good time for me to try to implement that? I don't know anything about queues/threading, but I am imagining the following.

Could I have the tweets written raw (before any processing) to memory, a database, or something similar on one thread, and then have a second thread do the processing of those tweets as soon as it's ready? I figure that way, at least, it takes my post-processing of each tweet out of the equation as a limiting factor on the bandwidth of the firehose I am reading. Then if I still get the error, I can cut back on who I am following, etc.

I have watched some threading tutorials, but figured it might be worth asking whether that works with this tweepy/Twitter setup. I am not confident in my understanding of the problem I have or of how threading might help, so I figured I could ask for advice on whether it would indeed help me here.

If this idea is valid, is there a simple piece of example code someone could share to point me in the right direction?

Recommended answer

I think I solved this problem by finally completing my first queue/thread implementation. I don't know enough to say whether this is the best way to do it, but I think this way does work. Using the code below, I now build up a queue of new tweets and can handle them as I wish from the queue, rather than falling behind and losing my connection with tweepy.

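    import tweepy
    from Queue import Queue  # Python 2; on Python 3 use: from queue import Queue
    from threading import Thread

    class My_Parser(tweepy.StreamListener):

        def __init__(self, q=None):
            super(My_Parser, self).__init__()
            # Create the queue per instance; a Queue() default in the
            # signature would be shared by every listener.
            self.q = q if q is not None else Queue()
            num_worker_threads = 4
            for i in range(num_worker_threads):
                t = Thread(target=self.do_stuff)
                t.daemon = True  # workers exit together with the main thread
                t.start()

        def on_data(self, data):
            # Only enqueue here, so the stream reader is never held up.
            self.q.put(data)

        def do_stuff(self):
            while True:
                # do_whatever is a placeholder for your own processing function.
                do_whatever(self.q.get())
                self.q.task_done()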

I did continue digging into the IncompleteRead error for a while, and I tried several more exception-handling solutions using the URL and HTTP libraries, but I struggled with that. And I think the queueing approach may have some benefits anyway beyond just keeping the connection alive (for one, it won't lose data).
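The pattern in the answer can be tried without Twitter access. The sketch below is a stand-in built on assumptions: `TweetBuffer`, `handle`, `num_workers`, and `maxsize` are hypothetical names, and a plain loop plays the role of tweepy calling `on_data`. Unlike the answer's unbounded queue, which keeps everything at the cost of memory, this variant uses a bounded queue and drops items when the workers fall behind, which is one way to accept missing some tweets while the reader never blocks.

```python
import threading
import traceback

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2


class TweetBuffer(object):
    """Hand raw items to worker threads without blocking the producer."""

    def __init__(self, handle, num_workers=4, maxsize=1000):
        self.handle = handle              # caller-supplied processing function
        self.q = queue.Queue(maxsize=maxsize)
        self.dropped = 0                  # updated only by the producer thread
        for _ in range(num_workers):
            t = threading.Thread(target=self._work)
            t.daemon = True               # workers exit with the main thread
            t.start()

    def put(self, raw):
        """Enqueue raw and return True, or count a drop and return False."""
        try:
            self.q.put_nowait(raw)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _work(self):
        while True:
            raw = self.q.get()
            try:
                self.handle(raw)
            except Exception:
                traceback.print_exc()     # keep the worker alive after a bad item
            finally:
                self.q.task_done()        # always mark done so join() cannot hang

    def join(self):
        """Block until every accepted item has been processed."""
        self.q.join()


if __name__ == "__main__":
    results = []
    lock = threading.Lock()

    def handle(raw):
        with lock:
            results.append(raw.upper())

    buf = TweetBuffer(handle, num_workers=2, maxsize=100)
    for i in range(10):
        buf.put("tweet %d" % i)           # stands in for on_data(data)
    buf.join()
    print(len(results), buf.dropped)
```

In a listener like the one above, `on_data` would call `buf.put(data)` and nothing else; the `dropped` counter then gives a rough measure of how far processing lags behind the stream.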

Hope this helps someone. Haha.
