Python Requests getting ('Connection aborted.', BadStatusLine("''",)) error

Question

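import os
import sys

import requests


def download_torrent(url):
    # Save into the current directory, named after the value following 'title='.
    fname = os.getcwd() + '/' + url.split('title=')[-1] + '.torrent'
    try:
        # A scheme is prepended, so url is presumably protocol-relative ('//host/...').
        schema = 'http:'
        r = requests.get(schema + url, stream=True)
        with open(fname, 'wb') as f:
            # Stream the body to disk in 1 KiB chunks.
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()
    except requests.exceptions.RequestException as e:
        # OutColors is defined elsewhere in the full script.
        print('\n' + OutColors.LR + str(e))
        sys.exit(1)

    return fname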

In that block of code I am getting an error when I run the full script. When I go to actually download the torrent, I get:

('Connection aborted.', BadStatusLine("''",))

I only posted the block of code that I think is relevant above. The entire script is below. It's from pantuts, but I don't think it's maintained any longer, and I am trying to get it running with python3. From my research, the error might mean I'm using http instead of https, but I have tried both.

Original script

Recommended answer

The error you get indicates that the host isn't responding in the expected manner. In this case, it's because the host detects that you're trying to scrape it and deliberately disconnects you.
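
To make "not responding in the expected manner" concrete, here is a minimal sketch that reproduces the same class of error locally, without touching the real site. It assumes Python 3.5 or later, where http.client raises RemoteDisconnected (a subclass of BadStatusLine) when the server closes the connection before sending a status line. The helper names serve_and_hang_up and fetch_from_silent_server are made up for this illustration.

```python
import http.client
import socket
import threading


def serve_and_hang_up(server_sock):
    """Accept one connection, read the request, then close without replying."""
    conn, _ = server_sock.accept()
    with conn:
        data = b""
        # Read until the end of the request headers so the close is a clean one.
        while b"\r\n\r\n" not in data:
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    # Leaving the with-block closes the socket: no status line is ever sent.


def fetch_from_silent_server():
    """Return the exception raised when the server hangs up without a response."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server_sock.listen(1)
    port = server_sock.getsockname()[1]
    worker = threading.Thread(target=serve_and_hang_up, args=(server_sock,))
    worker.start()

    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/")
        client.getresponse()  # the status line read comes back empty
    except http.client.BadStatusLine as exc:
        return exc
    finally:
        client.close()
        worker.join()
        server_sock.close()
    return None


if __name__ == "__main__":
    print(repr(fetch_from_silent_server()))
```

requests wraps this low-level exception in its own ConnectionError, which is where the ('Connection aborted.', ...) tuple in the message comes from.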

If you try your requests code with this URL from a test website: http://mirror.internode.on.net/pub/test/5meg.test1, you'll see that it downloads normally.

To get around this, fake your user agent. Your user agent identifies your web browser, and web hosts commonly check it to detect bots.

Use the headers argument of requests.get to set your user agent. Here's an example that tells the web host you're Firefox.

headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0' }
r = requests.get(url, headers=headers)
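
Putting the pieces together, a streaming download that sends this header might look like the sketch below. It is an illustration under assumptions, not the original script's code: download_file, BROWSER_UA, the optional session parameter, and the 30-second timeout are all made up here, and there is no guarantee that a given site will accept the request.

```python
import requests

# Browser-like User-Agent string, copied from the example above.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) "
              "Gecko/20100101 Firefox/24.0")


def download_file(url, dest_path, session=None, chunk_size=1024):
    """Stream url to dest_path while sending a browser-like User-Agent."""
    if session is None:
        session = requests.Session()
    response = session.get(
        url,
        headers={"User-Agent": BROWSER_UA},
        stream=True,   # do not load the whole body into memory
        timeout=30,    # assumed value; avoids hanging forever on a silent host
    )
    # Raise on 4xx/5xx instead of silently saving an error page to disk.
    response.raise_for_status()
    with open(dest_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
    return dest_path
```

Passing your own requests.Session lets cookies persist across calls. The function raises requests.exceptions.RequestException subclasses on failure, so the caller can keep the try/except from the question.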

There are lots of other discrepancies¹ between bots and human-operated browsers that web hosts can check for, but the user agent is one of the easiest and most common ones.

If you want your scraper to be harder to detect, use a headless browser such as headless Chrome² (or ghost.py if you want to stick with Python), which you can trust to behave like a real browser (because it is one!).

Footnotes:

¹ Other possible checks include whether images are downloaded, whether page resources are fetched in the normal order, whether pages are downloaded faster than a human could read them, and whether cookies are set properly. Google also flags mouse movements it deems insufficiently human-like.

² Headless Chrome is the most capable headless browser as of 2018, but if its weight is a problem for you, its slightly outdated predecessors, PhantomJS and ghost.py, are lighter and still usable.
