Why would proxies fail in Scrapy, but make successful requests under the python-requests library

Problem Description

I have a list of, say, 100 proxies, and to test them I make a request to Google and check the response. When running these requests through python-requests, every request returns successfully, but when attempting the same thing under Scrapy, 99% of the time the proxies fail. Am I missing something, or am I using the proxies wrong in Scrapy?

The proxies are stored in a file in the following format:

http://123.123.123.123:8080
https://234.234.234.234:8080
http://321.321.321.321:8080
...

Here's the script I was using to test them with python-requests:

import time

import requests

# Load the proxy list, one proxy per line
proxyPool = []
with open("proxy_pool.txt", "r") as f:
    proxyPool = f.readlines()

proxyPool = [x.strip() for x in proxyPool]

for proxyItem in proxyPool:
    # Strip the http/s scheme from the entry, leaving "ip:port"
    proxy = proxyItem.rsplit("/")[-1].split(":")
    proxy = "{proxy}:{port}".format(proxy=proxy[0], port=proxy[1])
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36", }

    # A fresh session (and therefore a fresh cookie jar) for every proxy;
    # note that requests keys the proxies mapping by URL scheme
    # ("http"/"https"), not by "http://"/"https://"
    proxySession = requests.Session()
    proxySession.proxies = {"http": proxy, "https": proxy}
    proxySession.headers.update(headers)
    resp = proxySession.get("https://www.google.com/")

    if resp.status_code == 200:
        print(f"Requests with proxies: {proxySession.proxies} - Successful")
    else:
        print(f"Requests with proxies: {proxySession.proxies} - Unsuccessful")
    time.sleep(3)

and the spider for Scrapy:

import scrapy
from scrapy import Request


class ProxySpider(scrapy.Spider):
    name = "proxyspider"

    start_urls = ["https://www.google.com/"]

    def start_requests(self):
        with open("proxy_pool.txt", "r") as f:
            for proxy in f.readlines():
                proxy = proxy.strip()
                headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36", }

                # Route each request through its proxy via the proxy meta key
                yield Request(url=self.start_urls[0], callback=self.parse, headers=headers, meta={"proxy": proxy}, dont_filter=True)

    def parse(self, response):
        self.logger.info(f'Parsing: {response.url}')
        if response.status == 200:
            print(f"Requests with proxies: {response.meta['proxy']} - Successful")
        else:
            print(f"Requests with proxies: {response.meta['proxy']} - Unsuccessful")

Recommended Answer

In your code sample built with requests, you implemented multiple sessions (one session per proxy), so each proxy gets its own fresh cookie jar.

With Scrapy's default settings, however, the application will use a single cookiejar for all proxies, sending the same cookie data through each one. You need to use the cookiejar meta key in your requests.
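
A minimal sketch of this (assuming one jar per proxy, keyed by the proxy's position in the list): start_requests in the spider above can pass the index as the cookiejar value, and Scrapy's CookiesMiddleware will keep a separate cookie session for each distinct value.

    def start_requests(self):
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36", }
        with open("proxy_pool.txt", "r") as f:
            proxies = [line.strip() for line in f.readlines()]
        for i, proxy in enumerate(proxies):
            # Each distinct cookiejar value gets an isolated cookie session,
            # so cookies set via one proxy are never re-sent through another
            yield Request(url=self.start_urls[0], callback=self.parse, headers=headers,
                          meta={"proxy": proxy, "cookiejar": i}, dont_filter=True)

With cookies left enabled (Scrapy's default), a sessionId that the webserver sets through one proxy then stays confined to that proxy's jar instead of being replayed through the other 99.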

If a webserver receives requests from multiple IPs carrying a single sessionId in the cookie headers, that looks suspicious: the webserver can identify the traffic as a bot and ban all of the IPs used. That is probably exactly what happened here.
