How to rotate proxies with Python requests


Question

I'm trying to do some scraping, but I get blocked after every 4 requests. I have tried changing proxies, but the error is the same. What should I do to rotate them properly?

Here is the code I tried. First I get proxies from a free proxy list website. Then I make the request with the new proxy, but it doesn't work because I still get blocked.

from fake_useragent import UserAgent
import requests

def get_player(id,proxy):
    ua=UserAgent()
    headers = {'User-Agent':ua.random}

    url='https://www.transfermarkt.es/jadon-sancho/profil/spieler/'+str(id)

    try:
        print(proxy)
        r=requests.get(url,headers=headers,proxies=proxy)
    except:

....
code to manage the data
....

Getting the proxies

from bs4 import BeautifulSoup

def get_proxies():
    ua=UserAgent()
    headers = {'User-Agent':ua.random}
    url='https://free-proxy-list.net/'

    r=requests.get(url,headers=headers)
    page = BeautifulSoup(r.text, 'html.parser')

    proxies=[]

    # Each table row is treated as one proxy: first cell is the IP, second the port.
    for proxy in page.find_all('tr'):
        i=ip=port=0

        for data in proxy.find_all('td'):
            if i==0:
                ip=data.get_text()
            if i==1:
                port=data.get_text()
            i+=1

        # Rows without <td> cells (such as headers) leave ip and port at 0 and are skipped.
        if ip!=0 and port!=0:
            proxies+=[{'http':'http://'+ip+':'+port}]

    return proxies
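
A note on the dictionaries built above: requests chooses a proxy by the scheme of the target URL, so a mapping that only has an 'http' key is not applied to an https:// URL such as the transfermarkt one, and the request goes out from your own IP. The following is a minimal sketch of a mapping that covers both schemes. The helper name `build_proxy` is hypothetical, and it assumes the proxy server accepts HTTPS traffic, which not every free proxy does.

```python
def build_proxy(ip, port):
    """Return a requests-style proxies mapping for one proxy server.

    Hypothetical helper for illustration; not part of the original code.
    """
    address = 'http://' + ip + ':' + str(port)
    # requests looks up the proxy by the target URL's scheme, so an
    # https:// target needs an 'https' entry as well as the 'http' one.
    return {'http': address, 'https': address}


# Example with the address shown in the question.
proxy = build_proxy('88.12.48.61', '42365')
print(proxy)
```

The resulting dictionary can be passed as `proxies=proxy` to `requests.get` in the same way as before.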

Calling the functions

proxies=get_proxies()
for i in range(1,100):
    player=get_player(i,proxies[i//4])

....
code to manage the data  
....
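
The index `proxies[i//4]` in the loop above only works if the scraped list has at least 25 entries, since `i//4` reaches 24 for `i = 99`; a shorter list raises an `IndexError`. A sketch of the same "new proxy every 4 requests" idea with wrap-around is shown below. The function name `proxy_for_request` and the sample addresses are placeholders, not part of the original code.

```python
def proxy_for_request(proxies, request_index, requests_per_proxy=4):
    """Pick the proxy for a given request, switching every few requests.

    Hypothetical helper: wraps around to the first proxy when the list
    is exhausted instead of raising IndexError.
    """
    if not proxies:
        raise ValueError('no proxies available')
    slot = request_index // requests_per_proxy
    return proxies[slot % len(proxies)]


# Placeholder proxies, for illustration only.
sample = [
    {'http': 'http://10.0.0.1:8080'},
    {'http': 'http://10.0.0.2:8080'},
]
for i in range(10):
    print(i, proxy_for_request(sample, i))
```

In the loop from the question, this would replace `proxies[i//4]` with `proxy_for_request(proxies, i)`.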

I know the proxies are scraped correctly, because when I print them I see something like: {'http': 'http://88.12.48.61:42365'}. I would like to avoid getting blocked.

Recommended answer

I recently had this same issue, but using online proxy servers, as recommended in other answers, is always risky (from a privacy standpoint), slow, or unreliable.

Instead, you can use the requests-ip-rotator Python library to proxy traffic through AWS API Gateway, which gives you a new IP each time:
pip install requests-ip-rotator

This can be used as follows (for your site specifically):

import requests
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS

gateway = ApiGateway("https://www.transfermarkt.es")
gateway.start()

session = requests.Session()
session.mount("https://www.transfermarkt.es", gateway)

response = session.get("https://www.transfermarkt.es/jadon-sancho/profil/spieler/your_id")
print(response.status_code)

# Only run this line if you are no longer going to run the script, as it takes longer to boot up again next time.
gateway.shutdown() 

Combined with multithreading or multiprocessing, you'll be able to scrape the site in no time.
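
As a rough sketch of the multithreading idea, the standard library's `ThreadPoolExecutor` can fan requests out over several player ids. The function names `fetch_status` and `scrape_players` and the worker count are assumptions made for illustration; the `session` argument is expected to be the `requests.Session` mounted with the gateway in the code above. Sharing one session across threads is a common pattern, but requests does not formally guarantee it is thread-safe, so one session per thread is the more cautious choice.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_URL = 'https://www.transfermarkt.es/jadon-sancho/profil/spieler/'


def fetch_status(session, player_id):
    """Request one player page and return (player_id, HTTP status code)."""
    response = session.get(BASE_URL + str(player_id))
    return player_id, response.status_code


def scrape_players(session, player_ids, max_workers=8):
    """Fetch several player pages concurrently (hypothetical helper).

    Results come back in the same order as player_ids.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pid: fetch_status(session, pid), player_ids))
```

With the gateway session from the answer, a call such as `scrape_players(session, range(1, 100))` would return a list of `(id, status_code)` pairs.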

The AWS free tier provides you with 1 million requests per region, so this option will be free for all reasonable scraping.
