Read timeout when attempting to request a page

Problem Description

I am attempting to scrape websites, and I sometimes get this error. It is concerning because the error occurs randomly, but after I retry, it does not happen again.

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.somewebsite.com', port=443): Read timed out. (read timeout=None)

My code looks like the following

from bs4 import BeautifulSoup
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
import requests
import time  # needed for time.sleep below

software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value]
user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=100)
pages_to_scrape = ['https://www.somewebsite1.com/page', 'https://www.somewebsite2.com/page242']

for page in pages_to_scrape:
  time.sleep(2)  # wait a couple of seconds between requests
  page = requests.get(page, headers={'User-Agent': user_agent_rotator.get_random_user_agent()})
  soup = BeautifulSoup(page.content, "html.parser")
  # scrape info

As you can see from my code, I even use the time module to sleep the script for a couple of seconds before requesting another page, and I use a random User-Agent. I am not sure whether there is anything else I can do to make sure I never get the Read Timeout error.

I also came across this, but it seems they suggest adding additional values to the headers. I am not sure that is a generic solution, because the required values may differ from website to website. I also read in another SO post that we should base64-encode the request and retry; that went over my head, as I have no idea how to do that and the poster did not provide an example.
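For what it's worth, this kind of retry can also be automated instead of done by hand. Below is a minimal sketch (not from the original post) that mounts urllib3's Retry onto a requests session and sets an explicit timeout; the URL, retry counts, and timeout values are illustrative assumptions:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Session that transparently retries transient failures with backoff.
session = requests.Session()
retries = Retry(
    total=3,                                      # up to 3 retry attempts
    backoff_factor=1,                             # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],   # also retry on these status codes
)
session.mount("https://", HTTPAdapter(max_retries=retries))

try:
    # timeout=(connect, read) in seconds; illustrative values
    page = session.get("https://www.somewebsite1.com/page",
                       headers={'User-Agent': 'Mozilla/5.0'},
                       timeout=(5, 30))
except requests.exceptions.RequestException as exc:
    print(f"Giving up on this page: {exc}")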

Any advice from those who have experience with scraping would be highly appreciated.

Solution

Well, I've verified your issue. Basically, that site is behind the AkamaiGHost firewall, which you can see in the response headers:

curl -s -o /dev/null -D - https://www.uniqlo.com/us/en/men/t-shirts

It will block your requests if they come without a valid User-Agent, and the User-Agent should be stable; you don't need to change it on each request. You will also need to use requests.Session() to persist the session, which keeps the TCP layer from dropping packets. I've been able to send 1,000 requests within a second and didn't get blocked. I even verified whether it would block the request when I parsed the HTML source, but it didn't at all.

Note that I launched all my tests using Google DNS, which never causes latency in my threading that could lead the firewall to drop the requests and classify them as a DDOS attack. One more point to note: DO NOT USE timeout=None, as that will cause the request to wait forever for a response, while on the back end the firewall automatically detects any TCP listener in a pending state, drops it, and blocks the origin IP, which is you. That's based on the configured time :)

import requests
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup


def test(num):
    """Fetch the page once and return its <title>, or a failure message."""
    print(f"Thread# {num}")
    # Reuse one session per worker so the TCP connection is kept alive.
    with requests.Session() as req:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
        r = req.get(
            "https://www.uniqlo.com/us/en/men/t-shirts", headers=headers)
        if r.status_code == 200:
            soup = BeautifulSoup(r.text, 'html.parser')
            return soup.title.text
        else:
            return f"Thread# {num} Failed"


# Fire 30 requests across 20 worker threads.
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = executor.map(test, range(1, 31))
    for future in futures:
        print(future)
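Following the warning above about timeout=None, here is a minimal sketch (not part of the original answer) of the same request with an explicit timeout, so a stalled response fails fast and can be handled:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
try:
    # (connect timeout, read timeout) in seconds -- never leave this as None
    r = requests.get("https://www.uniqlo.com/us/en/men/t-shirts",
                     headers=headers, timeout=(5, 15))
    print(r.status_code)
except requests.exceptions.ReadTimeout:
    print("The server accepted the connection but did not send a response in time")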

