Web scraping with Python using BeautifulSoup: 429 error


Problem description

First I have to say that I'm quite new to web scraping with Python. I'm trying to scrape data using these lines of code:

import requests
from bs4 import BeautifulSoup
baseurl = 'https://name_of_the_website.com'
html_page = requests.get(baseurl).text
soup = BeautifulSoup(html_page, 'html.parser')
print(soup)

As output I do not get the expected HTML page, but another HTML page that says: "Misbehaving Content Scraper. Please use robots.txt. Your IP has been rate limited."

To check the problem I wrote:

try:
    page_response = requests.get(baseurl, timeout=5)
    if page_response.status_code == 200:
        html_page = requests.get(baseurl).text
        soup = BeautifulSoup(html_page, 'html.parser')
    else:
        print(page_response.status_code)
except requests.Timeout as e:
    print(str(e))

Then I get 429 (too many requests).

What can I do to handle this problem? Does it mean I cannot print the HTML of the page, and does it prevent me from scraping any of the page's content? Should I rotate the IP address?

Solution

If you are only hitting the page once and getting a 429, it's probably not that you are hitting them too much. You can't be sure the 429 error is accurate; it's simply what their web server returned. I've seen pages return a 404 response code when the page was fine, and a 200 response code on legitimately missing pages, just a misconfigured server. They may simply return 429 to any bot. Try changing your User-Agent to Firefox, Chrome, or "Robot Web Scraper 9000" and see what you get. Like this:

requests.get(baseurl, headers = {'User-agent': 'Super Bot Power Level Over 9000'})

to declare yourself as a bot or

requests.get(baseurl, headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})

if you want to mimic a browser more closely. Note all the version numbers in the browser-mimicking string; they were current at the time of writing, but you may need later version numbers. Just find the user agent of the browser you use; this page will tell you what it is:

https://www.whatismybrowser.com/detect/what-is-my-user-agent

Some sites return better, more easily parsed markup if you just say you are a bot; for others it's the opposite. It's basically the wild wild west, so you have to try different things.
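To make that trial-and-error approach concrete, here is a minimal sketch (the example URL is the placeholder from the question and the User-Agent strings are just the two shown above) that requests the same page with different User-Agent headers and prints the status code each one gets back:

import requests

baseurl = 'https://name_of_the_website.com'  # placeholder URL from the question

# Candidate User-Agent strings: an obvious bot name and a browser-like one.
user_agents = [
    'Super Bot Power Level Over 9000',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
]

for ua in user_agents:
    response = requests.get(baseurl, headers={'User-agent': ua}, timeout=5)
    # See which identity the server is happier with.
    print(ua[:40], '->', response.status_code)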

Another pro tip: you may have to write your code to use a "cookie jar", or some other way to accept cookies. Usually it is just an extra line in your request, but I'll leave that for another Stack Overflow question :)
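If you do need cookies, one common way (a sketch only, not necessarily what this particular site requires) is to use a requests.Session, which keeps a cookie jar for you across requests:

import requests

# A Session stores cookies set by the server and sends them back on later requests.
session = requests.Session()
session.headers.update({'User-agent': 'Super Bot Power Level Over 9000'})

first = session.get('https://name_of_the_website.com', timeout=5)  # placeholder URL
print(first.status_code, session.cookies.get_dict())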

If you are indeed hitting them a lot, you need to sleep between calls. It's a server-side response completely controlled by them. You will also want to investigate how your code interacts with robots.txt; that's a file usually at the root of the web server with the rules it would like your spider to follow.

You can read about that here: Parsing Robots.txt in python
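To make those last two points concrete, here is a minimal sketch, with a placeholder URL list and an arbitrary delay, that checks robots.txt using the standard-library urllib.robotparser and sleeps between calls:

import time
import urllib.robotparser

import requests

baseurl = 'https://name_of_the_website.com'  # placeholder from the question
pages = [baseurl + '/page1', baseurl + '/page2']  # hypothetical URLs to fetch

# Read the site's robots.txt and check whether fetching each URL is allowed.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(baseurl + '/robots.txt')
rp.read()

for url in pages:
    if not rp.can_fetch('*', url):
        print('robots.txt disallows', url)
        continue
    response = requests.get(url, timeout=5)
    print(url, response.status_code)
    time.sleep(5)  # arbitrary pause; tune it to whatever the site tolerates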

Spidering the web is fun and challenging; just remember that you could be blocked at any time, by any site, for any reason. You are their guest, so tread nicely :)
