Requests using Beautiful Soup gets blocked
Question
I am a freshman at Carnegie Mellon who is completely lost in his first term project.
When I make requests using Beautiful Soup, I get blocked as a "bot".
import requests
from bs4 import BeautifulSoup

reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/")
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)
Then I get messages from Reddit saying that they suspect me of being a bot.
- What are possible solutions using Beautiful Soup? (I have already tried Scrapy with its Crawlera, but could not use it due to my lack of Python knowledge.) I don't mind if it is a paid service, as long as it is intuitive enough for a beginner to use.
Thank you very much for your help.

Regards,
Isaac Lee
Answer
There can be various reasons for being blocked as a bot.

As you are using the requests library "as is", the most probable cause of the block is a missing User-Agent header.

A first line of defense against bots and scraping is to check the User-Agent header for being from one of the major browsers and to block all non-browser user agents.
Short version: try the following:
import requests
from bs4 import BeautifulSoup

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/", headers=headers)
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)
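If you want to confirm that the header is actually attached before hitting Reddit, you can prepare the request locally and inspect it without sending anything. A small sketch (the URL is the one from the question):

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
}

# Build the request object without sending it, then inspect the
# headers exactly as they would go over the wire.
req = requests.Request(
    'GET',
    'https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/',
    headers=headers,
)
prepared = req.prepare()
print(prepared.headers['User-Agent'])
```

This prints the browser string you set, so you know the block (if it still happens) is not caused by a missing header.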
Detailed explanation: Sending "User-agent" using the Requests library in Python