Requests using Beautiful Soup get blocked


Question

I am a freshman at Carnegie Mellon who is completely lost in his first term project.

When I make requests using Beautiful Soup, I get blocked as a "bot".

import requests
from bs4 import BeautifulSoup

# Fetch the thread page and parse the returned HTML with lxml
reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/")
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)

Then I get messages from Reddit saying that they suspect me of being a bot.

  1. What are possible solutions that work through Beautiful Soup? (I have already tried Scrapy with its Crawlera, but I could not use it due to my lack of Python knowledge.) I don't mind if it is a paid service, as long as it is intuitive enough for a beginner to use.

Thank you very much for your help.

Sincerely,

Isaac Lee

Answer

There can be various reasons for being blocked as a bot.

As you are using the requests library "as is", the most probable reason for the block is a missing User-Agent header.

A first line of defense against bots and scrapers is to check whether the User-Agent header comes from one of the major browsers, and to block all non-browser user agents.
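
To see why this check trips on an unmodified requests client, you can print the headers the library sends by default; the stock User-Agent identifies the client as python-requests rather than a browser. A minimal sketch using nothing beyond the requests library itself:

import requests

# requests identifies itself as "python-requests/<version>" by default,
# which many sites treat as an automated-client signature.
print(requests.utils.default_headers()['User-Agent'])
# e.g. python-requests/2.31.0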

Short version: try this:

import requests
from bs4 import BeautifulSoup

# Start from the library's default headers and override the User-Agent
# so the request looks like it comes from a regular browser.
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/", headers=headers)
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)
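
Once the header is in place, a quick way to confirm the block is gone is to inspect the HTTP status code before parsing. A small follow-up sketch, continuing from the code above (treating 429 as the block signal is an assumption based on how Reddit typically rate-limits clients):

# 200 means the page came back normally; 429 ("Too Many Requests")
# is what Reddit typically returns when it throttles or blocks a client.
if reddit1Link.status_code == 200:
    print("Request succeeded")
else:
    print("Still blocked or throttled, status:", reddit1Link.status_code)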

Detailed explanation: Sending "User-agent" using Requests library in Python

