Requests using Beautiful Soup gets blocked


Problem Description

When I make requests using Beautiful Soup, I get blocked as a "bot".

import requests
from bs4 import BeautifulSoup

# Fetch the Reddit thread and parse the returned HTML with lxml.
reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/")
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)

Then I get messages from Reddit saying that they suspect me of being a bot.

What are possible solutions when using Beautiful Soup? (I have tried Scrapy with its Crawlera, but due to my lack of Python knowledge, I cannot use it.) I don't mind if it is a paid service, as long as it is "intuitive" enough for a beginner to use.

Recommended Answer

There can be various reasons for being blocked as a bot.

As you are using the requests library "as is", the most probable reason for the block is the User-Agent header: without one set explicitly, requests identifies itself as python-requests rather than as a browser.

A first line of defense against bots and scraping is to check whether the User-Agent header comes from one of the major browsers and to block all non-browser user agents.
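
As a quick sanity check (a minimal sketch, assuming only that the requests library is installed), the following prints the headers requests sends when none are supplied; the User-Agent there identifies the library itself rather than a browser, which is exactly what such a check blocks on:

import requests

# Print the headers requests uses when none are supplied explicitly.
# The 'User-Agent' entry is of the form "python-requests/2.x.y", which
# a server can easily recognise as a non-browser client.
print(requests.utils.default_headers())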

Short version: try this:

import requests
from bs4 import BeautifulSoup

# Start from the default headers and replace the User-Agent so the
# request identifies itself as a regular desktop browser.
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/", headers=headers)
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)
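
As a rough follow-up (a sketch reusing the reddit1Link and reddit1Content variables from the snippet above), you can check the HTTP status code and the parsed page title to confirm that Reddit served the thread rather than a block page:

# Assumes reddit1Link and reddit1Content from the snippet above.
# A 200 status and the real thread title suggest the request was treated
# as a normal browser visit instead of being blocked as a bot.
print(reddit1Link.status_code)
if reddit1Content.title is not None:
    print(reddit1Content.title.string)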

Detailed explanation: Sending "User-agent" using Requests library in Python
