python requests & beautifulsoup bot detection

Views: 67 | Published: 2020/9/20 6:27:28 | Tags: python html web-scraping beautifulsoup python-requests

Question

I'm trying to scrape all the HTML elements of a page using requests and BeautifulSoup. I build the URL from the product's ASIN (Amazon Standard Identification Number) to get the product details. My code is as follows:

from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")
print(soup)

But the output doesn't contain the entire HTML of the page, so I can't go on to extract the product details. Any help on this?

Following the answer given, I can see that the output is the markup of the bot detection page. I researched a bit and found two ways to get past it:

1. Add a header to the request, but I couldn't work out what the header's value should be.
2. Use Selenium.

Now my question is: do both approaches work equally well?

Recommended answer

As some of the comments already suggested, if you need to interact with JavaScript on the page, Selenium is the better choice. However, regarding your first approach, using a header:

import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text,"html.parser")

These headers are a bit old but should still work. With them, your request claims to come from an ordinary web browser. Without such a header, the client identifies itself as Python (requests sends a User-Agent of the form python-requests/<version>), and many servers reject such requests right away.
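To see the difference without touching the network, here is a minimal sketch using the standard library's urllib.request, which the question's code relied on. It reads the User-Agent that the default opener would send and compares it with a request that carries the browser-like value from the answer. The names URL, BROWSER_UA, default_ua and sent_ua are just illustrative, and nothing is actually sent to the server.

```python
import urllib.request

URL = "http://www.amazon.com/dp/B004CNH98C"
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/39.0.2171.95 Safari/537.36"
)

# Headers the default opener adds when the request does not set them itself.
default_headers = dict(urllib.request.build_opener().addheaders)
default_ua = default_headers["User-agent"]  # of the form "Python-urllib/<version>"

# Build (but do not send) a request that carries a browser-like User-Agent.
req = urllib.request.Request(URL, headers={"User-Agent": BROWSER_UA})
sent_ua = req.get_header("User-agent")  # urllib stores header names capitalized

print("default:", default_ua)
print("custom: ", sent_ua)
```

Passing req to urlopen would then send the custom value instead of the Python one. Whether a given site accepts it is a separate matter that this sketch cannot show.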

Another option is fake-useragent, a library that supplies User-Agent strings for you; it may be worth a try.
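The idea behind such a library can be sketched with the standard library alone: keep a pool of User-Agent strings and pick one per request. This is not fake-useragent's API, only a hand-rolled stand-in. USER_AGENTS and random_headers are hypothetical names, and the strings in the pool are illustrative examples rather than a maintained list.

```python
import random

# Illustrative pool of browser User-Agent strings (examples only, not kept up to date).
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36",
]


def random_headers(rng=random):
    """Return a headers dict whose User-Agent is picked at random from the pool."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


headers = random_headers()
print(headers["User-Agent"])
```

The resulting dict can be passed as headers=headers to requests.get, exactly like the fixed headers dict above. With fake-useragent installed, its documented usage, UserAgent().random, would take the place of the hand-written pool.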
