python requests & beautifulsoup bot detection
Problem description
I'm trying to scrape all the HTML elements of a page using requests and BeautifulSoup. I'm using the ASIN (Amazon Standard Identification Number) to fetch a page's product details. My code is as follows:
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")
print(soup)
But the output doesn't show the entire HTML of the page, so I can't go on to work with the product details. Any help on this?
Edit 1:
The given answer shows that the output is the markup of the bot detection page. I researched a bit and found two ways to get past it:
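- I might need to add headers to the request, but I don't understand what the header values should be.
- Use selenium.
Now my question is: do both approaches work equally well?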
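One way to confirm that the truncated output really is the bot detection page, rather than a partially loaded product page, is to look for tell-tale phrases in the returned markup. The sketch below is only a heuristic: the function name `looks_like_bot_check` and the marker strings are assumptions based on what Amazon's robot-check page has contained in the past, and they may not match what the site serves today.

```python
# Hypothetical marker phrases; these are assumptions, not a documented API.
BOT_CHECK_MARKERS = (
    "robot check",
    "enter the characters you see below",
    "api-services-support@amazon.com",
)


def looks_like_bot_check(html: str) -> bool:
    """Return True if the HTML contains any of the assumed marker phrases."""
    lowered = html.lower()
    return any(marker in lowered for marker in BOT_CHECK_MARKERS)


if __name__ == "__main__":
    sample = "<html><head><title>Robot Check</title></head><body></body></html>"
    print(looks_like_bot_check(sample))
```

With the code from the question, this could be called as `looks_like_bot_check(str(soup))` before doing any further parsing.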
Recommended answer
As some of the comments already suggested, if you need to interact with JavaScript on a page, it is better to use selenium. However, regarding your first approach using a header:
import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
# Send a browser-like User-Agent instead of the default python-requests one.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
# Parse the returned HTML.
soup = BeautifulSoup(response.text, "html.parser")
This User-Agent string is a bit old, but it should still work. By sending it, you make your request look as if it came from a normal web browser. If you use requests without such a header, your code is basically telling the server that the request comes from Python, which many servers reject right away.
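To see what "telling the server that the request comes from Python" means in practice, the following sketch inspects the User-Agent header without sending anything over the network. It uses the standard library's `urllib.request`, which the code in the question calls via `urlopen`; requests behaves similarly and identifies itself as `python-requests/<version>` by default. The constant name `BROWSER_UA` is just for illustration.

```python
from urllib.request import Request, build_opener

# Browser-like User-Agent string taken from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/39.0.2171.95 Safari/537.36"
)

# A fresh opener carries urllib's default headers, including its User-Agent.
default_headers = dict(build_opener().addheaders)
default_ua = default_headers["User-agent"]
print("default:", default_ua)  # starts with "Python-urllib/"

# A Request built with an explicit header overrides that default when sent.
url = "http://www.amazon.com/dp/" + "B004CNH98C"
request = Request(url, headers={"User-Agent": BROWSER_UA})
print("custom: ", request.get_header("User-agent"))
```

Passing `request` to `urlopen` would then send the browser-like header, just as `requests.get(url, headers=headers)` does.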
Another option is fake-useragent, a package that supplies real-world browser User-Agent strings; it may be worth a try as well.
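A rough sketch of how fake-useragent might be plugged into the same headers dictionary follows. It assumes the package is installed (`pip install fake-useragent`) and uses its `UserAgent().random` attribute; the helper name `build_headers` and the fallback string are illustrative additions, used only when the package is missing or fails to load its data.

```python
# Fallback User-Agent (the one from the answer), used if fake-useragent
# is not installed or cannot load its data.
FALLBACK_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/39.0.2171.95 Safari/537.36"
)


def build_headers() -> dict:
    """Return request headers with a browser-like User-Agent."""
    try:
        from fake_useragent import UserAgent  # third-party, optional

        user_agent = UserAgent().random
    except Exception:  # package missing, or its data could not be loaded
        user_agent = FALLBACK_UA
    return {"User-Agent": user_agent}


headers = build_headers()
print(headers["User-Agent"])
```

The resulting dictionary can be passed straight to `requests.get(url, headers=headers)`.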