BeautifulSoup isn't working while web scraping Amazon


Question

I'm new to web scraping and I am trying to use basic skills on Amazon. I want to write a script that finds the top 10 "Today's Greatest Deals" along with their prices, ratings, and other information.

Every time I try to find a specific tag using find() with a class specified, it returns None. However, the actual HTML does have that tag. On manual inspection I found that about half of the HTML isn't being shown in the output terminal. The body and html tags do close, but a huge chunk of code inside the body tag is missing.

The last line of code displayed is:

<!--[endif]---->

and then the body tag closes.

Here is the code I'm trying:

from bs4 import BeautifulSoup as bs
import requests

source = requests.get('https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals')
soup = bs(source.text, 'html.parser')

print(soup.prettify())
#On printing this it misses some portion of html

article = soup.find('div', class_ = 'a-row dealContainer dealTile')
print(article)
#On printing this it shows 'None'

Ideally, this should give me the code within that div tag so that I can go on to get the name of the product. However, the output just shows None, and printing the whole document without tags shows that a huge chunk of the HTML is missing.

And of course, the information I need is in the missing HTML.

Is Amazon blocking my request? Please help.

Answer

The User-Agent request header contains a characteristic string that lets network protocol peers identify the application type, operating system, software vendor, or software version of the requesting user agent. Validating the User-Agent header on the server side is a common practice, so be sure to use a valid browser User-Agent string to avoid being blocked.

(Source: http://go-colly.org/articles/scraping_related_http_headers/)
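This is exactly why the plain requests call gets flagged: by default, requests announces itself as python-requests rather than a browser. You can inspect the default headers it sends like this:

```python
import requests

# requests identifies itself as "python-requests/<version>" by default,
# which server-side User-Agent validation can trivially detect.
default_ua = requests.utils.default_headers()['User-Agent']
print(default_ua)  # e.g. python-requests/2.31.0
```

Any headers you pass to requests.get() are merged on top of these defaults, so supplying your own User-Agent overrides the telltale one.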

The only thing you need to do is set a legitimate user agent, so add headers that emulate a browser:

# This is a standard user-agent of Chrome browser running on Windows 10 
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' } 

Example:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
# ... your code here ...

Additionally, you can send a fuller set of headers to look even more like a legitimate browser. Add some more headers like this:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
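Once the full page actually arrives, find() with class_ works as expected. Here is a minimal offline sketch using made-up markup shaped like the deal tile from the question (the tag structure and the dealTitle/dealPrice class names are illustrative assumptions, not Amazon's real markup, which changes frequently):

```python
from bs4 import BeautifulSoup

# Illustrative HTML only; Amazon's real deal-page classes differ and change often.
html = '''
<div class="a-row dealContainer dealTile">
  <span class="dealTitle">Sample product</span>
  <span class="dealPrice">$19.99</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Passing a multi-class string to class_ matches the exact class attribute value.
article = soup.find('div', class_='a-row dealContainer dealTile')
print(article.find('span', class_='dealTitle').get_text(strip=True))  # → Sample product
```

If article is still None against the live site even with browser-like headers, print the response status code and length first; a short or non-200 response means the request was rejected before the selector ever had a chance to match.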

