Scraping links with BeautifulSoup from all pages in Amazon results in error
Question
I'm trying to scrape product URLs from the Amazon webshop by going through every results page.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
           "Accept-Encoding": "gzip, deflate",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

products = set()

for i in range(1, 21):
    url = 'https://www.amazon.fr/s?k=phone%2Bcase&page=' + str(i)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content)
    print(soup)  # prints the HTML content saying Error on Amazon's side
    links = soup.select('a.a-link-normal.a-text-normal')
    for tag in links:
        url_product = 'https://www.amazon.fr' + tag.attrs['href']
        products.add(url_product)
Instead of getting the content of the page, I get a "Sorry, something went wrong on our end" HTML error page. What is the reason behind this? How can I successfully bypass this error and scrape the products?
Answer
Be aware that Amazon does not allow automated access to its data. You can double-check this by inspecting the response status via r.status_code, which leads you to this error message:

To discuss automated access to Amazon data please contact api-services-support@amazon.com

Therefore, you can either use the official Amazon API, or pass a list of proxies to the GET request via the proxies parameter.
Here's the correct way to pass headers to Amazon without getting blocked, and it works:
import requests
from bs4 import BeautifulSoup

headers = {
    'Host': 'www.amazon.fr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}

for page in range(1, 21):
    # Note the f-string prefix: without it the literal text "{item}" is sent.
    r = requests.get(
        f'https://www.amazon.fr/s?k=phone+case&page={page}&ref=sr_pg_{page}',
        headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.findAll('a', attrs={'class': 'a-link-normal a-text-normal'}):
        print(f"https://www.amazon.fr{link.get('href')}")
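To tie this back to the question's products set, the same selector can be applied and the results deduplicated. Here is a self-contained sketch that parses a small hand-made HTML snippet instead of a live Amazon response:

```python
from bs4 import BeautifulSoup

# A minimal HTML snippet standing in for one Amazon results page.
html = '''
<div>
  <a class="a-link-normal a-text-normal" href="/dp/B001">Case 1</a>
  <a class="a-link-normal a-text-normal" href="/dp/B002">Case 2</a>
  <a class="a-link-normal a-text-normal" href="/dp/B001">Case 1 (repeat)</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
products = set()
for tag in soup.select('a.a-link-normal.a-text-normal'):
    # A set silently drops the duplicate /dp/B001 link.
    products.add('https://www.amazon.fr' + tag['href'])

print(sorted(products))
# → ['https://www.amazon.fr/dp/B001', 'https://www.amazon.fr/dp/B002']
```

In a real run, this loop body would replace the print statement inside the page loop above, accumulating unique product URLs across all 20 pages.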