Scraping links with BeautifulSoup from all pages in Amazon results in error


Problem description

I'm trying to scrape product URLs from the Amazon webshop by going through every results page.

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

products = set()
for i in range(1, 21):
    url = 'https://www.amazon.fr/s?k=phone%2Bcase&page=' + str(i)
    response = requests.get(url, headers=headers)

    soup = BeautifulSoup(response.content, 'html.parser')

    print(soup) # prints the HTML content saying Error on Amazon's side

    links = soup.select('a.a-link-normal.a-text-normal')

    for tag in links:
        url_product = 'https://www.amazon.fr' + tag.attrs['href']
        products.add(url_product)

Instead of getting the content of the page, I get a "Sorry, something went wrong on our end" HTML error page. What is the reason behind this? How can I successfully bypass this error and scrape the products?

Recommended answer

Regarding your question: be aware that Amazon does not allow automated access to its data. You can confirm this by checking the response status via r.status_code; a blocked request returns the error message:

To discuss automated access to Amazon data please contact api-services-support@amazon.com
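A minimal sketch of that check: the helper below flags a response as blocked when the status code is not 200 or when the body contains the block-page message quoted above (the function name and threshold are illustrative, not part of the original answer):

```python
def is_blocked(status_code: int, body: str) -> bool:
    """Return True if the response looks like Amazon's bot-block page."""
    return status_code != 200 or 'api-services-support@amazon.com' in body

# Usage with requests (network call omitted here):
# r = requests.get(url, headers=headers)
# if is_blocked(r.status_code, r.text):
#     ...  # back off, rotate proxies, or switch to the API
```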

Therefore, you can either use the Amazon API, or pass a list of proxies to the GET request via the proxies parameter.
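A sketch of the proxy approach, assuming you have working proxies of your own (the addresses below are placeholders, not real endpoints): requests accepts a dict mapping scheme to proxy URL, so one entry from the list can be picked per request.

```python
import random

# Placeholder proxy list -- replace with proxies you actually control.
list_proxies = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'},
    {'http': 'http://10.10.1.11:3128', 'https': 'http://10.10.1.11:1080'},
]

# Pick one proxy mapping per request to spread the load.
proxy = random.choice(list_proxies)
# r = requests.get(url, headers=headers, proxies=proxy)
```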

Here's a way to pass headers to Amazon without getting blocked; this works:

import requests
from bs4 import BeautifulSoup

headers = {
    'Host': 'www.amazon.fr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}

for page in range(1, 21):
    # The URL must be an f-string, otherwise the literal text "{item}" is sent.
    r = requests.get(
        f'https://www.amazon.fr/s?k=phone+case&page={page}&ref=sr_pg_{page}', headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.findAll('a', attrs={'class': 'a-link-normal a-text-normal'}):
        print(f"https://www.amazon.fr{link.get('href')}")
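To collect unique product URLs as in the original question, rather than printing duplicates, the relative href values can be normalized with urljoin and stored in a set. A small sketch, with static HTML standing in for Amazon's response:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Stand-in for one Amazon results page; note the duplicated link.
html = '''<a class="a-link-normal a-text-normal" href="/dp/B001">case 1</a>
<a class="a-link-normal a-text-normal" href="/dp/B001">case 1 again</a>
<a class="a-link-normal a-text-normal" href="/dp/B002">case 2</a>'''

soup = BeautifulSoup(html, 'html.parser')
# The set deduplicates; urljoin resolves relative hrefs against the site root.
products = {urljoin('https://www.amazon.fr', a['href'])
            for a in soup.select('a.a-link-normal.a-text-normal')}
print(sorted(products))
# → ['https://www.amazon.fr/dp/B001', 'https://www.amazon.fr/dp/B002']
```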


