Why can't I scrape Amazon with BeautifulSoup?


Question

Here is my Python code:

import urllib2
from bs4 import BeautifulSoup

# Python 2: fetch the page and parse it
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

It works for google.com and many other websites, but it doesn't work for amazon.com.

I can open amazon.com in my browser, but the resulting "soup" is still none.

I also find that it cannot scrape appannie.com. However, rather than returning an empty result, the code raises an error:

HTTPError: HTTP Error 503: Service Temporarily Unavailable 

So I suspect that Amazon and App Annie block scraping.
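That is very likely what is happening: many sites inspect the User-Agent header, and the default string sent by urllib gives a script away immediately. A minimal sketch (Python 3 naming, where urllib2 became urllib.request) that prints the tell-tale default agent string without touching the network:

```python
from urllib.request import OpenerDirector

# urllib (urllib2 in Python 2) announces itself as "Python-urllib/<version>"
# unless you override it -- a string that servers can answer with a 503
# or an empty page instead of real HTML.
opener = OpenerDirector()
default_agent = dict(opener.addheaders).get("User-agent")
print(default_agent)
```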

Please try it yourself instead of just downvoting the question :(

Thanks

Answer

Add a header, and it will work.

from bs4 import BeautifulSoup
import requests

url = "http://www.amazon.com/"

# Send a browser-like User-Agent so the server does not reject
# the request as coming from a script
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print(soup)
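The same fix also works with the standard library alone. A sketch assuming Python 3 (where urllib2 was renamed urllib.request), building the request with the spoofed header; calling urlopen(req) on it would then fetch the page that the bare urlopen(url) could not:

```python
from urllib.request import Request

url = "http://www.amazon.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) '
           'AppleWebKit/537.36 (KHTML, like Gecko) '
           'Chrome/54.0.2840.71 Safari/537.36'}

# Attach the browser-like User-Agent before opening the URL;
# urllib.request.urlopen(req) would then return the real page
# instead of a 503.
req = Request(url, headers=headers)
print(req.get_header("User-agent"))
```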

