Why can't I scrape Amazon by BeautifulSoup?
Problem Description
Here is my Python code:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup
It works for google.com and many other websites, but it doesn't work for amazon.com.
I can open amazon.com in my browser, but the resulting "soup" is still None.
Besides, I find that it cannot scrape appannie.com either. However, rather than returning None, the code raises an error:
HTTPError: HTTP Error 503: Service Temporarily Unavailable
So I suspect that Amazon and App Annie block scraping.
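A 503 here is the server actively refusing the request, not a parsing problem, so it helps to distinguish "blocked by the server" from "page is genuinely empty". A minimal sketch of that, using Python 3's `urllib` (the `urllib2` call in the question raises the equivalent exception; the `fetch` helper name is just for illustration):

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def fetch(url):
    """Return the page body as bytes, or None when the server refuses the request."""
    try:
        return urlopen(url).read()
    except HTTPError as e:
        # e.g. "blocked: HTTP Error 503: Service Temporarily Unavailable"
        print("blocked: HTTP Error %d: %s" % (e.code, e.reason))
        return None
```

Catching `HTTPError` lets you log the status code (503, 403, ...) instead of crashing, which makes it obvious when a site is rejecting your client rather than serving an empty page.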
Please do try it yourself instead of just downvoting the question :(
Thanks.
Recommended Answer
Add a header, and it will work:
from bs4 import BeautifulSoup
import requests
url = "http://www.amazon.com/"
# send a browser-like User-Agent so the server doesn't reject the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print(soup)
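If you would rather stay with the standard library than switch to `requests`, the same User-Agent trick works with `urllib.request`. A Python 3 sketch (the header value is just a plausible browser string, not anything Amazon specifically requires):

```python
from urllib.request import Request, urlopen

# the same browser-like User-Agent used in the requests-based answer
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/54.0.2840.71 Safari/537.36'}
req = Request("http://www.amazon.com/", headers=headers)
# urlopen(req) would now send the browser-like User-Agent with the request
```

The point in both versions is the same: the default client identifies itself as a script (`Python-urllib/x.y` or `python-requests/x.y`), and some sites return 503 or an empty page for such clients.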