Issue scraping with Beautiful Soup
Problem description
I have scraped other websites with this same technique, but it does not seem to work with this one.
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.weatheronline.co.uk/weather/maps/current?LANG=en&DATE=1354104000&CONT=euro&LAND=UK&KEY=UK&SORT=1&INT=06&TYP=sonne&ART=tabelle&RUBRIK=akt&R=310&CEL=C"
page=urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
print soup
The output should be the content of the web page, but instead I am just getting this:
GIF89a (followed by some symbols I can't copy here)
Any ideas about what the problem is and how I should proceed?
Recommended answer
"but I want to know why I am getting a gif when accessing the url like that, while when I access it via my browser I get the website perfectly."
Because these guys are smart and don't want their website to be accessed outside a web browser. By default urllib2 identifies itself with a User-agent of the form Python-urllib/x.y, and the site apparently answers such requests with a GIF image instead of the page. What you need to do is fake a known browser by adding a User-agent to the request headers. Here is a modified example that will work:
>>> import urllib2
>>> opener = urllib2.build_opener()
>>> opener.addheaders = [('User-agent', 'Mozilla/5.0')]
>>> url = "http://www.weatheronline.co.uk/weather/maps/current?LANG=en&DATE=1354104000&CONT=euro&LAND=UK&KEY=UK&SORT=1&INT=06&TYP=sonne&ART=tabelle&RUBRIK=akt&R=310&CEL=C"
>>> response = opener.open(url)
>>> page = response.read()
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(page)
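For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched roughly as below. This is only an illustration under assumptions, not part of the original answer: the helper names build_request, looks_like_gif and fetch_soup are hypothetical, the bs4 package (the successor to the old BeautifulSoup module) is assumed to be installed, and whether the site still responds this way has not been checked.

```python
import urllib.request

# Signatures that every GIF file starts with.
GIF_MAGIC = (b"GIF87a", b"GIF89a")


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that announces itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def looks_like_gif(data):
    """Return True if the body starts with a GIF signature rather than HTML."""
    return data[:6] in GIF_MAGIC


def fetch_soup(url):
    """Download url with the faked User-Agent and parse it (hypothetical helper)."""
    # Assumption: the third-party bs4 package is installed
    # (pip install beautifulsoup4). Imported here so the helpers
    # above work without it.
    from bs4 import BeautifulSoup

    with urllib.request.urlopen(build_request(url)) as response:
        page = response.read()
    if looks_like_gif(page):
        # Same symptom as in the question: an image came back instead of the page.
        raise ValueError("received a GIF instead of HTML; the request was probably rejected")
    return BeautifulSoup(page, "html.parser")
```

Calling fetch_soup(url) with the URL from the question would then either return the parsed page or raise an error that makes the "GIF89a" symptom explicit instead of printing binary data.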