Does Wikipedia allow URL fetching via Google App Engine?
I am writing a Python web app and in it I plan to leverage Wikipedia. When trying out some URL Fetching code I was able to fetch both Google and Facebook (via Google App Engine services), but when I attempted to fetch wikipedia.org, I received an exception. Can anyone confirm that Wikipedia does not accept these types of page requests? How can Wikipedia distinguish between me and a user?
Code snippet (it's Python!):
import os
import urllib2
from google.appengine.ext import webapp
from google.appengine.ext.webapp import template

class MainHandler(webapp.RequestHandler):
    def get(self):
        url = "http://wikipedia.org"
        try:
            result = urllib2.urlopen(url)
        except urllib2.URLError, e:
            result = 'ahh the sky is falling'
        template_values = {
            'test': result,
        }
        path = os.path.join(os.path.dirname(__file__), 'index.html')
        self.response.out.write(template.render(path, template_values))
Wikipedia bans the default urllib2 User-Agent, so the request fails with an HTTP 403 response.
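To check that the exception really is the 403 described here and not a network failure, it can help to inspect the error before replacing it with a generic message. The following is a minimal sketch, not part of the original answer: `describe_fetch_error` is a hypothetical helper name, and the fallback import is only there so the same code also runs on Python 3, where these classes live in `urllib.error`.

```python
try:
    from urllib2 import HTTPError, URLError          # Python 2
except ImportError:
    from urllib.error import HTTPError, URLError     # Python 3


def describe_fetch_error(error):
    """Return a short, human-readable reason for a failed fetch."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(error, HTTPError):
        if error.code == 403:
            return "403 Forbidden: the server refused this client"
        return "HTTP error %d" % error.code
    if isinstance(error, URLError):
        # Plain URLError means no HTTP response arrived (DNS, timeout, ...).
        return "could not reach the server: %s" % error.reason
    return "unexpected error: %r" % (error,)
```

In the handler from the question, the `except` branch could assign `describe_fetch_error(e)` to `result` in place of the fixed string, which makes a 403 visible in the rendered page.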
Set a custom User-Agent in your application, using one of the following approaches:
# Option 1: set the header on an opener; it applies to every request the opener makes
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'MyUserAgent')]
res = opener.open('http://whatsmyuseragent.com/')
page = res.read()

# Option 2: add the header to a single Request object
import urllib2
req = urllib2.Request('http://whatsmyuseragent.com/')
req.add_header('User-agent', 'MyUserAgent')
urllib2.urlopen(req)

# Option 3: pass the headers when constructing the Request
import urllib2
req = urllib2.Request("http://whatsmyuseragent.com/",
                      headers={"User-agent": "MyUserAgent"})
urllib2.urlopen(req)
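The options above can be wrapped in a small helper so that every request from the app carries the same header. This is only a sketch under assumptions: `build_request`, `fetch`, and the `USER_AGENT` value are hypothetical names chosen for illustration, and the fallback import is there so the code also runs on Python 3, where `urllib2` became `urllib.request`.

```python
try:
    import urllib2 as request_lib            # Python 2
except ImportError:
    import urllib.request as request_lib     # Python 3

# Hypothetical identifier; replace it with your own app name and contact details.
USER_AGENT = "MyWikiApp/0.1 (admin@example.org)"


def build_request(url, user_agent=USER_AGENT):
    """Build a request that carries an explicit User-Agent header."""
    return request_lib.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT):
    """Fetch url with the custom User-Agent and return the raw response body."""
    response = request_lib.urlopen(build_request(url, user_agent))
    try:
        return response.read()
    finally:
        response.close()
```

Keeping `build_request` separate from `fetch` makes it possible to check the header without touching the network, for example with `build_request(url).get_header("User-agent")`.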
Bonus link:
High-level Wikipedia Python clients:
http://www.mediawiki.org/wiki/API:Client_code#Python