Does Wikipedia allow URL fetching via Google App Engine?


Problem description


I am writing a Python web app and in it I plan to leverage Wikipedia. When trying out some URL Fetching code I was able to fetch both Google and Facebook (via Google App Engine services), but when I attempted to fetch wikipedia.org, I received an exception. Can anyone confirm that Wikipedia does not accept these types of page requests? How can Wikipedia distinguish between me and a user?

Code snippet (it's Python!):

import os
import urllib2
from google.appengine.ext import webapp          # needed for webapp.RequestHandler
from google.appengine.ext.webapp import template


class MainHandler(webapp.RequestHandler):
    def get(self):
        url = "http://wikipedia.org"
        try:
            result = urllib2.urlopen(url)
        except urllib2.URLError, e:
            result = 'ahh the sky is falling'
        template_values = {
            'test': result,
        }
        path = os.path.join(os.path.dirname(__file__), 'index.html')
        self.response.out.write(template.render(path, template_values))
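For reference, a blocked fetch like this raises urllib2.HTTPError, which subclasses urllib2.URLError, so the except clause above catches it and the HTTP status is available as e.code. A minimal sketch of that relationship, constructed locally with no network call (using the Python 3 module names, where urllib2 was split into urllib.request and urllib.error):

```python
import urllib.error

# Build the error object locally instead of hitting the network
# (this mirrors what urlopen raises when a request is refused).
err = urllib.error.HTTPError(
    url="http://wikipedia.org", code=403, msg="Forbidden", hdrs=None, fp=None)

try:
    raise err
except urllib.error.URLError as e:  # HTTPError subclasses URLError
    status = e.code                 # the HTTP status that came back
print(status)  # 403
```

Inspecting e.code inside the except clause is an easy way to tell a ban (403) apart from a network failure, which the catch-all string above hides.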

Solution

urllib2's default user agent is banned by Wikipedia, so the request fails with a 403 HTTP response.
You should set your application's user agent to something like this:

# Option 1
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'MyUserAgent')]
res = opener.open('http://whatsmyuseragent.com/')
page = res.read()

# Option 2
import urllib2
req = urllib2.Request('http://whatsmyuseragent.com/')
req.add_header('User-agent', 'MyUserAgent')
urllib2.urlopen(req)

# Option 3
req = urllib2.Request("http://whatsmyuseragent.com/",
                      headers={"User-agent": "MyUserAgent"})
urllib2.urlopen(req)
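The same idea carries over to Python 3, where urllib2 was merged into urllib.request. The user-agent string below is a hypothetical placeholder (Wikipedia's policy asks for a descriptive agent, ideally with contact information); the header can be checked on the Request object before anything is sent:

```python
import urllib.request

# Hypothetical descriptive user agent; substitute your app's name and contact.
ua = "MyGAEApp/1.0 (admin@example.com)"

req = urllib.request.Request("http://wikipedia.org",
                             headers={"User-Agent": ua})

# urllib normalizes header names to "Xxxx-xxxx" capitalization internally.
print(req.get_header("User-agent"))  # MyGAEApp/1.0 (admin@example.com)
# result = urllib.request.urlopen(req)  # would perform the actual fetch
```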

Bonus link:
High-level Wikipedia Python clients: http://www.mediawiki.org/wiki/API:Client_code#Python
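If you would rather call the MediaWiki API directly than use one of those clients, the query string can be assembled with the standard library; the parameters below are a hypothetical example of the API's action=query form:

```python
import urllib.parse

# Hypothetical query: fetch basic page info for one title as JSON.
params = {
    "action": "query",
    "titles": "Google App Engine",
    "prop": "info",
    "format": "json",
}
api_url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
print(api_url)
```

The resulting URL can then be fetched with either of the user-agent techniques shown above.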
