Does Wikipedia allow URL fetching via Google App Engine?


Problem description


I am writing a Python web app and in it I plan to leverage Wikipedia. When trying out some URL Fetching code I was able to fetch both Google and Facebook (via Google App Engine services), but when I attempted to fetch wikipedia.org, I received an exception. Can anyone confirm that Wikipedia does not accept these types of page requests? How can Wikipedia distinguish between me and a user?

Code snippet (it's Python!):

import os
import urllib2
from google.appengine.ext import webapp          # needed for webapp.RequestHandler
from google.appengine.ext.webapp import template


class MainHandler(webapp.RequestHandler):
    def get(self):
        url = "http://wikipedia.org"
        try:
            result = urllib2.urlopen(url)
        except urllib2.URLError, e:
            result = 'ahh the sky is falling'
        template_values = {
            'test': result,
        }
        path = os.path.join(os.path.dirname(__file__), 'index.html')
        self.response.out.write(template.render(path, template_values))
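For reference, a blocked fetch like this raises urllib2.HTTPError, which subclasses urllib2.URLError, so the except clause above catches it and the HTTP status is available as e.code. A minimal sketch of that relationship, constructed locally with no network call (using the Python 3 module names, where urllib2 was split into urllib.request and urllib.error):

```python
import urllib.error

# Build the error object locally instead of hitting the network
# (this mirrors what urlopen raises when a request is refused).
err = urllib.error.HTTPError(
    url="http://wikipedia.org", code=403, msg="Forbidden", hdrs=None, fp=None)

try:
    raise err
except urllib.error.URLError as e:  # HTTPError subclasses URLError
    status = e.code                 # the HTTP status that came back
print(status)  # 403
```

Inspecting e.code inside the except clause is an easy way to tell a ban (403) apart from a network failure, which the catch-all string above hides.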

Solution

urllib2's default user agent is banned by Wikipedia, so the request fails with a 403 HTTP response.
You should set your application's user agent to something like this:

# Option 1
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'MyUserAgent')]
res = opener.open('http://whatsmyuseragent.com/')
page = res.read()

# Option 2
import urllib2
req = urllib2.Request('http://whatsmyuseragent.com/')
req.add_header('User-agent', 'MyUserAgent')
urllib2.urlopen(req)

# Option 3
req = urllib2.Request("http://whatsmyuseragent.com/",
                      headers={"User-agent": "MyUserAgent"})
urllib2.urlopen(req)
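The same idea carries over to Python 3, where urllib2 was merged into urllib.request. The user-agent string below is a hypothetical placeholder (Wikipedia's policy asks for a descriptive agent, ideally with contact information); the header can be checked on the Request object before anything is sent:

```python
import urllib.request

# Hypothetical descriptive user agent; substitute your app's name and contact.
ua = "MyGAEApp/1.0 (admin@example.com)"

req = urllib.request.Request("http://wikipedia.org",
                             headers={"User-Agent": ua})

# urllib normalizes header names to "Xxxx-xxxx" capitalization internally.
print(req.get_header("User-agent"))  # MyGAEApp/1.0 (admin@example.com)
# result = urllib.request.urlopen(req)  # would perform the actual fetch
```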

Bonus link:
High-level Wikipedia Python clients: http://www.mediawiki.org/wiki/API:Client_code#Python
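If you would rather call the MediaWiki API directly than use one of those clients, the query string can be assembled with the standard library; the parameters below are a hypothetical example of the API's action=query form:

```python
import urllib.parse

# Hypothetical query: fetch basic page info for one title as JSON.
params = {
    "action": "query",
    "titles": "Google App Engine",
    "prop": "info",
    "format": "json",
}
api_url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
print(api_url)
```

The resulting URL can then be fetched with either of the user-agent techniques shown above.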
