Website is up and running but parsing it results in HTTP Error 503


Problem description


I want to crawl a webpage using the urllib2 library and extract some information according to my needs. I am able to freely navigate the site (going from one link to another and so on), but when I try to parse it I get an error:

HTTP Error 503 : Service Temporarily Unavailable

I searched the net and found that this error occurs when the "web site's server is not available at that time".

This confused me: if the website's server is down, how is the site up and running (since I am able to navigate its pages), and if the server is not down, why am I getting this 503 error?

Is there a possibility that the server has done something to prevent the parsing of the web page?

Thanks in advance.

Solution

Most probably your user-agent is banned from the server, so as to avoid, well, web crawlers. Some websites, including Wikipedia, return a 50x error when an unwanted user-agent (such as wget, curl, urllib, …) is used.

However, changing the user-agent might be enough. At least, that is the case for Wikipedia, which works just fine with a Firefox user-agent. (The "ban" most probably relies only on the user-agent.)

Finally, those websites must have a reason to ban web crawlers. Depending on what you're working on, you might want to use another solution. For example, Wikipedia provides database dumps, which can be convenient if you intend to make intensive use of its content.

PS. Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11 is the user-agent I use for Wikipedia in a project of mine.
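Setting that user-agent can be sketched like this. This is a minimal example, not the asker's exact code: the URL is a placeholder, and it uses Python 3's urllib.request, which exposes the same Request/urlopen API that urllib2 does in Python 2:

```python
# Minimal sketch: build the request with a browser-like User-Agent header,
# so the server does not reject it as a crawler. In Python 2, replace
# urllib.request with urllib2; the Request/urlopen calls are the same.
import urllib.request

url = "https://en.wikipedia.org/wiki/Main_Page"  # placeholder target
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; "
        "rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11"
    ),
}
req = urllib.request.Request(url, headers=headers)
# Uncomment to actually fetch the page (requires network access):
# html = urllib.request.urlopen(req).read()
```

Without the headers argument, urllib identifies itself as something like Python-urllib/3.x, which is exactly the kind of user-agent such servers filter on.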
