Using Python to download a document that's not explicitly referenced in a URL

Question

I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like:

http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En

So, two questions. Is there a general way to tell whether a URL links to a pdf/doc etc. file when it isn't linking to it explicitly (e.g. www.domain.com/file.pdf)? And is there a way to get Python to snag that file?
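
For reference, a minimal sketch of the extension-matching approach described above (Python 2.6; the helper name and the extension list are just illustrative):

    import urllib

    def download_if_document(url):
        # Crude check: only fetch URLs that explicitly end in a known extension.
        if url.lower().endswith((".pdf", ".ps", ".doc")):
            filename = url.rsplit("/", 1)[-1]
            urllib.urlretrieve(url, filename)
            return filename
        return None  # URLs like the OECD one above fall through here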

Thanks for the replies, several of which suggest downloading the file to see if it's of the correct type. The only problem is... I don't know how to do that (see question #2 above). urlretrieve(<above url>) gives only an html file with an href containing that same url.

Answer

In this case, what you refer to as "a document that's not explicitly referenced in a URL" seems to be what is known as a "redirect". Basically, the server tells you that you have to get the document at another URL. Normally, Python's urllib will automatically follow these redirects, so that you end up with the right file. (And, as others have already mentioned, you can check the response's mime-type header to see if it's a pdf.)
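
A minimal sketch of that mime-type check (Python 2.6, urllib2; note that the plain request against the OECD URL above will fail with a redirect error until cookies are handled, as explained below):

    import urllib2

    response = urllib2.urlopen("http://www.domain.com/file.pdf")
    # info() returns the headers of the final (post-redirect) response;
    # gettype() strips any "; charset=..." suffix from Content-Type.
    if response.info().gettype() == "application/pdf":
        print "looks like a pdf"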

However, the server in question is doing something strange here. You request the URL, and it redirects you to another URL. You request the other URL, and it redirects you again... to the same URL! And again... and again... At some point, urllib decides that this is enough and stops following the redirects, to avoid getting caught in an endless loop.
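
You can watch this happen by logging each redirect before urllib2 gives up; a small sketch (the handler name is mine, and I'm assuming the server answers with 302s):

    import urllib2

    class LoggingRedirectHandler(urllib2.HTTPRedirectHandler):
        def http_error_302(self, req, fp, code, msg, headers):
            # Print the target of each redirect before following it.
            print "redirected to:", headers.get("Location")
            return urllib2.HTTPRedirectHandler.http_error_302(
                self, req, fp, code, msg, headers)

    opener = urllib2.build_opener(LoggingRedirectHandler())
    try:
        opener.open("http://www.oecd.org/officialdocuments/displaydocument/"
                    "?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En")
    except urllib2.HTTPError, e:
        # urllib2 raises HTTPError once it suspects an endless redirect loop.
        print "gave up:", e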

So how come you are able to get the pdf when you use your browser? Because apparently the server will only serve the pdf if you have cookies enabled. (Why? You'd have to ask the people responsible for the server...) If you don't have the cookie, it will just keep redirecting you forever.

(Check the urllib2 and cookielib modules for cookie support; this tutorial might help.)
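
A minimal sketch of that, assuming the theory is right and a session cookie set on the first response is all the server wants:

    import cookielib
    import urllib2

    # The opener stores cookies from each response and sends them back on
    # the redirected requests, which should break out of the loop.
    cookie_jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

    url = ("http://www.oecd.org/officialdocuments/displaydocument/"
           "?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En")
    response = opener.open(url)

    if response.info().gettype() == "application/pdf":
        f = open("document.pdf", "wb")
        f.write(response.read())
        f.close()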

At least, that is what I think is causing the problem; I haven't actually tried doing it with cookies yet. It could also be that the server does not "want" to serve the pdf because it detects that you are not using a "normal" browser (in which case you would probably need to fiddle with the User-Agent header), but that would be a strange way of doing it. So my guess is that it uses a "session cookie" somewhere, and if you haven't got one yet, it keeps trying to redirect you.
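
If it does turn out to be User-Agent sniffing instead, that header is easy to fiddle with; a sketch (the browser string below is just an arbitrary example):

    import urllib2

    url = ("http://www.oecd.org/officialdocuments/displaydocument/"
           "?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En")
    request = urllib2.Request(url)
    # Pretend to be a "normal" browser.
    request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:10.0) "
                       "Gecko/20100101 Firefox/10.0")
    response = urllib2.urlopen(request)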
