Downloading pdf files using mechanize and urllib

Question

I am new to Python, and my current task is to write a web crawler that looks for PDF files in certain webpages and downloads them. Here's my current approach (just for 1 sample url):

import mechanize
import urllib
import sys

mech = mechanize.Browser()
mech.set_handle_robots(False)

url = "http://www.xyz.com"

try:
    mech.open(url, timeout=30.0)
except mechanize.HTTPError as e:
    # HTTPError is taken from the mechanize namespace so the name is defined
    sys.exit("%d: %s" % (e.code, e.msg))

links = mech.links()

for l in links:
    #Some are relative links
    path = str(l.base_url[:-1])+str(l.url)
    if path.find(".pdf") > 0:
        urllib.urlretrieve(path)

The program runs without any errors, but I am not seeing the pdf being saved anywhere. I am able to access the pdf and save it through my browser. Any ideas what's going on? I am using pydev (eclipse based) as my development environment, if that makes any difference.

Another question: if I want to give the pdf a specific name while saving it, how can I do that? Is this approach correct? Do I have to create a file with 'filename' before I can save the PDF?

urllib.urlretrieve(path, filename) 

Thanks.

Recommended answer

The urllib documentation says this about the urlretrieve function:

The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name).

The function's return value has the location of the file:

Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached).
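
To see that return value in practice, here is a minimal sketch. It assumes Python 3, where the same function lives at urllib.request.urlretrieve (on Python 2 it is urllib.urlretrieve, as in the question). The helper name fetch_to_temp is made up for illustration and is not part of urllib.

```python
import os
from urllib.request import urlretrieve  # Python 2: from urllib import urlretrieve


def fetch_to_temp(url):
    """Download url without naming a target and report where it landed.

    fetch_to_temp is a hypothetical helper, not part of urllib.
    """
    # With no second argument, urlretrieve chooses a temporary file itself
    # and returns its path together with the response headers.
    filename, headers = urlretrieve(url)
    return filename, os.path.getsize(filename), headers
```

Calling fetch_to_temp(path) inside the loop and printing the first element of the result shows where each PDF ended up. For a remote URL that is typically somewhere under the system temp directory rather than the project folder, which would explain why nothing appears next to the script.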

So, change this line:

urllib.urlretrieve(path)

to this:

(filename, headers) = urllib.urlretrieve(path)

and the path in filename will have the location. Optionally, pass in the filename argument to urlretrieve to specify the location yourself.
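
Tying this back to the loop in the question, one possible sketch follows. It assumes Python 3 (urllib.request and urllib.parse; the Python 2 equivalents are urllib.urlretrieve and urlparse.urljoin). The helper names pdf_targets and download_pdfs and the out_dir parameter are invented for illustration. urljoin is swapped in for the string concatenation in the question because it resolves relative and absolute links alike. The target file does not need to exist beforehand, since urlretrieve creates it, but the directory it goes into must already exist.

```python
import os
import posixpath
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


def pdf_targets(base_url, hrefs):
    """Return (absolute_url, local_name) pairs for hrefs that point at PDFs."""
    targets = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # handles relative and absolute links
        # Take the file name from the URL path only, ignoring any query string.
        name = posixpath.basename(urlparse(absolute).path)
        if name.lower().endswith(".pdf"):
            targets.append((absolute, name))
    return targets


def download_pdfs(base_url, hrefs, out_dir="."):
    """Save each PDF under out_dir using the file name taken from its URL."""
    saved = []
    for absolute, name in pdf_targets(base_url, hrefs):
        target = os.path.join(out_dir, name)
        # urlretrieve creates (or overwrites) target; out_dir must already exist.
        filename, _headers = urlretrieve(absolute, target)
        saved.append(filename)
    return saved
```

With mechanize, the inputs could come from the existing loop, for example download_pdfs(url, [l.url for l in mech.links()], out_dir="pdfs") once a pdfs directory has been created.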
