Python: urlretrieve PDF downloading


Problem description

I am using urllib's urlretrieve() function in Python to download some PDFs from websites. It has (at least for me) stopped working and is downloading damaged data (15 KB instead of 164 KB).

I have tested this with several PDFs (e.g. random.pdf), all without success. I can't get it to work, and I need to be able to download PDFs for the project I am working on.

Here is an example of the kind of code I am using to download the PDFs (and parse the text using pdftotext.exe):

def get_html(url): # gets html of page from Internet
    import os
    import urllib2
    import urllib
    from subprocess import call
    f_name = url.split('/')[-2] # get file name (url must end with '/')
    try:
        if f_name.split('.')[-1] == 'pdf': # file type
            urllib.urlretrieve(url, os.getcwd() + '\\' + f_name)
            call([os.getcwd() + '\\pdftotext.exe', os.getcwd() + '\\' + f_name]) # use xpdf to output .txt file
            return open(os.getcwd() + '\\' + f_name.split('.')[0] + '.txt').read()
        else:
            return urllib2.urlopen(url).read()
    except:
        print 'bad link: ' + url    
        return ""

I am a novice programmer, so any input would be great! Thanks

Recommended answer

I would suggest trying out requests. It is a really nice library that hides all of the implementation details behind a simple API.

>>> import requests
>>> req = requests.get("http://www.mathworks.com/moler/random.pdf")
>>> len(req.content)
167633
>>> req.headers
{'content-length': '167633', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.3 (Red Hat) mod_jk/1.2.31 PHP/5.3.13 Phusion_Passenger/3.0.9 mod_perl/2.0.4 Perl/v5.8.8', 'last-modified': 'Fri, 15 Feb 2008 17:11:12 GMT', 'connection': 'keep-alive', 'etag': '"30863b-28ed1-446357e3d4c00"', 'date': 'Sun, 03 Feb 2013 05:53:21 GMT', 'content-type': 'application/pdf'}
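The length and headers shown in that session are also what you can check to tell a real PDF from a small error page. Below is a minimal Python 3 sketch of such a check; the helper name is_probably_pdf is hypothetical (it is not part of requests), and it assumes the file starts with the usual "%PDF-" marker, which is typical but not guaranteed for every PDF in the wild.

```python
def is_probably_pdf(content, headers):
    """Return True if a downloaded body looks like a complete PDF.

    Hypothetical helper: `content` is the raw body as bytes and `headers`
    is any mapping of response header names to values.
    """
    # Lowercase the header names so a plain dict works as well as the
    # case-insensitive mapping that requests returns.
    lowered = {str(k).lower(): str(v) for k, v in headers.items()}

    # The server should label the body as a PDF, not as an HTML page.
    content_type = lowered.get("content-type", "").lower()
    if not content_type.startswith("application/pdf"):
        return False

    # If the server declared a length, the body should match it. Skip the
    # check when the body was compressed in transit, because the declared
    # length then describes the compressed bytes.
    declared = lowered.get("content-length")
    if (
        declared is not None
        and declared.isdigit()
        and "content-encoding" not in lowered
        and int(declared) != len(content)
    ):
        return False

    # Assumption: a well-formed PDF begins with the marker "%PDF-".
    return content.startswith(b"%PDF-")
```

With the session above, this would be called as is_probably_pdf(req.content, req.headers) before writing req.content to a file opened in binary mode.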

By the way, the reason you are only getting a 15 KB download is that your URL is wrong. It should be

http://www.mathworks.com/moler/random.pdf

but you are getting

http://www.mathworks.com/moler/random.pdf/

>>> import requests
>>> c = requests.get("http://www.mathworks.com/moler/random.pdf/")
>>> len(c.content)
14390
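The question's get_html() takes the file name from url.split('/')[-2], which only works when the URL ends with '/', and that is probably how the trailing slash ended up in the request. A minimal Python 3 sketch of one way to avoid this is shown below; the function names normalize_file_url and file_name_from_url are hypothetical, and the sketch assumes the URL path points at a single file.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize_file_url(url):
    """Return the URL with any trailing slashes removed from its path."""
    parts = urlsplit(url)
    # Only the path is changed; scheme, host, query and fragment are kept.
    return urlunsplit(parts._replace(path=parts.path.rstrip("/")))


def file_name_from_url(url):
    """Return the last path component, with or without a trailing slash."""
    path = urlsplit(url).path.rstrip("/")
    return posixpath.basename(path)


if __name__ == "__main__":
    url = "http://www.mathworks.com/moler/random.pdf/"
    print(normalize_file_url(url))
    print(file_name_from_url(url))
```

Passing the normalized URL to the download call and using the derived name for the local file means the same code works whether or not the caller appended a slash.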
