Download and save PDF file with Python requests module

Problem description

I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.

In [1]: import requests

In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'

In [3]: response = requests.get(url)

In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
   ...:     f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
      1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2     f.write(response.text)
      3 

UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)

In [5]: import codecs

In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
   ...:     f.write(response.text)
   ...: 

I know it is a codec problem of some kind but I can't seem to get it to work.

Solution

You should use response.content in this case:

with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
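
Since the question also mentions blank PDFs, here is a slightly more defensive sketch of the same idea. The helper name save_pdf is hypothetical, and it assumes only that the object passed in behaves like a requests.Response (it provides raise_for_status() and a bytes-valued content attribute). The "%PDF" check relies on PDF files normally starting with that marker; treat it as a sanity check, not a full validation.

```python
from pathlib import Path


def save_pdf(response, path):
    """Write a response body to disk as raw bytes and return its size.

    `response` is assumed to behave like a requests.Response: it must
    provide raise_for_status() and a bytes-valued .content attribute.
    """
    # Fail early on 4xx/5xx instead of saving an error page as a "PDF".
    response.raise_for_status()

    body = response.content
    # PDF files normally start with the marker "%PDF"; anything else is
    # probably an HTML error page rather than the document itself.
    if not body.startswith(b"%PDF"):
        raise ValueError("response body does not look like a PDF")

    Path(path).write_bytes(body)
    return len(body)
```

With requests this would be called as save_pdf(requests.get(url), '/tmp/metadata.pdf').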

From the documentation:

You can also access the response body as bytes, for non-text requests:

>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...

So that means: response.text returns the body decoded into a string object. Use it when you're downloading text, such as an HTML file.

And response.content returns the body as a bytes object. Use it when you're downloading a binary file, such as a PDF, an audio file or an image.
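
To see why the text form is unsafe for a PDF, here is a small standard-library sketch with no network access. The sample bytes are made up to resemble the start of a PDF, and decoding with errors="replace" is only an approximation of what happens when a binary body is treated as text.

```python
# Made-up sample: an ASCII header followed by a line of high-bit bytes,
# similar to the first lines of many PDF files.
pdf_bytes = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

# Treating binary data as text: byte sequences that are not valid in the
# chosen encoding are replaced with U+FFFD during decoding.
as_text = pdf_bytes.decode("utf-8", errors="replace")

# Encoding the text again does not give back the original bytes.
round_tripped = as_text.encode("utf-8")

print(pdf_bytes == round_tripped)  # False
print(len(pdf_bytes), len(round_tripped))
```

The bytes that were lost during decoding cannot be recovered afterwards, which is why the file has to be written from response.content.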


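You can also stream the download instead of reading the whole body at once (this is what stream=True together with response.iter_content, or the lower-level response.raw, is for). That is mainly worthwhile when the file you're about to download is large. Below is a basic example, which you can also find in the documentation: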

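import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
chunk_size = 2000  # bytes per chunk; the value used in the explanation below

# stream=True defers downloading the body until it is iterated over
r = requests.get(url, stream=True)

with open('/tmp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)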

chunk_size is the size, in bytes, of each chunk. If you set it to 2000, requests reads up to 2000 bytes of the response, hands them to you to write to the file, and repeats this until the whole body has been downloaded.

This keeps memory use low, because the whole file is never held in RAM at once. In this case, though, I'd prefer to use response.content, since your file is small. As you can see, streaming the download is more complex.
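
The chunked loop can be tried without a network connection. In the sketch below both helper names (write_chunks and fake_chunks) are hypothetical, and fake_chunks merely imitates the way a streamed response yields pieces of at most chunk_size bytes.

```python
def write_chunks(chunks, path):
    """Write an iterable of bytes chunks to `path`; return bytes written."""
    total = 0
    with open(path, "wb") as fd:
        for chunk in chunks:
            # Only the current chunk is held in memory.
            fd.write(chunk)
            total += len(chunk)
    return total


def fake_chunks(data, chunk_size):
    """Stand-in for a streamed body: yield `data` in pieces."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]
```

With requests, the first argument would be r.iter_content(chunk_size=2000), where r = requests.get(url, stream=True).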

