Download and save a PDF file with the Python requests module
Problem description
I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.
In [1]: import requests
In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
In [3]: response = requests.get(url)
In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
   ...:     f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
      1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2     f.write(response.text)
      3
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)
In [5]: import codecs
In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
   ...:     f.write(response.text)
   ...:
I know it is a codec problem of some kind but I can't seem to get it to work.
You should use response.content in this case:
with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
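As a minimal sketch, the same idea can be wrapped in a function. The helper name `save_pdf`, the `timeout` value, the injectable `get` parameter and the `%PDF` signature check are all assumptions added for illustration, not part of the original answer. It fetches the URL, raises on an HTTP error status, and writes the raw bytes to disk.

```python
def save_pdf(url, path, get=None, timeout=30):
    """Download url and write the raw response bytes to path (hypothetical helper)."""
    if get is None:
        # Imported lazily so the function can also be exercised with a stand-in `get`.
        import requests
        get = requests.get

    response = get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page

    data = response.content  # bytes, not decoded text
    # Assumption: a PDF normally begins with the bytes "%PDF";
    # this sketch treats anything else as an error.
    if not data.startswith(b'%PDF'):
        raise ValueError('response does not look like a PDF')

    with open(path, 'wb') as f:
        f.write(data)
    return len(data)
```

With the URL from the question, a call might look like `save_pdf(url, '/tmp/metadata.pdf')`.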
From the documentation:
You can also access the response body as bytes, for non-text requests:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...
In other words, response.text returns the body as a string object; use it when you're downloading text, such as an HTML file.
response.content returns the body as a bytes object; use it when you're downloading a binary file, such as a PDF, an audio file, or an image.
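The difference matters because decoding arbitrary bytes to text and encoding them again does not, in general, give back the original bytes. The following self-contained sketch uses a made-up 15-byte PDF-like header and assumes, purely for illustration, that the body was decoded as Latin-1; the encoding requests actually picks for a given response may differ.

```python
# A made-up PDF-like header: ASCII text followed by four non-ASCII bytes.
raw = b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\n'

# Assumption for illustration: the body was decoded as Latin-1.
text = raw.decode('latin-1')

# Re-encoding as UTF-8 turns each non-ASCII character into two bytes,
# so the data written to disk no longer matches what the server sent.
reencoded = text.encode('utf-8')
print(len(raw), len(reencoded))
print(raw == reencoded)

# Encoding as ASCII fails outright on the non-ASCII characters.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

Writing response.content skips the decode step entirely, which is why it preserves the file byte for byte.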
You can also stream the download (via response.raw or, more conveniently, iter_content) instead of using response.content. That is mainly worthwhile when the file you are about to download is large. Below is a basic example, which you can also find in the documentation:
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
chunk_size = 2000  # bytes per chunk
r = requests.get(url, stream=True)  # stream=True defers downloading the body
with open('/tmp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
chunk_size is the chunk size you want to use. If you set it to 2000, requests reads up to 2000 bytes of the file, writes them to disk, and repeats until the download is finished.
This keeps memory use low. In this case, though, I'd prefer response.content, since your file is small. As you can see, streaming the response is more involved.
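To make the chunking concrete, here is a small sketch that does not touch the network. `iter_chunks` is a hypothetical stand-in for what `iter_content` hands you, not how requests implements it; it slices an in-memory body into pieces of at most `chunk_size` bytes, and the loop writes each piece out as it arrives.

```python
import io

def iter_chunks(body, chunk_size):
    """Yield successive slices of body, each at most chunk_size bytes (illustrative only)."""
    for start in range(0, len(body), chunk_size):
        yield body[start:start + chunk_size]

body = b'x' * 5000          # pretend this is a 5000-byte download
out = io.BytesIO()          # stands in for the file opened with 'wb'
sizes = []
for chunk in iter_chunks(body, 2000):
    out.write(chunk)        # the loop handles one chunk at a time
    sizes.append(len(chunk))

print(sizes)
```

In this demo the whole body sits in memory only because it is fabricated locally; with a streamed response, each chunk is read from the connection as the loop asks for it.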