How to download an MS Word docx file in Python with raw data from an HTTP URL
Question
Opening the URL below in a browser downloads a docx file. I want to automate that download with Python.
I have tried the following:
from docx import Document
import requests
import json
from bs4 import BeautifulSoup
dwnurl = 'https://hudoc.echr.coe.int/app/conversion/docx/?library=ECHR&id=001-176931&filename=CASE%20OF%20NDIDI%20v.%20THE%20UNITED%20KINGDOM.docx&logEvent=False'
doc = requests.get(dwnurl)
print(doc.content) #printing the document like b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00!\xfb\x16\x01\x16\x02\x00\x00\xec\x0c\x00\x00\x13\x00\xc4\x01[Content_Types].xml \xa2\xc0\
print(doc.raw) #printing the document like <urllib3.response.HTTPResponse object at 0x063D8BD0>
document = Document(doc.content)
document.save('test.docx')
# Running this script fails with the following error:
Traceback (most recent call last):
  File "scraping_hudoc.py", line 40, in <module>
    document = Document(doc.content)
  File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\site-packages\docx\api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\site-packages\docx\opc\package.py", line 116, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\site-packages\docx\opc\phys_pkg.py", line 101, in __init__
    self._zipf = ZipFile(pkg_file, 'r')
  File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 1108, in __init__
    self._RealGetContents()
  File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 1171, in _RealGetContents
    endrec = _EndRecData(fp)
  File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 241, in _EndRecData
    fpin.seek(0, 2)
AttributeError: 'bytes' object has no attribute 'seek'
Recommended answer
I saved the MS Word docx file with the following:
import requests

def save_link(book_link, book_name):
    # Stream the response so the whole file is not held in memory at once
    the_book = requests.get(book_link, stream=True)
    with open(book_name, 'wb') as f:
        for chunk in the_book.iter_content(1024 * 1024 * 2):  # 2 MB chunks
            f.write(chunk)

save_link("https://hudoc.echr.coe.int/app/conversion/docx/?library=ECHR&id=001-176931&filename=CASE%20OF%20NDIDI%20v.%20THE%20UNITED%20KINGDOM.docx&logEvent=False", "CASE OF NDIDI v. THE UNITED KINGDOM.docx")
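As for why the original attempt failed: the traceback ends in ZipFile calling seek() on the value passed to Document, and a bytes object has no such method. Document accepts a file path or a file-like object, so if you want to open the download with python-docx without saving it first, wrapping the bytes in io.BytesIO should work. Below is a minimal sketch of that idea, not a tested solution for this particular site. The helper names (as_file_like, looks_like_docx, load_document) are my own, and the check for [Content_Types].xml is only a rough sanity test based on the bytes printed in the question.

```python
import io
import zipfile


def as_file_like(raw_bytes):
    """Wrap raw bytes in a seekable in-memory buffer."""
    return io.BytesIO(raw_bytes)


def looks_like_docx(raw_bytes):
    """Rough check: a ZIP archive that contains [Content_Types].xml."""
    buffer = as_file_like(raw_bytes)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        return "[Content_Types].xml" in archive.namelist()


def load_document(url):
    """Hypothetical helper: download a .docx and open it with python-docx."""
    # Third-party imports are kept local so the helpers above
    # can be used with the standard library alone.
    import requests
    from docx import Document

    response = requests.get(url)
    response.raise_for_status()  # stop early on HTTP errors
    if not looks_like_docx(response.content):
        raise ValueError("response does not look like a .docx file")
    # Document() needs a path or file-like object, not bytes
    return Document(as_file_like(response.content))
```

With this, something like load_document(dwnurl).save('test.docx') would replace the last two lines of the script in the question. If you only need the file on disk, the streaming approach above is simpler.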