Downloading files in chunks in Python?
Question
I am writing a simple synchronous download manager that downloads a video file in 10 sections. I use requests to read the Content-Length from the response headers, split the file into 10 byte ranges, download each range, and then merge them to form the complete video. The code below is supposed to work this way, but the merged file only plays for a few seconds and is corrupted after that. What is wrong with my code?
import requests
import os

def intervals(parts, duration):
    part_duration = duration // parts
    return [(i * part_duration, (i + 1) * part_duration) for i in range(parts)]

home = os.path.expanduser("~")
if not os.path.exists(home+'/Desktop/temp'):
    os.makedirs(home+'/Desktop/temp')
PATH = home+"/Desktop/temp/tmp.mp4"

example_file_url = "https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4"
req = requests.head(example_file_url)
size = int(req.headers['Content-Length'])
content_section = 10
section_intervals = intervals(content_section,size)

with open(PATH, "wb") as file:
    for i,(start,end) in enumerate(section_intervals):
        headers = {"Range": "bytes="+str(start)+"-"+str(end)}
        print(headers)
        r = requests.get(example_file_url, headers=headers)
        file.write(r.content)
Recommended answer
The problem
Your ranges are wrong because the interval in a Range header gives the first and the last byte offset, both inclusive. For example, bytes=0-10 means the 11 bytes at offsets 0 through 10 (unlike slices in Python), so bytes=0-10 and bytes=10-20 overlap. You would need bytes=0-9 followed by bytes=10-19 instead.
See the example in this documentation:
A header requesting the first 1024 bytes...
Range: bytes=0-1023
(whereas the Python slice [0:1023] has length 1023).
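To make the inclusive end offset concrete, here is a minimal sketch that compares Range spans with Python slices on an in-memory byte string. The helper name range_length and the sample data are assumptions made for illustration; no network request is involved.

```python
def range_length(start, end):
    """Number of bytes selected by 'Range: bytes=start-end' (end is inclusive)."""
    return end - start + 1

# Stand-in for a remote file: 32 bytes with values 0..31.
data = bytes(range(32))

# bytes=0-10 selects the same bytes as the slice data[0:11], not data[0:10].
first = data[0:10 + 1]
# bytes=10-20 selects data[10:21], so it starts on the byte that 'first' ended on.
second = data[10:20 + 1]

print(range_length(0, 10), len(first))   # both are 11
print(first[-1] == second[0])            # True: offset 10 appears in both spans
```

The slice equivalent of bytes=a-b is therefore data[a:b + 1].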
Where you say that it "works for seconds and after that gets corrupted", I assume you mean that the first few seconds of decoded MP4 output are valid. It breaks at the end of the first downloaded part, where the final byte of the first part is duplicated at the start of the second part.
Another problem is that your total length is wrong: you do integer division by parts, which discards the fractional part, so when you multiply back up the result falls short of the real size.
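The arithmetic behind that shortfall can be checked directly. This is a small sketch using the Content-Length of the example file (9840497, as reported in the output further down); the variable names are chosen here for illustration.

```python
size = 9840497   # Content-Length of the example file
parts = 10

part_size = size // parts            # floor division drops the fractional part
remainder = size % parts             # bytes that floor division leaves unassigned
last_requested = part_size * parts   # end offset of the final range in the original code
last_valid = size - 1                # offset of the file's final byte

print(part_size, remainder, last_requested, last_valid)
```

Because the end offset of a Range header is inclusive, the final original range still reaches offset last_requested, so the number of bytes never requested is last_valid - last_requested.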
Change your intervals function to this, and it works:
import math

def intervals(parts, duration):
    part_duration = math.ceil(duration / parts)
    return [(start, min(start + part_duration - 1, duration - 1))
            for start in range(0, duration, part_duration)]
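As a rough sketch of how the corrected function could slot into the download loop, the version below separates the range arithmetic from the HTTP call so that it can be exercised without a network. The names download_in_parts, fetch_range and make_http_fetcher are hypothetical helpers introduced here, not part of the original code; the length check and the 206 status check are extra safeguards I am assuming are desirable, since a server that ignores Range may answer 200 with the whole body.

```python
import math

def intervals(parts, duration):
    part_duration = math.ceil(duration / parts)
    return [(start, min(start + part_duration - 1, duration - 1))
            for start in range(0, duration, part_duration)]

def download_in_parts(fetch_range, size, out, parts=10):
    """Write `size` bytes to `out`, fetching them as inclusive byte ranges.

    fetch_range(start, end) must return the bytes at offsets start..end.
    """
    for start, end in intervals(parts, size):
        chunk = fetch_range(start, end)
        expected = end - start + 1
        if len(chunk) != expected:
            raise ValueError(f"range {start}-{end}: got {len(chunk)} bytes, "
                             f"expected {expected}")
        out.write(chunk)

def make_http_fetcher(url):
    """Hypothetical adapter that fetches ranges from `url` with requests."""
    import requests  # imported lazily so the rest of the sketch runs without it

    def fetch_range(start, end):
        r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
        if r.status_code != 206:  # 206 Partial Content means the range was honoured
            raise RuntimeError(f"unexpected status {r.status_code}")
        return r.content

    return fetch_range
```

With the question's variables, usage would look something like download_in_parts(make_http_fetcher(example_file_url), size, file) inside the existing with open(PATH, "wb") as file: block.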
Checking the ranges
Inserting these print statements:
print("Size =", size)
print(section_intervals)
now gives:
Size = 9840497
[(0, 984049), (984050, 1968099), (1968100, 2952149), (2952150, 3936199), (3936200, 4920249), (4920250, 5904299), (5904300, 6888349), (6888350, 7872399), (7872400, 8856449), (8856450, 9840496)]
whereas using your original intervals function, it gives:
Size = 9840497
[(0, 984049), (984049, 1968098), (1968098, 2952147), (2952147, 3936196), (3936196, 4920245), (4920245, 5904294), (5904294, 6888343), (6888343, 7872392), (7872392, 8856441), (8856441, 9840490)]
Note the overlapping ranges and the bytes missing from the end.
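The same comparison can be made programmatically instead of by eye. The sketch below counts duplicated and missing bytes for a list of inclusive ranges; coverage_problems is a hypothetical helper written for this illustration, and it assumes the ranges are sorted by start offset, as both versions of intervals produce them.

```python
import math

def original_intervals(parts, duration):
    part_duration = duration // parts
    return [(i * part_duration, (i + 1) * part_duration) for i in range(parts)]

def fixed_intervals(parts, duration):
    part_duration = math.ceil(duration / parts)
    return [(start, min(start + part_duration - 1, duration - 1))
            for start in range(0, duration, part_duration)]

def coverage_problems(ranges, size):
    """Return (overlapping_bytes, missing_bytes) for sorted inclusive ranges."""
    overlapping = 0
    missing = 0
    expected_start = 0  # first offset not yet covered
    for start, end in ranges:
        if start < expected_start:
            overlapping += expected_start - start
        elif start > expected_start:
            missing += start - expected_start
        expected_start = max(expected_start, end + 1)
    missing += max(0, size - expected_start)  # bytes after the last range
    return overlapping, missing

size = 9840497
print(coverage_problems(original_intervals(10, size), size))
print(coverage_problems(fixed_intervals(10, size), size))
```

A result of (0, 0) means the ranges tile the file exactly once.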
We can verify the download at the end by calculating a checksum. In this example, I use md5sum from the Linux command line (although cksum would also work, as there is no need for a cryptographic checksum for this purpose).
I call the output file myoutput.
$ md5sum myoutput
10c918b1d01aea85864ee65d9e0c2305 myoutput
Now I also download a copy directly with wget <url> and see that it has the same checksum.
$ wget https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4
--2020-07-21 08:26:52-- https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4
$ md5sum file_example_MP4_1280_10MG.mp4
10c918b1d01aea85864ee65d9e0c2305 file_example_MP4_1280_10MG.mp4
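If md5sum is not available, the same check can be done from Python with the standard hashlib module. This is a minimal sketch; file_md5 is a helper name chosen here, and the file names in the commented usage line are the ones from the commands above.

```python
import hashlib

def file_md5(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Hypothetical usage, assuming both files are in the current directory:
# print(file_md5("myoutput") == file_md5("file_example_MP4_1280_10MG.mp4"))
```

Reading in blocks keeps memory use flat regardless of the video's size.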