Downloading files in chunks in Python?

Question

I am writing a simple synchronous download manager that downloads a video file in 10 sections. I use requests to get the Content-Length from the headers, split the file into 10 byte ranges, download each range, and then merge them to form the complete video. The code below is supposed to work this way, but the merged file only plays for a few seconds and is corrupted after that. What is wrong with my code?

import requests
import os

def intervals(parts, duration):
    part_duration = duration // parts
    return [(i * part_duration, (i + 1) * part_duration) for i in range(parts)]

home = os.path.expanduser("~")
if not os.path.exists(home+'/Desktop/temp'):
    os.makedirs(home+'/Desktop/temp')

PATH = home+"/Desktop/temp/tmp.mp4"

example_file_url = "https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4"


req = requests.head(example_file_url)

size = int(req.headers['Content-Length'])

content_section = 10

section_intervals = intervals(content_section,size)


with  open(PATH, "wb") as file:
    for i,(start,end) in enumerate(section_intervals):
        headers = {"Range": "bytes="+str(start)+"-"+str(end)}
        print(headers)
        r = requests.get(example_file_url, headers=headers)
        file.write(r.content)

Recommended answer

The problem

Your ranges are wrong because the interval specified by a Range header gives the first and the last offset, both inclusive. For example, bytes=0-10 means 11 bytes, from offset 0 to offset 10 (unlike how slices work in Python), so bytes=0-10 and bytes=10-20 overlap. You would need bytes=0-9 followed by bytes=10-19 instead.

See the example in this documentation:

A header requesting the first 1024 bytes ... Range: bytes=0-1023

(whereas [0:1023] in a Python slice would have length 1023).
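
To make the off-by-one concrete, here is a small sketch comparing the byte count of an inclusive Range header with a Python slice over the same numbers. The helper name range_length is made up for this illustration, and data is just a stand-in for a file's contents.

```python
def range_length(first, last):
    """Number of bytes selected by 'Range: bytes=first-last' (both ends inclusive)."""
    return last - first + 1

data = bytes(2048)  # stand-in for a file's contents (assumption for the demo)

print(range_length(0, 1023))   # 1024: offsets 0 through 1023
print(len(data[0:1023]))       # 1023: a slice excludes its end index
print(len(data[0:1023 + 1]))   # 1024: the slice equivalent of bytes=0-1023
```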

Where you say that it "works for seconds and after that gets corrupted", I assume you mean that the first few seconds of decoded MP4 output are valid. It breaks at the end of the first downloaded part, where the final byte of the first part is duplicated at the start of the second part.

Another problem is that your total length is wrong: you do integer division by parts, so by the time you multiply it back up, the remainder has been lost.
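
The rounding loss can be checked with the numbers from this question (a Content-Length of 9840497 and 10 parts). This is only an illustration of the arithmetic, not part of the download code.

```python
size = 9840497   # Content-Length of the example file
parts = 10

part_size = size // parts      # integer division drops the remainder
covered = part_size * parts    # what the equal parts add up to afterwards

print(part_size)               # 984049
print(covered)                 # 9840490
print(size - covered)          # 7, the remainder lost by the division
```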

Change your intervals function to this, and it works:

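import math

def intervals(parts, duration):
    # Round the part size up so that the parts cover every byte.
    part_duration = math.ceil(duration / parts)
    # Range offsets are inclusive, so each part ends one byte before the
    # next one starts; the last part is clamped to the final byte offset.
    return [(start, min(start + part_duration - 1, duration - 1))
            for start in range(0, duration, part_duration)]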
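
Plugging inclusive ranges into the download loop might look like the sketch below. The function name download_ranges and its get parameter are hypothetical: get exists only so the loop can be exercised without a network, and it defaults to requests.get. The check for status 206 (Partial Content) is an addition that is not in the original code. A server that ignores the Range header typically replies 200 with the whole file, which would also corrupt the merged output.

```python
def download_ranges(url, path, ranges, get=None):
    """Fetch each inclusive (first, last) byte range and write them to path in order."""
    if get is None:
        import requests  # imported lazily so a stand-in can be passed instead
        get = requests.get
    with open(path, "wb") as file:
        for first, last in ranges:
            headers = {"Range": f"bytes={first}-{last}"}
            r = get(url, headers=headers)
            # 206 Partial Content means the server honoured the range.
            if r.status_code != 206:
                raise RuntimeError(
                    f"expected 206 for {headers}, got {r.status_code}")
            file.write(r.content)
```

With the names from the question, the call would be download_ranges(example_file_url, PATH, intervals(content_section, size)).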

Checking the ranges

Insert print statements:

print("Size = ", size)
print(section_intervals)

This now gives:

Size =  9840497
[(0, 984049), (984050, 1968099), (1968100, 2952149), (2952150, 3936199), (3936200, 4920249), (4920250, 5904299), (5904300, 6888349), (6888350, 7872399), (7872400, 8856449), (8856450, 9840496)]

whereas using your original intervals function, it gives:

Size =  9840497
[(0, 984049), (984049, 1968098), (1968098, 2952147), (2952147, 3936196), (3936196, 4920245), (4920245, 5904294), (5904294, 6888343), (6888343, 7872392), (7872392, 8856441), (8856441, 9840490)]

Note the overlapping ranges and the bytes missing from the end.
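
Both observations can also be checked mechanically. The sketch below walks a list of inclusive intervals and reports overlaps, gaps, and uncovered bytes at the end. The helper name find_problems is made up for this example, and it assumes the intervals are sorted by start offset.

```python
def find_problems(ranges, size):
    """Describe overlaps, gaps, and a missing tail in inclusive byte ranges."""
    problems = []
    expected_start = 0  # first offset not yet covered
    for first, last in ranges:
        if first < expected_start:
            problems.append(f"overlap: offset {first} was already covered")
        elif first > expected_start:
            problems.append(f"gap: {expected_start}-{first - 1} not covered")
        expected_start = last + 1
    if expected_start < size:
        problems.append(f"missing tail: {expected_start}-{size - 1}")
    return problems

good = [(0, 4), (5, 9), (10, 11)]   # contiguous, covers offsets 0..11
bad = [(0, 4), (4, 8), (8, 10)]     # shared end points, stops short

print(find_problems(good, 12))      # []
print(find_problems(bad, 12))       # two overlaps and a missing tail
```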

We can verify the download at the end by calculating a checksum. In this example, I use md5sum from the Linux command line (cksum would also work, since a cryptographic checksum is not needed for this purpose).

I named the output myoutput.

$ md5sum myoutput
10c918b1d01aea85864ee65d9e0c2305  myoutput

Now I also download a copy directly with wget <url> and see that it has the same checksum.

$ wget https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4
--2020-07-21 08:26:52--  https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4

$ md5sum file_example_MP4_1280_10MG.mp4 
10c918b1d01aea85864ee65d9e0c2305  file_example_MP4_1280_10MG.mp4
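
The same comparison can be made from Python with the standard hashlib module. This is a minimal sketch; the function name file_md5 and the chunk size are arbitrary choices for the example.

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the merged file is intact, file_md5("myoutput") should equal file_md5("file_example_MP4_1280_10MG.mp4").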
