Downloading over 1000 files in python


Question

So, maybe start from my code:

import os
import re
import urllib.request
from threading import Thread

def download(fn, filename, index):
    urllib.request.urlretrieve(fn,
                     os.path.join('music', re.sub('[%s]' % ''.join(CHAR_NOTALLOWED), '', filename) + '.mp3'))
    print(str(index) + '# DOWNLOADED: ' + filename)

for index, d in enumerate(found):
    worker = Thread(target=download, args=(found[d], d, index))
    worker.setDaemon(True)
    worker.start()
worker.join()

My problem is that when I tried to download over 1000 files I always get this error, but I don't know why:

Traceback (most recent call last):
  File "E:/PythonProject/1.1/mp3y.py", line 238, in <module>
    worker.start()
  File "E:\python34\lib\threading.py", line 851, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

I tried using a queue, but got the same error.... I wanted to split this work into parts, but I don't know how :O

Answer

Short version:

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    for index, d in enumerate(found):
        executor.submit(download, found[d], d, index)

That's it; a trivial change, two fewer lines than your existing code, and you're done.

So, what's wrong with your existing code? Starting 1000 threads at a time is always a bad idea.* Once you get beyond a few dozen, you're adding more scheduler and context-switching overhead than you're gaining in concurrency.

If you want to know why it fails right around 1000, that could be because of a library working around older versions of Windows,** or it could be because you're running out of stack space.*** But either way, it doesn't really matter. The right solution is to not use so many threads.
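For what it's worth, CPython does expose a knob for the per-thread stack size, threading.stack_size(); a minimal sketch of how you'd shrink it (whether a smaller stack actually raises the thread limit depends on the platform):

```python
import threading

# 0 means "use the platform default" (typically around 1MB per thread).
default = threading.stack_size()

# Request a 256 KiB stack for threads created after this call; the minimum
# is 32 KiB, and some platforms require a multiple of 4 KiB.
threading.stack_size(256 * 1024)

t = threading.Thread(target=lambda: None)
t.start()
t.join()

threading.stack_size(0)  # restore the platform default for later threads
```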

The usual solution is to use a thread pool—start about 8-12 threads,**** and have them pull the URLs to download off a queue. You can build this yourself, or you can use the concurrent.futures.ThreadPoolExecutor or multiprocessing.dummy.Pool that come with the stdlib. If you look at the main ThreadPoolExecutor Example in the docs, it's doing almost exactly what you want. In fact, what you want is even simpler, because you don't care about the results.
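The multiprocessing.dummy.Pool alternative looks much the same. Here's a minimal sketch with a stand-in download function and a hypothetical `found` dict standing in for the question's data (a real version would call urllib.request.urlretrieve inside `download`):

```python
from multiprocessing.dummy import Pool  # thread pools with the multiprocessing API

# Stand-in for the question's download(fn, filename, index); it just returns
# the status string instead of fetching anything.
def download(fn, filename, index):
    return str(index) + '# DOWNLOADED: ' + filename

# Hypothetical filename -> URL mapping, like the question's `found` dict.
found = {'song-a': 'http://example.com/a.mp3', 'song-b': 'http://example.com/b.mp3'}

with Pool(8) as pool:  # at most 8 downloads in flight at any moment
    results = pool.starmap(
        download, ((url, name, i) for i, (name, url) in enumerate(found.items()))
    )
```

The pool feeds the whole iterable through its 8 worker threads, so you never create more threads than workers no matter how many URLs there are.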

As a side note, you've got another serious problem in your code. If you daemonize your threads, you're not allowed to join them. Also, you're only trying to join the last one you created, which is by no means guaranteed to be the last one to finish. Also, daemonizing download threads is probably a bad idea in the first place, because when your main thread finishes (after waiting for one arbitrarily-chosen download to finish) the others may get interrupted and leave partial files behind.

Also, if you do want to daemonize a thread, the best way is to pass daemon=True to the constructor. If you need to do it after creation, just do t.daemon = True. Only call the deprecated setDaemon function if you need backward compatibility to Python 2.5.
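If you do stick with bare threads, the fix for the join problem is to keep a reference to every thread and join them all. A sketch with a trivial stand-in task in place of real downloads:

```python
from threading import Thread

results = []

def work(i):
    results.append(i * 2)  # stand-in for a download task

# Pass daemon=True to the constructor, keep a reference to every thread,
# and join them all -- not just the last one created.
threads = [Thread(target=work, args=(i,), daemon=True) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # the main thread now waits for every worker to finish
```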

* I guess I shouldn't say always, because in 2025 it'll probably be an everyday thing to do, to take advantage of your thousands of slow cores. But in 2014 on normal laptop/desktop/server hardware, it's always bad.

** Older versions of Windows (at least NT 4) have all kinds of strange bugs when you get close to 1024 threads, so many threading libraries just refuse to create more than 1000 of them. Although that doesn't seem to be the case here, because Python just calls Microsoft's own wrapper function _beginthreadex, which doesn't do that.

*** By default, each thread gets 1MB of stack space. In a 32-bit app, there's a maximum total stack space, which I'd assume defaults to 1GB on your version of Windows. You can customize both the per-thread stack space and the total process stack space, but Python doesn't customize either, and neither do almost any other apps.

**** Unless your downloads are all coming from the same server, in which case you probably want at most 4, and really more than 2 is usually considered impolite if it's not your own server. Why 8-12, anyway? It's a rule of thumb that tested well a long time ago. It may not be optimal anymore, but it's probably close enough for most purposes. If you really need to squeeze out more performance, you can test with different numbers.
