Python, multi-threads, fetch webpages, download webpages

Problem Description

I want to batch download webpages from one site. There are 5,000,000 URL links in my 'urls.txt' file; it's about 300 MB. How can I use multiple threads to fetch these URLs and download the webpages? Or how can I batch download them?

My idea:

with open('urls.txt', 'r') as f:
    for el in f:
        pass  # fetch these urls
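
For reference, here is a rough multi-threaded version of this sketch using only the Python 3 standard library; the worker count, timeout, and error handling are all assumptions, not recommendations:

import queue
import threading
import urllib.request

NUM_WORKERS = 20                  # assumed; tune for your connection
work = queue.Queue(maxsize=1000)  # bounded, so the 300 MB file is read lazily

def worker():
    while True:
        url = work.get()
        if url is None:           # sentinel: no more work
            return
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                data = resp.read()
            # ...write data to disk here...
        except Exception as exc:
            print('failed:', url, exc)

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

with open('urls.txt') as f:
    for line in f:
        work.put(line.strip())
for _ in threads:
    work.put(None)                # one sentinel per worker
for t in threads:
    t.join()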

Or Twisted?

Is there a good solution for it?

Solution

If this isn't part of a larger program, then notnoop's idea of using some existing tool to accomplish this is a pretty good one. If a shell loop invoking wget solves your problem (for example, something like xargs -n 1 -P 8 wget < urls.txt to keep eight downloads running in parallel), that'll be a lot easier than anything involving more custom software development.

However, if you need to fetch these resources as part of a larger program, then doing it with shell may not be ideal. In this case, I'd strongly recommend Twisted, which makes it easy to do many requests in parallel.

A few years ago I wrote up an example of how to do just this. Take a look at http://jcalderone.livejournal.com/24285.html.
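
A rough, condensed sketch of that pattern, assuming Twisted's Agent, readBody, and DeferredSemaphore APIs; the concurrency limit and file handling here are illustrative, and the linked post is the authoritative version:

from twisted.internet import defer, reactor
from twisted.web.client import Agent, readBody

CONCURRENCY = 100  # assumed cap on simultaneous requests

def fetch(agent, url):
    # Issue a GET for one url and read the whole response body.
    d = agent.request(b'GET', url.strip().encode('ascii'))
    d.addCallback(readBody)
    return d

def main():
    agent = Agent(reactor)
    sem = defer.DeferredSemaphore(CONCURRENCY)
    with open('urls.txt') as f:
        # For 5,000,000 urls you would feed these in lazily instead of
        # building one big list, but this shows the shape of the approach.
        work = [sem.run(fetch, agent, line) for line in f]
    defer.DeferredList(work, consumeErrors=True).addBoth(lambda ign: reactor.stop())
    reactor.run()

if __name__ == '__main__':
    main()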
