How to use multiprocessing to loop through a big list of URLs?


Problem Description

Problem: check a list of over 1000 URLs and get each URL's return code (status_code).

My script works, but it is very slow.

I am thinking there has to be a better, more pythonic (more beautiful) way of doing this, where I can spawn 10 or 20 threads to check the URLs and collect the responses, i.e.:

200 -> www.yahoo.com
404 -> www.badurl.com
...


Input file: url10.txt

www.example.com
www.yahoo.com
www.testsite.com

....

import requests

with open("url10.txt") as f:
    urls = f.read().splitlines()

print(urls)
for url in urls:
    url =  'http://'+url   #Add http:// to each url (there has to be a better way to do this)
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception as e:
        print("Error", url)

Challenge: improve the speed with multiprocessing.


With multiprocessing

But it is not working. I get the following error (note: I am not sure I have even implemented this correctly):

AttributeError: Can't get attribute 'checkurl' on <module '__main__' (built-in)>

-

import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(url):

    for url in urls:
        url =  'http://'+url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)

Solution

In this case your task is I/O-bound, not processor-bound: it takes far longer for a website to reply than it takes your CPU to loop once through your script (not counting the TCP request). This means that you won't get any speedup from running this task in parallel processes (which is what multiprocessing does). What you want is multithreading. The way to achieve this is with the little-documented, perhaps poorly named, multiprocessing.dummy:

import requests
from multiprocessing.dummy import Pool as ThreadPool

urls = ['https://www.python.org',
        'https://www.python.org/about/']

def get_status(url):
    r = requests.get(url)
    return r.status_code

if __name__ == "__main__":
    pool = ThreadPool(4)                  # make the pool of worker threads
    results = pool.map(get_status, urls)  # fetch the URLs, each in its own thread
    pool.close()                          # close the pool and wait for the work to finish
    pool.join()

See here for examples of multiprocessing vs. multithreading in Python.
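
To tie this back to the question, here is a minimal sketch (not part of the original answer) of how the same ThreadPool.map pattern could be applied to the url10.txt workflow: each worker handles a single URL, prepends http://, uses the 1-second timeout from the question, and returns a (status, url) pair instead of printing inside the worker. The file name and pool size are taken from the question; the helper name check_url is chosen here for illustration.

import requests
from multiprocessing.dummy import Pool as ThreadPool  # a thread pool, despite the module name

def check_url(url):
    # Fetch one URL and return (status_or_error, url).
    url = 'http://' + url                # prepend the scheme, as in the original script
    try:
        resp = requests.get(url, timeout=1)
        return resp.status_code, resp.url
    except requests.RequestException:
        return 'Error', url

if __name__ == "__main__":
    with open("url10.txt") as f:         # same input file as in the question
        urls = f.read().splitlines()

    pool = ThreadPool(20)                # 10-20 threads, as the question suggests
    results = pool.map(check_url, urls)  # each call receives a single URL
    pool.close()
    pool.join()

    for status, url in results:
        print(status, '->', url)         # e.g. 200 -> http://www.yahoo.com

Note that, unlike checkurlconnection in the question, the worker here operates on one URL per call; Pool.map takes care of handing each list element to a worker.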
