Links of a page and links of its subpages. Recursion/Threads

Question

I'm making a function that downloads the content of a website, then looks for the links on the page and recursively calls itself for each one, down to the 7th level. The problem is that this takes a lot of time, so I was looking at using a thread pool to manage these calls, but I don't know exactly how to divide the work into tasks for the thread pool.

This is my current code, without the thread pool.

import requests
import re

url = 'https://masdemx.com/category/creatividad/?fbclid=IwAR0G2AQa7QUzI-fsgRn3VOl5oejXKlC_JlfvUGBJf9xjQ4gcBsyHinYiOt8'


def searchLinks(url, level):
    # Stop recursing once the maximum depth is reached
    print("level: " + str(level))
    if level == 3:
        return 0

    # Download the page and pull every href out of it with a regex
    response = requests.get(url)
    enlaces = re.findall(r'<a href="(.*?)"', str(response.text))

    for en in enlaces:
        # Turn relative links ("/..." or "#...") into absolute ones
        if en[0] == "/" or en[0] == "#":
            en = url + en[1:]
        print(en)
        searchLinks(en, level + 1)


searchLinks(url, 1)

Answer

For starters, note that this could be a big operation. For example, if each page has an average of only 10 unique links, you're looking at over 10 million requests if you want to recurse 7 layers deep.
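
As a quick sanity check on that figure (just the arithmetic spelled out, not part of the original answer), the total number of requests is a geometric series:

# 10 links per page, recursing 7 levels deep: 10 + 10**2 + ... + 10**7 requests
print(sum(10 ** level for level in range(1, 8)))  # 11111110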

Also, I'd use an HTML parsing library like BeautifulSoup instead of regex, which is a brittle way to scrape HTML. Avoid printing to stdout, which also slows things down.
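
As a rough illustration of what that looks like (a minimal sketch, not the answer's full code, which follows further below; the URL is just a placeholder), extracting hrefs with BeautifulSoup avoids hand-rolled regexes entirely:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute the page you actually want to crawl
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

# find_all("a", href=True) returns only anchors that actually have an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)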

As for threading, one approach is to use a work queue. Python's queue.Queue class is thread-safe, so you can create a pool of worker threads that poll it to retrieve URLs. When a thread gets a URL, it finds all the links on the page and appends the relevant URLs (or page data, if you wish) to a global list (a thread-safe operation on CPython; for other implementations, use a lock on shared data structures). The new URLs are enqueued on the work queue and the process continues.

Threads exit when the level reaches 0. Since we're using a BFS rather than a stack-based DFS, the (probably safe) assumption here is that there are more links than levels of depth, so every worker eventually dequeues a level-0 item and exits.

The parallelism comes from threads blocking while they wait for request responses, which lets the CPU run another thread whose response has already arrived so it can do the HTML parsing and queueing work.

If you'd like to run on multiple cores to help parallelize the CPU-bound portion of the workload, read up on the GIL and look into spawning processes. But threading alone gets you much of the way to parallelization, since the bottleneck is I/O-bound (waiting on HTTP requests).
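
For illustration only (this is not part of the original answer), a minimal sketch of the process-spawning idea might hand the CPU-bound parsing to a ProcessPoolExecutor while the requests stay in the main thread; extract_links and the URL list here are hypothetical:

import concurrent.futures

import requests
from bs4 import BeautifulSoup


def extract_links(html):
    # Pure CPU work: parse the HTML and return the hrefs
    soup = BeautifulSoup(html, "lxml")
    return [a["href"] for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    urls = ["https://example.com", "https://example.org"]  # placeholder URLs
    pages = [requests.get(u).text for u in urls]  # the I/O-bound part

    # Each page is parsed in a separate process, sidestepping the GIL for the CPU-bound work
    with concurrent.futures.ProcessPoolExecutor() as pool:
        for url, links in zip(urls, pool.map(extract_links, pages)):
            print(url, len(links))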

Here's some example code:

import queue
import requests
import threading
import time
from bs4 import BeautifulSoup


def search_links(q, urls, seen):
    while 1:
        try:
            # q.get() blocks until a (url, level) pair is available on the work queue
            url, level = q.get()
        except queue.Empty:
            continue

        # A level of 0 means the maximum depth has been reached; let this worker exit
        if level <= 0:
            break

        try:
            soup = BeautifulSoup(requests.get(url).text, "lxml")

            for x in soup.find_all("a", href=True):
                link = x["href"]

                # Turn relative links ("/..." or "#...") into absolute ones
                if link and link[0] in "#/":
                    link = url + link[1:]

                # set.add and list.append are atomic on CPython, so no lock is needed here
                if link not in seen:
                    seen.add(link)
                    urls.append(link)
                    q.put((link, level - 1))
        except (requests.exceptions.InvalidSchema,
                requests.exceptions.ConnectionError):
            # Skip links that aren't fetchable HTTP(S) URLs
            pass


if __name__ == "__main__":
    levels = 2
    workers = 10
    start_url = "https://masdemx.com/category/creatividad/?fbclid=IwAR0G2AQa7QUzI-fsgRn3VOl5oejXKlC_JlfvUGBJf9xjQ4gcBsyHinYiOt8"
    seen = set()
    urls = []
    threads = []
    q = queue.Queue()
    q.put((start_url, levels))
    start = time.time()

    # Spawn a pool of daemon worker threads that all poll the same queue
    for _ in range(workers):
        t = threading.Thread(target=search_links, args=(q, urls, seen))
        threads.append(t)
        t.daemon = True
        t.start()

    for thread in threads:
        thread.join()

    print(f"Found {len(urls)} URLs using {workers} workers "
          f"{levels} levels deep in {time.time() - start}s")

Here are a few sample runs on my not-especially-fast machine:

$ python thread_req.py
Found 762 URLs using 15 workers 2 levels deep in 33.625585317611694s
$ python thread_req.py
Found 762 URLs using 10 workers 2 levels deep in 42.211519956588745s
$ python thread_req.py
Found 762 URLs using 1 workers 2 levels deep in 105.16120409965515s

That's a 3x performance boost on this small run. I ran into maximum request errors on larger runs, so this is just a toy example.
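
If you do push it harder, one common mitigation (my own suggestion, not something the original answer does) is to reuse a requests.Session with a retry-enabled adapter instead of calling requests.get directly; the retry settings below are arbitrary:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A session with connection pooling and automatic retries (arbitrary settings)
session = requests.Session()
retries = Retry(total=3, backoff_factor=0.5,
                status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retries, pool_maxsize=20)
session.mount("https://", adapter)
session.mount("http://", adapter)

# Each worker could build its own session like this and call session.get(url)
# instead of requests.get(url)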
