Multithreading in Python/BeautifulSoup scraping doesn't speed up at all


Question


I have a csv file ("SomeSiteValidURLs.csv") which lists all the links I need to scrape. The code is working and will go through the urls in the csv, scrape the information and record/save it in another csv file ("Output.csv"). However, since I am planning to do this for a large portion of the site (>10,000,000 pages), speed is important. For each link, it takes about 1s to crawl and save the info into the csv, which is too slow for the magnitude of the project. So I have incorporated the threading module, and to my surprise it doesn't speed things up at all; it still takes about 1s per link. Did I do something wrong? Is there another way to speed up the processing?

Without multithreading:

import urllib2
import csv
from bs4 import BeautifulSoup

def crawlToCSV(FileName):

    with open(FileName, "rb") as f:
        for URLrecords in f:

            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []

            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

crawlToCSV("SomeSiteValidURLs.csv")

With multithreading:

import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):

    with open(FileName, "rb") as f:
        for URLrecords in f:

            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []

            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

fileName = "SomeSiteValidURLs.csv"

if __name__ == "__main__":
    t = threading.Thread(target=crawlToCSV, args=(fileName, ))
    t.start()
    t.join()

Answer


You're not parallelizing this properly. What you actually want to do is have the work being done inside your for loop happen concurrently across many workers. Right now you're moving all the work into one background thread, which does the whole thing synchronously. That's not going to improve performance at all (it will just slightly hurt it, actually).


Here's an example that uses a ThreadPool to parallelize the network operation and parsing. It's not safe to try to write to the csv file across many threads at once, so instead we return the data that would have been written back to the parent, and have the parent write all the results to the file at the end.

import urllib2
import csv
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlToCSV(URLrecord):
    OpenSomeSiteURL = urllib2.urlopen(URLrecord)
    Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
    OpenSomeSiteURL.close()

    tbodyTags = Soup_SomeSite.find("tbody")
    trTags = tbodyTags.find_all("tr", class_="result-item ")

    placeHolder = []

    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.string
        placeHolder.append(tdTags_string)

    return placeHolder


if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        results = pool.map(crawlToCSV, f)  # results is a list of all the placeHolder lists returned from each call to crawlToCSV
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)


Note that in Python, threads only actually speed up I/O operations - because of the GIL, CPU-bound operations (like the parsing/searching BeautifulSoup is doing) can't actually be done in parallel via threads, because only one thread can do CPU-based operations at a time. So you still may not see the speed up you were hoping for with this approach. When you need to speed up CPU-bound operations in Python, you need to use multiple processes instead of threads. Luckily, you can easily see how this script performs with multiple processes instead of multiple threads; just change from multiprocessing.dummy import Pool to from multiprocessing import Pool. No other changes are required.
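
As a point of reference, here is a minimal sketch of that swap, reusing the crawlToCSV function and file names from the example above. The worker count of cpu_count() and the writerows call are choices of this sketch, not part of the original answer. With real processes the if __name__ == "__main__": guard matters, because on Windows the worker processes re-import the module.

import csv
from multiprocessing import Pool, cpu_count  # process-based Pool instead of multiprocessing.dummy

# crawlToCSV is defined exactly as in the ThreadPool example above.

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    # For CPU-bound parsing, one worker process per core is a reasonable starting point.
    pool = Pool(cpu_count())
    with open(fileName, "rb") as f:
        results = pool.map(crawlToCSV, f)  # each line of the input file is handed to a worker process
    with open("Output.csv", "ab") as f:
        csv.writer(f).writerows(results)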


If you need to scale this up to a file with 10,000,000 lines, you're going to need to adjust this code a bit - Pool.map converts the iterable you pass into it to a list prior to sending it off to your workers, which obviously isn't going to work very well with a 10,000,000 entry list; having that whole thing in memory is probably going to bog down your system. Same issue with storing all the results in a list. Instead, you should use Pool.imap:

imap(func, iterable[, chunksize])


A lazier version of map().


The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    FILE_LINES = 10000000
    NUM_WORKERS = cpu_count() * 2
    # Try to get a good chunksize. You're probably going to have to tweak this,
    # though. Try smaller and larger values and see how performance changes.
    chunksize = FILE_LINES // NUM_WORKERS * 4
    pool = Pool(NUM_WORKERS)

    # Keep the input file open while the results are consumed: imap reads it
    # lazily in the background, so it must not be closed early.
    with open(fileName, "rb") as f, open("Output.csv", "ab") as out:
        result_iter = pool.imap(crawlToCSV, f, chunksize)
        writeFile = csv.writer(out)
        for result in result_iter:  # lazily iterate over results.
            writeFile.writerow(result)


With imap, we never put all of f into memory at once, nor do we store all the results in memory at once. The most we ever have in memory is chunksize lines of f, which should be more manageable.
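
If the order of the rows in Output.csv doesn't have to match the order of the URLs in the input file, one further variation (not part of the original answer) is Pool.imap_unordered, which takes the same arguments but yields results in completion order, so a slow page doesn't hold up rows that are already finished:

# Same setup as above (pool, chunksize, crawlToCSV); rows are written
# in completion order rather than input order.
with open(fileName, "rb") as f, open("Output.csv", "ab") as out:
    writeFile = csv.writer(out)
    for result in pool.imap_unordered(crawlToCSV, f, chunksize):
        writeFile.writerow(result)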
