How to load a file in each executor once?


Problem Description

I defined the following code in order to load a pretrained embedding model:

from gensim.models.fasttext import FastText as FT_gensim
import numpy as np

class Loader(object):
    # Class-level state, shared by every Loader instance in one Python process
    cache = {}
    emb_dic = {}
    count = 0

    def __init__(self, filename):
        print("|-------------------------------------|")
        print("Welcome to Loader class in python")
        print("|-------------------------------------|")
        self.fn = filename

    @property
    def fasttext(self):
        if Loader.count == 1:
            print("already loaded")
        if self.fn not in Loader.cache:
            # Load the pretrained model only if this process has not seen it yet
            Loader.cache[self.fn] = FT_gensim.load_fasttext_format(self.fn)
            Loader.count = Loader.count + 1
        return Loader.cache[self.fn]

    def map(self, word):
        if word not in self.fasttext:
            # Out-of-vocabulary word: substitute a random 300-dimensional vector
            Loader.emb_dic[word] = np.random.uniform(low=0.0, high=1.0, size=300)
            return Loader.emb_dic[word]
        return self.fasttext[word]

I call this class as follows:

inputRaw = sc.textFile(inputFile, 3).map(lambda line: (line.split("\t")[0], line.split("\t")[1])).map(Loader(modelpath).map)

  1. I am confused about how many times the modelpath file will be loaded. I would like it to be loaded once per executor and then used by all of its cores. My own answer is that modelpath will be loaded 3 times (= the number of partitions). If that answer is correct, the drawback of this design is tied to the size of the modelpath file. Suppose the file is 10 GB and I have 200 partitions: in that case we would need 10 GB * 200 = 2000 GB, which is huge (so this solution only works with a small number of partitions). A sketch of the once-per-executor pattern is shown below.
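
Not part of the original question, but a minimal sketch of how one might get once-per-executor-process loading, assuming modelpath, inputFile and sc as defined above: process each partition with mapPartitions and rely on the class-level Loader.cache, so the model file is read from disk at most once per executor Python process no matter how many partitions that process handles. The embed_partition helper is a hypothetical name:

def embed_partition(records):
    # Constructing Loader here is cheap: Loader.cache is reused, so the
    # model file is loaded at most once per executor Python process
    loader = Loader(modelpath)
    for doc_id, sentence in records:
        yield (doc_id, [loader.map(w) for w in sentence.split(" ")])

inputRaw = (sc.textFile(inputFile, 3)
              .map(lambda line: (line.split("\t")[0], line.split("\t")[1]))
              .mapPartitions(embed_partition))

The point of the change is that the per-record map over words is replaced by one Loader construction per partition, while the cache keeps the actual disk load at one per process.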

Suppose I have an rdd = (id, sentence) = [(id1, u'patina californian'), (id2, u'virgil american'), (id3, u'frensh'), (id4, u'american')]

and I want to sum up the embedding word vectors for each sentence:

def test(document):
    print("document is = {}".format(document))
    documentWords = document.split(" ")
    features = np.zeros(300)
    for word in documentWords:
        # Loader.cache ensures the model file is read at most once per process
        features = np.add(features, Loader(modelpath).fasttext[word])
    return features

def calltest(inputRawSource):
    my_rdd = inputRawSource.map(lambda line: (line[0], test(line[1]))).cache()
    return my_rdd

In this case, how many times will the modelpath file be loaded? Note that I set spark.executor.instances to 3.

Answer

By default, the number of partitions is set to the total number of cores on all the executor nodes in the Spark cluster. Suppose you are processing 10 GB on a Spark cluster (or supercomputing executor) that contains a total of 200 CPU cores; that means Spark might use 200 partitions, by default, to process your data.
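
As an illustration (not from the original answer), you can inspect or override the partition count on the RDD itself; note that for textFile the effective default also depends on the input's splits and spark.default.parallelism:

rdd = sc.textFile(inputFile)
print(rdd.getNumPartitions())   # effective partition count for this input
rdd = rdd.repartition(200)      # or pin the count explicitly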

Also, to keep all the CPU cores of each executor busy, this can be solved in Python (using 100% of all cores with the multiprocessing module), as sketched below.
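
A minimal sketch of that idea, under the assumption that the Loader class above is available on the executors; embed_word and embed_partition_parallel are hypothetical helpers, not from the original answer. On Linux, forked Pool workers inherit an already-populated Loader.cache, so the model is not reloaded per worker:

from multiprocessing import Pool

def embed_word(word):
    # Reuses the process-level Loader.cache (inherited by forked workers)
    return Loader(modelpath).map(word)

def embed_partition_parallel(records):
    Loader(modelpath).fasttext          # load the model once in the parent
    with Pool() as pool:                # one worker per local CPU core by default
        for doc_id, sentence in records:
            vectors = pool.map(embed_word, sentence.split(" "))
            yield (doc_id, np.sum(vectors, axis=0))

my_rdd = inputRawSource.mapPartitions(embed_partition_parallel)

Whether this actually helps depends on how many cores Spark already assigns to each executor; spawning a Pool inside a task is only worthwhile when the executor has idle cores.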

