Apache Spark: Applying a function from sklearn in parallel on partitions


Question


I'm new to Big Data and Apache Spark (and an undergrad doing work under a supervisor).

Is it possible to apply a function (i.e. a spline) to only partitions of the RDD? I'm trying to implement some of the work in the paper here.

The book "Learning Spark" seems to indicate that this is possible, but doesn't explain how.

"If you instead have many small datasets on which you want to train different learning models, it would be better to use a single-node learning library (e.g., Weka or SciKit-Learn) on each node, perhaps calling it in parallel across nodes using a Spark map()."
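The pattern the book alludes to can be sketched with `RDD.mapPartitions`: each partition's records are handed to a plain Python function as an iterator, inside which a single-node sklearn model is fit. This is a minimal illustration, not from the original post; the function name `fit_partition` and the choice of `LogisticRegression` are my own, and it assumes each partition holds the `(features, label)` pairs of one independent small dataset.

```python
# Sketch: train one scikit-learn model per RDD partition.
# Assumes each partition contains (features, label) pairs forming
# one independent small dataset.
from sklearn.linear_model import LogisticRegression

def fit_partition(records):
    """Fit a single-node sklearn model on one partition's records.

    `records` is the iterator Spark passes to mapPartitions; the
    return value must also be an iterator (here: zero or one model).
    """
    X, y = [], []
    for features, label in records:
        X.append(features)
        y.append(label)
    if not X:  # an empty partition yields no model
        return iter([])
    model = LogisticRegression()
    model.fit(X, y)
    return iter([model])

# With a live SparkContext `sc`, usage would look like:
#   data_rdd = sc.parallelize(pairs, numSlices=4)
#   models = data_rdd.mapPartitions(fit_partition).collect()
```

Because `fit_partition` only sees a plain iterator, it can be tested locally without a cluster, and the same idea works for any per-partition fit, e.g. a spline.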

Solution

Actually, we have a library which does exactly that. We have several sklearn transformers and predictors up and running. Its name is sparkit-learn.
From our examples:

import numpy as np

from splearn.rdd import DictRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.svm import SparkLinearSVC
from splearn.pipeline import SparkPipeline

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

X = [...]  # list of texts  
y = [...]  # list of labels  
X_rdd = sc.parallelize(X, 4)  # 4 partitions; assumes an existing SparkContext `sc`
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))

local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))

y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])

You can find it here.

