Apache Spark: Applying a function from sklearn in parallel on partitions
Question
I'm new to Big Data and Apache Spark (and an undergrad doing work under a supervisor).
Is it possible to apply a function (i.e. a spline) to only partitions of the RDD? I'm trying to implement some of the work in the paper here.
The book "Learning Spark" seems to indicate that this is possible, but doesn't explain how.
"If you instead have many small datasets on which you want to train different learning models, it would be better to use a single-node learning library (e.g., Weka or SciKit-Learn) on each node, perhaps calling it in parallel across nodes using a Spark map()."
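Yes, this is possible: `RDD.mapPartitions` runs a function once per partition, receiving that partition's records as an iterator. Below is a minimal sketch of the pattern the book describes, simulated locally without Spark. The `fit_partition` name and the cubic `np.polyfit` stand-in for a spline fit are illustrative assumptions, not anything from the original post; in real Spark code you would simply call `rdd.mapPartitions(fit_partition)`.

```python
import numpy as np

def fit_partition(rows):
    """Fit one model per partition.

    `rows` is an iterator of (x, y) pairs, exactly what Spark's
    mapPartitions hands the function. A cubic polynomial fit stands
    in here for whatever single-node routine (spline, sklearn
    estimator, ...) you want to apply.
    """
    xs, ys = zip(*rows)
    coeffs = np.polyfit(xs, ys, deg=3)
    yield coeffs  # mapPartitions expects an iterable back

# Simulate two partitions locally; with Spark this would instead be:
#   models = rdd.mapPartitions(fit_partition).collect()
partitions = [
    [(x, float(x) ** 3) for x in range(10)],        # y = x^3
    [(x, 2.0 * float(x) ** 3) for x in range(10)],  # y = 2x^3
]
models = [next(fit_partition(p)) for p in partitions]
```

Each partition yields its own coefficient vector, so every partition's dataset gets an independently trained model, which is the "many small datasets, one model each" setup the quote describes.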
Solution

Actually, we have a library which does exactly that: several sklearn transformers and predictors are up and running. Its name is sparkit-learn.
From our examples:

import numpy as np
from splearn.rdd import DictRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.svm import SparkLinearSVC
from splearn.pipeline import SparkPipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

X = [...]  # list of texts
y = [...]  # list of labels

# sc is an existing SparkContext; spread the data over 4 partitions
# and zip features and labels together into a DictRDD.
X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

# Plain sklearn pipeline, runs on a single node...
local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
# ...and its distributed counterpart built from the Spark* estimators.
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))

local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))

y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])
You can find it here.