Apache Spark: Applying a function from sklearn in parallel on partitions
Question
I'm new to Big Data and Apache Spark (and an undergrad doing work under a supervisor).
Is it possible to apply a function (i.e. a spline) to only partitions of the RDD? I'm trying to implement some of the work in the paper here.
The book "Learning Spark" seems to indicate that this is possible, but doesn't explain how.
"If you instead have many small datasets on which you want to train different learning models, it would be better to use a single-node learning library (e.g., Weka or SciKit-Learn) on each node, perhaps calling it in parallel across nodes using a Spark map()."
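Yes, this is possible: `RDD.mapPartitions` runs a function once per partition, receiving that partition's records as an iterator. Below is a minimal sketch of the pattern the book describes, simulated locally without Spark. The `fit_partition` name and the cubic `np.polyfit` stand-in for a spline fit are illustrative assumptions, not anything from the original post; in real Spark code you would simply call `rdd.mapPartitions(fit_partition)`.

```python
import numpy as np

def fit_partition(rows):
    """Fit one model per partition.

    `rows` is an iterator of (x, y) pairs, exactly what Spark's
    mapPartitions hands the function. A cubic polynomial fit stands
    in here for whatever single-node routine (spline, sklearn
    estimator, ...) you want to apply.
    """
    xs, ys = zip(*rows)
    coeffs = np.polyfit(xs, ys, deg=3)
    yield coeffs  # mapPartitions expects an iterable back

# Simulate two partitions locally; with Spark this would instead be:
#   models = rdd.mapPartitions(fit_partition).collect()
partitions = [
    [(x, float(x) ** 3) for x in range(10)],        # y = x^3
    [(x, 2.0 * float(x) ** 3) for x in range(10)],  # y = 2x^3
]
models = [next(fit_partition(p)) for p in partitions]
```

Each partition yields its own coefficient vector, so every partition's dataset gets an independently trained model, which is the "many small datasets, one model each" setup the quote describes.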
Solution

Actually, we have a library which does exactly that: several sklearn transformers and predictors are up and running. Its name is sparkit-learn.
From our examples:

import numpy as np
from splearn.rdd import DictRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.svm import SparkLinearSVC
from splearn.pipeline import SparkPipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

X = [...]  # list of texts
y = [...]  # list of labels

# sc is an existing SparkContext; spread the data over 4 partitions
# and zip features and labels together into a DictRDD.
X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

# Plain sklearn pipeline, runs on a single node...
local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
# ...and its distributed counterpart built from the Spark* estimators.
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))

local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))

y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])
You can find it here.