Is it possible to create nested RDDs in Apache Spark?
Question
I am trying to implement the K-nearest neighbor algorithm in Spark. I was wondering if it is possible to work with nested RDDs. This would make my life a lot easier. Consider the following code snippet.
public static void main(String[] args) {
    // blah blah code
    JavaRDD<Double> temp1 = testData.map(
        new Function<Vector, Double>() {
            public Double call(final Vector z) throws Exception {
                JavaRDD<Double> temp2 = trainData.map(
                    new Function<Vector, Double>() {
                        public Double call(Vector vector) throws Exception {
                            return (double) vector.length();
                        }
                    }
                );
                return (double) z.length();
            }
        }
    );
}
Currently I am getting an error with this nested setup (I can post the full log here). Is it allowed in the first place? Thanks
Answer
No, it is not possible, because the elements of an RDD must be serializable and an RDD itself is not serializable. This makes sense: otherwise you might transfer a whole RDD over the network, which is a problem if it contains a lot of data. And if it does not contain a lot of data, you can and should use an array or something like it.
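To make the "use an array" suggestion concrete: in Spark the usual idiom is to call `trainData.collect()` (or wrap the result in a broadcast variable) so that the training set becomes a plain local array, which the closure passed to `testData.map(...)` can then reference freely. The sketch below illustrates that shape without a Spark dependency, using plain Java arrays as a stand-in for the collected RDD; the class and method names are illustrative, not from the question.

```java
public class LocalKnnSketch {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // For each test point, find the nearest-neighbor distance over the
    // collected training array. The inner RDD from the question becomes
    // an ordinary loop over local data -- no nested RDD is needed.
    static double[] nearestDistances(double[][] testData, double[][] trainData) {
        double[] out = new double[testData.length];
        for (int i = 0; i < testData.length; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] t : trainData) {
                best = Math.min(best, distance(testData[i], t));
            }
            out[i] = best;
        }
        return out;
    }
}
```

In real Spark code, `trainData.collect()` on the driver (or a broadcast variable, which is shipped to each executor once) would produce such an array, and the closure capturing it is serialized to the executors along with each task.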
However, I don't know how you are implementing the K-nearest neighbor algorithm, but be careful: if you do something like computing the distance between every pair of points, this does not actually scale with the dataset size, because it is O(n²).