Is it possible to create nested RDDs in Apache Spark?

Question

I am trying to implement the k-nearest neighbors algorithm in Spark. I was wondering whether it is possible to work with nested RDDs, since that would make my life a lot easier. Consider the following code snippet.

public static void main(String[] args) {
    // ... setup code omitted; testData and trainData are JavaRDD<Vector> ...
    JavaRDD<Double> temp1 = testData.map(
        new Function<Vector, Double>() {
            public Double call(final Vector z) throws Exception {
                // Nested use: a second RDD (trainData) is transformed
                // inside the map over testData. This is the part in question.
                JavaRDD<Double> temp2 = trainData.map(
                    new Function<Vector, Double>() {
                        public Double call(Vector vector) throws Exception {
                            return (double) vector.length();
                        }
                    }
                );
                return (double) z.length();
            }
        }
    );
}

Currently I am getting an error with this nested setup (I can post the full log here). Is it allowed in the first place? Thanks.

Recommended answer

No, it is not possible. The items of an RDD must be serializable values, and an RDD cannot be used as one: transformations and actions on an RDD can only be invoked from the driver, not from inside another RDD's transformation. This makes sense, because otherwise a whole RDD might have to be transferred over the network, which is a problem if it contains a lot of data. And if it does not contain a lot of data, you can, and should, use an array or a similar local collection instead.
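As a rough illustration of the "use an array" suggestion, here is a minimal sketch in plain Java, with no Spark dependency. The class `LocalKnn` and its methods are hypothetical names, not part of the original post or of any Spark API, and it assumes the training points fit in memory and are represented as `double[]` rather than the `Vector` type used in the question.

```java
import java.util.Arrays;

// Hypothetical helper (not from the original post): brute-force k-NN
// against a training set held in a plain local array instead of an RDD.
final class LocalKnn {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Distances from `query` to its k nearest training points, ascending.
    // If k exceeds the training set size, all distances are returned.
    static double[] kNearestDistances(double[] query, double[][] train, int k) {
        double[] dists = new double[train.length];
        for (int i = 0; i < train.length; i++) {
            dists[i] = distance(query, train[i]);
        }
        Arrays.sort(dists);
        return Arrays.copyOf(dists, Math.min(k, dists.length));
    }
}
```

In a Spark job, one possible arrangement would be to obtain the array on the driver (for example from `trainData.collect()`, or wrapped in a broadcast variable) and call `LocalKnn.kNearestDistances` inside the function passed to `testData.map(...)`, so that only one RDD is involved in the transformation.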

However, I don't know how you are implementing k-nearest neighbors, so be careful: if you do something like computing the distance between every pair of points, the approach does not scale with the dataset size, because it is O(n^2).
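To make the quadratic growth concrete, the small sketch below counts how many distance computations an all-pairs approach needs for n points. The class `PairCount` is a hypothetical name introduced only for this illustration.

```java
// Hypothetical helper: counts the unordered pairs among n points,
// i.e. the number of distances an all-pairs approach has to compute.
final class PairCount {

    // n * (n - 1) / 2 pairs, which grows quadratically with n.
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }
}
```

For 1,000 points this gives 499,500 distances; for 1,000,000 points it gives 499,999,500,000, so a thousandfold increase in data means roughly a millionfold increase in work.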
