Spark Sampling - How much faster is it than using the full RDD/DataFrame

Views: 31 | Published: 2021/11/14 23:31:08 | Tags: java apache-spark apache-spark-sql

Problem description

I'm wondering how Spark's runtime when sampling an RDD/DataFrame compares with its runtime on the full RDD/DataFrame. I don't know whether it makes a difference, but I'm currently using Java + Spark 1.5.1 + Hadoop 2.6.

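// Parse each text line into a Row; the schema is assumed to have 4 integer columns.
JavaRDD<Row> rdd = sc.textFile(HdfsDirectoryPath()).map(new Function<String, Row>() {
    @Override
    public Row call(String line) throws Exception {
        String[] fields = line.split(usedSeparator);
        Object[] values = new Object[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Integer.parseInt(fields[i].trim()); // match the integer columns of the schema
        }
        return RowFactory.create(values);
    }
});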

// Full data set: build the DataFrame and select every row.
DataFrame df = sqlContext.createDataFrame(rdd, schema);
df.registerTempTable("df");
DataFrame selectdf = sqlContext.sql("Select * from df");
Row[] res = selectdf.collect();

// Sampled data set: keep roughly 10% of the rows, sampled without replacement.
DataFrame sampleddf = sqlContext.createDataFrame(rdd, schema).sample(false, 0.1);
sampleddf.registerTempTable("sampledf");
DataFrame selecteSampledf = sqlContext.sql("Select * from sampledf");
res = selecteSampledf.collect();

I would expect the sampled query to be close to ~90% faster in the best case. But it looks to me as if Spark goes through the whole DataFrame, or does a count, which takes nearly the same time as the select on the full DataFrame. Only after the sample is generated does it execute the select.

Am I correct in this assumption, or am I using sampling incorrectly, so that I end up with the same runtime for both selects?
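
The observation in the question can be illustrated with a toy model. The sketch below is plain Java and is not Spark's implementation; the class name SamplingScanSketch and the method sampleAndCount are hypothetical names chosen for this illustration. It assumes sampling without replacement behaves like a per-record coin flip with probability equal to the fraction, and it counts how many records have to be read versus how many are kept. Under that assumption, the read-and-parse work is the same as for the full data set, and only the work after the sample shrinks.

```java
import java.util.Random;

// Toy model only: this is NOT Spark's implementation. It mimics per-record
// Bernoulli sampling (keep each record with probability `fraction`) to show
// that the sampler still has to pull every input record from its source.
class SamplingScanSketch {

    // Returns {recordsRead, recordsKept} for a hypothetical input of n records.
    static long[] sampleAndCount(int n, double fraction, long seed) {
        Random rng = new Random(seed);
        long read = 0;
        long kept = 0;
        for (int i = 0; i < n; i++) {
            read++; // stands in for reading and parsing one input line
            if (rng.nextDouble() < fraction) {
                kept++; // only these records reach the later stages
            }
        }
        return new long[] {read, kept};
    }

    public static void main(String[] args) {
        long[] counts = sampleAndCount(1000000, 0.1, 42L);
        System.out.println("records read: " + counts[0]);
        System.out.println("records kept: " + counts[1]);
    }
}
```

In this model the number of records read always equals the input size, while the number kept is only about 10% of it. If reading and parsing the text file dominates the job, a sampled select would therefore be expected to take nearly as long as the full one, which matches what the question describes.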
