Spark Sampling - How much faster is it than using the full RDD/DataFrame

Question

I'm wondering how the runtime of a Spark job on a sampled RDD/DataFrame compares with the runtime on the full RDD/DataFrame. I don't know if it makes a difference, but I'm currently using Java + Spark 1.5.1 + Hadoop 2.6.

JavaRDD<Row> rdd = sc.textFile(HdfsDirectoryPath()).map(new Function<String, Row>() {
        @Override
        public Row call(String line) throws Exception {
            String[] fields = line.split(usedSeparator);
            // Assume that the schema has 4 integer columns, so parse each field to an Integer
            Object[] values = new Object[fields.length];
            for (int i = 0; i < fields.length; i++) {
                values[i] = Integer.valueOf(fields[i].trim());
            }
            return new GenericRowWithSchema(values, schema);
        }
    });

DataFrame df = sqlContext.createDataFrame(rdd, schema);
df.registerTempTable("df");
DataFrame selectdf = sqlContext.sql("Select * from df");
Row[] res = selectdf.collect();

DataFrame sampleddf = sqlContext.createDataFrame(rdd, schema).sample(false, 0.1); // 10% of the original DS
sampleddf.registerTempTable("sampledf");
DataFrame selecteSampledf = sqlContext.sql("Select * from sampledf");
res = selecteSampledf.collect();

I would expect the query on the sample to be close to 90% faster in the best case. But it looks to me as though Spark goes through the whole DF, or does a count, which takes nearly the same time as the select on the full DF. Only after the sample is generated does it execute the select.

Am I correct in this assumption, or am I using sampling in the wrong way, so that both selects end up needing the same runtime?

Recommended answer

I would expect that the sampling is optimally close to ~90% faster.

Well, there are a few reasons why these expectations are unrealistic:

  • without any prior assumptions about the data distribution, obtaining a uniform sample requires a full dataset scan. This is more or less what happens when you use the sample or takeSample methods in Spark
  • SELECT * is a relatively lightweight operation. Depending on the amount of resources you have, the time to process a single partition can be negligible
  • sampling doesn't reduce the number of partitions. If you don't coalesce or repartition, you can end up with a large number of almost empty partitions, which means suboptimal resource usage
  • while RNGs are usually quite efficient, generating random numbers is not free
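
The first and third points can be illustrated without Spark at all. The following is a minimal plain-Java sketch, not Spark's actual implementation. It assumes Bernoulli sampling (each row is kept independently with probability `fraction`), models partitions as in-memory lists, and uses hypothetical names (`BernoulliSampleSketch`, `sample`, `makePartitions`) chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only: partitions are modelled as plain lists,
// and sampling is a simple Bernoulli trial per row.
class BernoulliSampleSketch {
    // Number of rows visited by the most recent call to sample().
    static long rowsVisited = 0;

    // Keeps each row independently with probability `fraction`.
    // Every row of every partition is visited, and exactly one output
    // partition is produced per input partition.
    static List<List<Integer>> sample(List<List<Integer>> partitions, double fraction, long seed) {
        Random rng = new Random(seed);
        rowsVisited = 0;
        List<List<Integer>> result = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            List<Integer> kept = new ArrayList<>();
            for (Integer row : partition) {
                rowsVisited++;                      // the scan touches every row
                if (rng.nextDouble() < fraction) {  // one random draw per row
                    kept.add(row);
                }
            }
            result.add(kept);                       // possibly small or empty
        }
        return result;
    }

    // Builds `numPartitions` partitions holding `rowsPerPartition` rows each.
    static List<List<Integer>> makePartitions(int numPartitions, int rowsPerPartition) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<Integer> rows = new ArrayList<>();
            for (int r = 0; r < rowsPerPartition; r++) {
                rows.add(p * rowsPerPartition + r);
            }
            partitions.add(rows);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<Integer>> full = makePartitions(100, 1000);
        List<List<Integer>> sampled = sample(full, 0.1, 42L);
        long keptRows = 0;
        for (List<Integer> partition : sampled) {
            keptRows += partition.size();
        }
        System.out.println("rows visited: " + rowsVisited);
        System.out.println("rows kept: " + keptRows);
        System.out.println("partitions before/after: " + full.size() + "/" + sampled.size());
    }
}
```

In this model the scan always visits all 100,000 rows and returns 100 partitions, while only roughly 10% of the rows are kept. The cost of producing the sample is therefore proportional to the full dataset, not to the sample.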

There are at least two important benefits of sampling:

  • lower memory usage and less work for the garbage collector
  • less data to serialize/deserialize and transfer in case of shuffling or collecting

If you want to get the most from sampling, it makes sense to sample, coalesce, and cache.
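
To make the sample, coalesce, and cache advice concrete, here is a small plain-Java model of the idea. It is a sketch under stated assumptions, not Spark code: partitions are in-memory lists, `coalesce` here merges partitions round-robin (a simplification), "caching" is modelled as simply keeping the materialized result in a variable, and all class and method names are hypothetical. In Spark itself the corresponding DataFrame methods are `sample`, `coalesce`, and `cache`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical model, not Spark code: a dataset is a list of partitions,
// and every full pass over the source rows is counted.
class SampleCoalesceCacheSketch {
    static int sourceScans = 0;

    // Bernoulli sample: one full pass over the source per call.
    static List<List<Integer>> sample(List<List<Integer>> source, double fraction, long seed) {
        sourceScans++;
        Random rng = new Random(seed);
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> partition : source) {
            List<Integer> kept = new ArrayList<>();
            for (Integer row : partition) {
                if (rng.nextDouble() < fraction) {
                    kept.add(row);
                }
            }
            out.add(kept);
        }
        return out;
    }

    // Merges partitions round-robin into at most `target` partitions.
    // All rows are preserved; the partition count is never increased.
    static List<List<Integer>> coalesce(List<List<Integer>> partitions, int target) {
        if (target < 1) {
            throw new IllegalArgumentException("target must be at least 1");
        }
        int n = Math.min(target, partitions.size());
        List<List<Integer>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(new ArrayList<>());
        }
        for (int i = 0; i < partitions.size(); i++) {
            out.get(i % n).addAll(partitions.get(i));
        }
        return out;
    }

    static long countRows(List<List<Integer>> partitions) {
        long total = 0;
        for (List<Integer> partition : partitions) {
            total += partition.size();
        }
        return total;
    }

    static List<List<Integer>> makePartitions(int numPartitions, int rowsPerPartition) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<Integer> rows = new ArrayList<>();
            for (int r = 0; r < rowsPerPartition; r++) {
                rows.add(p * rowsPerPartition + r);
            }
            partitions.add(rows);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<Integer>> source = makePartitions(100, 1000);

        // Sample once, shrink the partition count, and keep the result around.
        List<List<Integer>> cached = coalesce(sample(source, 0.1, 42L), 10);

        // Two "queries" reuse the kept result; the source is not scanned again.
        long first = countRows(cached);
        long second = countRows(cached);

        System.out.println("partitions after coalesce: " + cached.size());
        System.out.println("rows per query: " + first + " and " + second);
        System.out.println("source scans: " + sourceScans);
    }
}
```

In this model the expensive full scan happens once, the roughly 10,000 sampled rows end up in 10 reasonably filled partitions instead of 100 sparse ones, and every later query works only on the small materialized sample. If the sample were recomputed for each query instead, each query would pay for another pass over the source.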
