Spark Sampling - How much faster is it than using the full RDD/DataFrame
Question
I'm wondering how Spark's runtime when sampling an RDD/DF compares with its runtime on the full RDD/DF. I don't know if it makes a difference, but I'm currently using Java + Spark 1.5.1 + Hadoop 2.6.
// Build an RDD of Rows from delimited text files on HDFS.
JavaRDD<Row> rdd = sc.textFile(HdfsDirectoryPath()).map(new Function<String, Row>() {
    @Override
    public Row call(String line) throws Exception {
        String[] fields = line.split(usedSeparator);
        // Assume that the schema has 4 integer columns
        GenericRowWithSchema row = new GenericRowWithSchema(fields, schema);
        return row;
    }
});

// Select on the full DataFrame.
DataFrame df = sqlContext.createDataFrame(rdd, schema);
df.registerTempTable("df");
DataFrame selectdf = sqlContext.sql("Select * from df");
Row[] res = selectdf.collect();

// Select on a 10% sample of the original dataset.
DataFrame sampleddf = sqlContext.createDataFrame(rdd, schema).sample(false, 0.1);
sampleddf.registerTempTable("sampledf");
DataFrame selecteSampledf = sqlContext.sql("Select * from sampledf");
res = selecteSampledf.collect();
I would expect that the sampling is optimally close to ~90% faster. But to me it looks like Spark goes through the whole DF or does a count, which takes nearly the same time as the select on the full DF. Only after the sample has been generated does it execute the select.
Is this assumption correct, or am I using sampling in a wrong way, so that I end up with the same runtime for both selects?
Recommended answer
I would expect that the sampling is optimally close to ~90% faster.
Well, there are a few reasons why these expectations are unrealistic:
- Without any prior assumptions about the data distribution, you have to perform a full dataset scan to obtain a uniform sample. This is more or less what happens when you use the sample or takeSample methods in Spark.
- SELECT * is a relatively lightweight operation. Depending on the amount of resources you have, the time to process a single partition can be negligible.
- Sampling doesn't reduce the number of partitions. If you don't coalesce or repartition, you can end up with a large number of almost empty partitions, which means suboptimal resource usage.
- While RNGs are usually quite efficient, generating random numbers is not free.
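The points about the full scan, the unchanged partition count, and the RNG cost can be illustrated with a small plain-Java model. This is only a sketch under stated assumptions and not Spark's actual implementation: partitions are modelled as in-memory lists, each element is kept independently with probability equal to the fraction, and the names SamplingScanModel, sample, makePartitions, and inspected are hypothetical, made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical model of per-element (Bernoulli) sampling over partitions.
// It is not Spark code; it only mimics the access pattern described above.
class SamplingScanModel {
    /** Number of elements inspected by the most recent call to sample(). */
    static long inspected = 0;

    /** Keeps each element independently with probability `fraction`. */
    static List<List<Integer>> sample(List<List<Integer>> partitions, double fraction, long seed) {
        Random rng = new Random(seed);
        inspected = 0;
        List<List<Integer>> result = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            List<Integer> kept = new ArrayList<>();
            for (Integer value : partition) {
                inspected++;                       // every element is visited ...
                if (rng.nextDouble() < fraction) { // ... and costs one random draw
                    kept.add(value);
                }
            }
            result.add(kept); // one output partition per input partition
        }
        return result;
    }

    /** Builds `numPartitions` partitions holding `perPartition` integers each. */
    static List<List<Integer>> makePartitions(int numPartitions, int perPartition) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<Integer> partition = new ArrayList<>();
            for (int i = 0; i < perPartition; i++) {
                partition.add(p * perPartition + i);
            }
            partitions.add(partition);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<Integer>> full = makePartitions(100, 1000);
        List<List<Integer>> sampled = sample(full, 0.1, 42L);
        long kept = 0;
        for (List<Integer> partition : sampled) {
            kept += partition.size();
        }
        System.out.println("partitions before/after: " + full.size() + "/" + sampled.size());
        System.out.println("elements inspected: " + inspected);
        System.out.println("elements kept: " + kept);
    }
}
```

In this model the sample ends up holding roughly a tenth of the elements, yet all of them were inspected and the number of partitions is the same as before, so the scan itself does not get cheaper.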
There are at least two important benefits of sampling:
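- lower memory usage, including less work for the garbage collector
- less data to serialize / deserialize and transfer in case of a shuffle or a collect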
If you want to get the most out of sampling, it makes sense to sample, coalesce, and cache.
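In the Spark 1.5 Java API the corresponding DataFrame calls are sample(false, 0.1), coalesce(numPartitions), and cache(). What the coalesce step buys you can be sketched with a plain-Java model that merges many nearly empty partitions into a few fuller ones. This is an illustration only: CoalesceModel and its coalesce method are hypothetical names, and the simple "merge adjacent partitions" rule is an assumption that ignores the data locality Spark takes into account.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of coalescing partitions; not Spark's implementation.
class CoalesceModel {
    /** Merges adjacent partitions into at most `target` partitions, preserving element order. */
    static <T> List<List<T>> coalesce(List<List<T>> partitions, int target) {
        if (target <= 0) {
            throw new IllegalArgumentException("target must be positive");
        }
        int n = partitions.size();
        int groups = Math.min(target, n); // coalescing never increases the partition count
        List<List<T>> result = new ArrayList<>();
        for (int g = 0; g < groups; g++) {
            // Group g takes the input partitions with index in [start, end).
            int start = (int) ((long) g * n / groups);
            int end = (int) ((long) (g + 1) * n / groups);
            List<T> merged = new ArrayList<>();
            for (int i = start; i < end; i++) {
                merged.addAll(partitions.get(i));
            }
            result.add(merged);
        }
        return result;
    }

    public static void main(String[] args) {
        // 100 small partitions, as they might look after a 10% sample.
        List<List<Integer>> sampled = new ArrayList<>();
        for (int p = 0; p < 100; p++) {
            List<Integer> partition = new ArrayList<>();
            for (int i = 0; i < 10; i++) {
                partition.add(p * 10 + i);
            }
            sampled.add(partition);
        }
        // Keeping a reference to `compact` and reusing it plays the role of cache():
        // later queries read the small result instead of repeating the sampling scan.
        List<List<Integer>> compact = coalesce(sampled, 10);
        System.out.println("partitions after coalesce: " + compact.size());
        System.out.println("elements in first partition: " + compact.get(0).size());
    }
}
```

The full scan is still paid once when the sample is first materialized. The gain comes afterwards, when repeated queries run over a few compact partitions that are already in memory.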