Dataframe sample in Apache Spark | Scala

Problem description

I'm trying to take samples from two dataframes while keeping the ratio of their row counts. For example:

df1.count() = 10
df2.count() = 1000

noOfSamples = 10

I want to sample the data in such a way that I get 10 samples of size 101 each (1 row from df1 and 100 rows from df2).

Right now I'm doing this:

var newSample = df1.sample(true, df1.count() / noOfSamples)
println(newSample.count())

What does the fraction here imply? Can it be greater than 1? I read up on this but wasn't able to comprehend it fully.

Also, is there any way to specify the number of rows to be sampled?

Recommended answer

The fraction parameter represents the approximate fraction of the dataset that will be returned. For instance, if you set it to 0.1, roughly 10% (1/10) of the rows will be returned. For your case, I believe you want to do the following:

val newSample = df1.sample(true, 1D*noOfSamples/df1.count)

However, you may notice that newSample.count returns a different number each time you run it. That's because the fraction acts as a threshold for a randomly generated value (see http://stackoverflow.com/questions/32229941/), so the size of the resulting dataset can vary. A workaround can be:

// limit expects an Int, while count returns a Long, hence the .toInt
val newSample = df1.sample(true, 2D*noOfSamples/df1.count).limit((df1.count/noOfSamples).toInt)
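
To see why the count varies from run to run, here is a minimal plain-Scala sketch of the thresholding idea. It is not Spark's implementation: `BernoulliDemo` and `bernoulliSample` are hypothetical names used only for illustration, and the sketch models the simpler without-replacement case, where each row is kept when a uniform random draw falls below the fraction.

```scala
import scala.util.Random

object BernoulliDemo {
  // Keep each row independently when a uniform draw in [0, 1) is below `fraction`.
  // The sample size is therefore random: only its expected value is fraction * rows.size.
  def bernoulliSample[A](rows: Seq[A], fraction: Double, rng: Random): Seq[A] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    rows.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val rows = 1 to 1000
    // Five runs with different seeds: sizes land near 100 but generally differ.
    val sizes = (1 to 5).map(seed => bernoulliSample(rows, 0.1, new Random(seed)).size)
    println(sizes.mkString(", "))
  }
}
```

Fixing the seed makes a run repeatable, but the size still only approximates fraction times the row count, which is why a limit is needed when an exact count matters.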

Regarding your questions:


Can it be greater than 1?

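Not when sampling without replacement: there the fraction is the probability of keeping each row, so it must be a decimal number between 0 and 1, and setting it to 1 brings back 100% of the rows. When sampling with replacement (the first argument is true, as in the examples above), the fraction is the expected number of times each row is drawn, so a value larger than 1 is accepted; this is how the RDD.sample documentation describes the parameter.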


Also, is there any way to specify the number of rows to be sampled?

You can specify a fraction larger than the one that would yield the number of rows you want, and then use limit, as in the second example. Maybe there is another way, but this is the approach I use.
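
The oversample-then-limit approach can be wrapped in a small helper. The sketch below is only an illustration under stated assumptions: `ExactSample`, `oversampleThenLimit`, `shuffleThenLimit`, the `factor` and `seed` parameters, and the `spark.range(1000)` stand-in for df2 are all hypothetical and not part of the original answer. The second function is a different technique (shuffle with `rand`, then `limit`), added as one possible "other way"; it needs a full sort, so it may be expensive on large data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.rand

object ExactSample {
  // Oversample without replacement, then trim to n rows.
  // Can still return fewer than n rows if the random draw falls short.
  def oversampleThenLimit(df: DataFrame, n: Int, factor: Double = 2.0, seed: Long = 42L): DataFrame = {
    val total = df.count()
    // Without replacement the fraction is capped at 1.0.
    val fraction = math.min(1.0, factor * n / total)
    df.sample(false, fraction, seed).limit(n)
  }

  // Alternative: order by a random column, then keep the first n rows.
  // Returns min(n, row count) rows, at the cost of sorting the whole dataset.
  def shuffleThenLimit(df: DataFrame, n: Int, seed: Long = 42L): DataFrame =
    df.orderBy(rand(seed)).limit(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("exact-sample").getOrCreate()
    val df2 = spark.range(1000).toDF("id") // stand-in for the asker's df2
    println(oversampleThenLimit(df2, 100).count())
    println(shuffleThenLimit(df2, 100).count())
    spark.stop()
  }
}
```

A larger `factor` makes a shortfall less likely in the first helper, at the cost of sampling more rows than are finally kept.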
