RDD sample in Spark


Question

How does RDD sample work in Spark? What do its different parameters do, i.e. sample(withReplacement, fraction, seed)?

I could not find anything relevant on the web about the 'withReplacement' and 'seed' parameters. Please explain with an example.

Recommended answer

fraction and seed are fairly easy to guess. fraction is the proportion of elements you want to see in your sample: a fraction of 0.5 gives you a sample containing roughly half of the elements of the initial RDD (this is the expected size, not a guaranteed exact count). seed is the seed for the random number generator. It matters because you may want to hard-code the same seed in your tests so that they always produce the same results, while in production code you replace it with the current time in milliseconds or a random number from a good entropy source.
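To make the roles of fraction and seed concrete, here is a minimal plain-Python sketch. It is not Spark code: the helper name bernoulli_sample is hypothetical, and it only assumes that sampling without replacement keeps each element independently with probability fraction.

```python
import random

def bernoulli_sample(items, fraction, seed):
    """Hypothetical helper: keep each element independently with
    probability `fraction`, using a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]

data = list(range(1000))

a = bernoulli_sample(data, 0.5, seed=42)
b = bernoulli_sample(data, 0.5, seed=42)  # same seed as `a`
c = bernoulli_sample(data, 0.5, seed=7)   # different seed

print(a == b)          # True: the same seed reproduces the same sample
print(len(a), len(c))  # both close to 500, but not necessarily exactly 500
```

The same seed always reproduces the same sample, which is what makes a hard-coded seed useful in tests, while the sample size only hovers around fraction times the input size.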

Sampling with replacement is a Google search away, e.g. https://www.ma.utexas.edu/users/parker/sampling/repl.htm. In short, if you sample with replacement, the same element can appear in the sample more than once; without replacement, it can appear at most once. So if your RDD has [Bob, Alice, Carol], then a "with replacement" sample can be [Alice, Alice], but a sample without replacement can't have duplicates like that.
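The difference can be sketched in plain Python as well. This is an illustration under stated assumptions, not Spark's implementation: the function names sample and poisson are hypothetical, sampling without replacement is modelled as one keep-or-drop draw per element, and sampling with replacement is modelled as a Poisson-distributed number of copies per element, with fraction as the expected number of copies.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample(items, with_replacement, fraction, seed):
    """Hypothetical stand-in mirroring the parameter order of
    sample(withReplacement, fraction, seed)."""
    rng = random.Random(seed)
    out = []
    for x in items:
        if with_replacement:
            copies = poisson(rng, fraction)  # 0, 1, 2, ... copies
        else:
            copies = 1 if rng.random() < fraction else 0  # at most one copy
        out.extend([x] * copies)
    return out

names = ["Bob", "Alice", "Carol"]
for seed in range(5):
    print(seed, sample(names, True, 1.0, seed), sample(names, False, 0.5, seed))
```

Across the seeds, the with-replacement column can contain repeated names, while the without-replacement column never repeats a name that occurs only once in the input.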

