Spark数据帧随机分割 [英] Spark Data Frame Random Splitting

查看:66
本文介绍了Spark数据帧随机分割的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个火花数据框,我想按0.60、0.20、0.20的比例划分为训练,验证和测试.

I have a spark data frame which I want to divide into train, validation and test in the ratio 0.60, 0.20,0.20.

我将以下代码用于同一代码:

I used the following code for the same:

def data_split(x):
    global data_map_var
    d_map = data_map_var.value
    data_row = x.asDict()
    import random
    rand = random.uniform(0.0,1.0)
    ret_list = ()
    if rand <= 0.6:
        ret_list = (data_row['TRANS'] , d_map[data_row['ITEM']] , data_row['Ratings'] , 'train')
    elif rand <=0.8:
        ret_list = (data_row['TRANS'] , d_map[data_row['ITEM']] , data_row['Ratings'] , 'test')
    else:
        ret_list = (data_row['TRANS'] , d_map[data_row['ITEM']] , data_row['Ratings'] , 'validation')
    return ret_list
​
​
split_sdf = ratings_sdf.map(data_split)
train_sdf = split_sdf.filter(lambda x : x[-1] == 'train').map(lambda x :(x[0],x[1],x[2]))
test_sdf = split_sdf.filter(lambda x : x[-1] == 'test').map(lambda x :(x[0],x[1],x[2]))
validation_sdf = split_sdf.filter(lambda x : x[-1] == 'validation').map(lambda x :(x[0],x[1],x[2]))
​
print "Total Records in Original Ratings RDD is {}".format(split_sdf.count())
​
print "Total Records in training data RDD is {}".format(train_sdf.count())
​
print "Total Records in validation data RDD is {}".format(validation_sdf.count())
​
print "Total Records in test data RDD is {}".format(test_sdf.count())
​
​
#help(ratings_sdf)
Total Records in Original Ratings RDD is 300001
Total Records in training data RDD is 180321
Total Records in validation data RDD is 59763
Total Records in test data RDD is 59837

我的原始数据帧是ratings_sdf,我使用它来传递进行拆分的映射器函数.

My original data frame is ratings_sdf which I use to pass a mapper function which does the splitting.

如果您检查总的火车总和,则验证和测试的总和不会分裂(原始等级)计数.这些数字在每次运行代码时都会改变.

If you check the total sum of train, validation and test does not sum to split (original ratings) count. And these numbers change at every run of the code.

剩余的记录在哪里,为什么总和不相等?

Where is the remaining records going and why the sum is not equal?

推荐答案

TL; DR 如果要拆分DataFrame,请使用

TL;DR If you want to split DataFrame use randomSplit method:

ratings_sdf.randomSplit([0.6, 0.2, 0.2])

您的代码在多个级别上都是错误的,但是有两个基本问题使其无法修复:

Your code is just wrong on multiple levels but there are two fundamental problems that make it broken beyond repair:

    可以任意次数评估
  • Spark转换,并且您使用的函数应该是参照透明的且无副作用.您的代码多次评估split_sdf,并且您使用有状态的RNG data_split,因此每次结果都不相同.

  • Spark transformations can be evaluated arbitrary number of times and functions you use should be referentially transparent and side effect free. Your code evaluates split_sdf multiple times and you use stateful RNG data_split so each time results are different.

这会导致您描述一个行为,其中每个孩子看到父RDD的不同状态.

This results in a behavior you describe where each child sees different state of the parent RDD.

您没有正确初始化RNG,因此获得的随机值不是独立的.

You don't properly initialize RNG and in consequence random values you get are not independent.

这篇关于Spark数据帧随机分割的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆