Spark train test split

Question

I am curious whether Apache Spark, as of the latest 2.0.1 release, has something similar to sklearn's StratifiedShuffleSplit (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html).

So far I could only find https://spark.apache.org/docs/latest/mllib-statistics.html#stratified-sampling, which does not seem to be a great fit for splitting a heavily imbalanced dataset into train/test samples.

Recommended answer

Spark supports stratified samples as outlined in https://s3.amazonaws.com/sparksummit-share/ml-ams-1.0.1/6-sampling/scala/6-sampling_student.html

df.stat.sampleBy("label", Map(0 -> .10, 1 -> .20, 2 -> .3), 0)
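Note that sampleBy returns a single sample, not a train/test pair. One way to turn it into a split is to treat the sample as the training set and take the remaining rows as the test set. The following is only a minimal sketch of that idea, not a built-in Spark API: the helper name stratifiedSplit, the unique "id" column, the toy data, and the 0.8 fractions are all assumptions made for illustration. sampleBy samples each row independently, so the per-label fractions are approximate rather than exact, and labels missing from the map are sampled with fraction 0.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StratifiedSplitSketch {

  // Hypothetical helper: returns (train, test).
  // `fractions` maps each label value to the share of that stratum that
  // should go into the training set. `idCol` must uniquely identify rows.
  def stratifiedSplit[T](df: DataFrame, idCol: String, labelCol: String,
                         fractions: Map[T, Double], seed: Long): (DataFrame, DataFrame) = {
    // Per-stratum sampling without replacement; cached so that the
    // anti-join below sees the same sample that is returned as `train`.
    val train = df.stat.sampleBy(labelCol, fractions, seed).cache()
    // Test set = all rows whose id did not end up in the training sample.
    val test = df.join(train.select(idCol), Seq(idCol), "left_anti")
    (train, test)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("stratified-split-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy imbalanced data: roughly 10% of rows have label 1.
    val df = (0 until 1000)
      .map(i => (i.toLong, if (i % 10 == 0) 1 else 0))
      .toDF("id", "label")

    val (train, test) =
      stratifiedSplit(df, "id", "label", Map(0 -> 0.8, 1 -> 0.8), seed = 42L)

    // Inspect the class balance of each side of the split.
    train.groupBy("label").count().show()
    test.groupBy("label").count().show()

    spark.stop()
  }
}
```

Using the same fraction for every label keeps the class ratio of the original data in both sets, up to sampling noise. If rows have no natural unique key, one would need to be added before splitting.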
