Spark train test split
Question
I am curious whether there is something similar to sklearn's StratifiedShuffleSplit (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html) for apache-spark in the latest 2.0.1 release.
So far I could only find stratified sampling (https://spark.apache.org/docs/latest/mllib-statistics.html#stratified-sampling), which does not seem to be a great fit for splitting a heavily imbalanced dataset into train/test samples.
Recommended answer
Spark supports stratified sampling, as outlined in https://s3.amazonaws.com/sparksummit-share/ml-ams-1.0.1/6-sampling/scala/6-sampling_student.html. The map gives the sampling fraction for each label value, and the last argument is the random seed:
df.stat.sampleBy("label", Map(0 -> .10, 1 -> .20, 2 -> .3), 0)
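The call above only returns the sampled rows, so a train/test split still needs the complement. A minimal sketch of one way to get both halves, assuming a DataFrame with a "label" column: sample the training set per label with sampleBy, then take the rest with a left anti join on a temporary id column. The helper name stratifiedSplit, the "_row_id" column, and the toy data are hypothetical and not part of the original answer. As far as I can tell, sampleBy decides row by row, so the per-class proportions are approximate rather than exact.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.monotonically_increasing_id

object StratifiedSplitSketch {

  // Hypothetical helper: returns (train, test), keeping roughly
  // `trainFraction` of the rows of every label in the training set.
  def stratifiedSplit(df: DataFrame, labelCol: String,
                      trainFraction: Double, seed: Long): (DataFrame, DataFrame) = {
    // Use the same fraction for every distinct label value.
    val labels: Array[Any] = df.select(labelCol).distinct().collect().map(_.get(0))
    val fractions: Map[Any, Double] = labels.map(l => l -> trainFraction).toMap

    // Temporary id column (assumed name) so the complement can be found by key.
    // Caching is meant to keep the ids stable across the two uses below.
    val withId = df.withColumn("_row_id", monotonically_increasing_id()).cache()

    val train = withId.stat.sampleBy(labelCol, fractions, seed)
    // Rows whose id was not sampled into train form the test set.
    val test = withId.join(train.select("_row_id"), Seq("_row_id"), "left_anti")

    (train.drop("_row_id"), test.drop("_row_id"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("stratified-split-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy imbalanced data: 950 rows of label 0, 50 rows of label 1.
    val df = (Seq.fill(950)(0) ++ Seq.fill(50)(1)).toDF("label")

    val (train, test) = stratifiedSplit(df, "label", 0.8, 42L)
    train.groupBy("label").count().show()
    test.groupBy("label").count().show()

    spark.stop()
  }
}
```

An anti join on an id is used here instead of df.except(train) because except has set semantics and would drop duplicate rows. If exact per-class counts matter for a very rare class, this sketch is probably not sufficient on its own.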