Spark - create RDD of (label, features) pairs from CSV file


Problem description

I have a CSV file and want to perform a simple LinearRegressionWithSGD on the data.

A sample of the data is shown below (the file has 99 rows in total, including the header), and the objective is to predict the y_3 variable:

y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8
2995.3846153846152,17.0,1800.0,0.0,1.0,0.0,12.0
2236.304347826087,17.0,1432.0,1.0,0.0,0.0,12.0
2001.9512195121952,35.0,1432.0,0.0,1.0,0.0,5.0
992.4324324324324,17.0,1430.0,1.0,0.0,0.0,12.0
4386.666666666667,26.0,1430.0,0.0,0.0,1.0,25.0
1335.9036144578313,17.0,1432.0,0.0,1.0,0.0,5.0
1097.560975609756,17.0,1100.0,0.0,1.0,0.0,5.0
3526.6666666666665,26.0,1432.0,0.0,1.0,0.0,12.0
506.8421052631579,17.0,1430.0,1.0,0.0,0.0,5.0
2095.890410958904,35.0,1430.0,1.0,0.0,0.0,12.0
720.0,35.0,1430.0,1.0,0.0,0.0,5.0
2416.5,17.0,1432.0,0.0,0.0,1.0,12.0
3306.6666666666665,35.0,1800.0,0.0,0.0,1.0,12.0
6105.974025974026,35.0,1800.0,1.0,0.0,0.0,25.0
1400.4624277456646,35.0,1800.0,1.0,0.0,0.0,5.0
1414.5454545454545,26.0,1430.0,1.0,0.0,0.0,12.0
5204.68085106383,26.0,1800.0,0.0,0.0,1.0,25.0
1812.2222222222222,17.0,1800.0,1.0,0.0,0.0,12.0
2763.5928143712576,35.0,1100.0,1.0,0.0,0.0,12.0

I already read the data with the following command:

val data = sc.textFile(datadir + "/data_2.csv");

When I try to create an RDD of (label, features) pairs with the following command:

val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

it fails, so I cannot continue to train a model. Any help?

P.S. I run Spark with the Scala IDE on Windows 7 x64.

Accepted answer

After a lot of effort I found the solution. The first problem was the header row, and the second was the mapping function. Here is the complete solution:

//Imports needed for LabeledPoint and Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

//To read the file
val csv = sc.textFile(datadir + "/data_2.csv")

//To find the header
val header = csv.first

//To remove the header row (compare the whole line, not just its first character)
val data = csv.filter(_ != header)

//To create an RDD of (label, features) pairs:
//the first column is the label (y_3), the remaining columns are the features
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()
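With parsedData built, training can proceed. Below is a minimal sketch of fitting LinearRegressionWithSGD and checking the training error; the numIterations and stepSize values are illustrative assumptions, not tuned for this data:

```scala
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Hypothetical hyperparameters for illustration -- tune for real data
val numIterations = 100
val stepSize = 0.0001

// Train a linear regression model via stochastic gradient descent
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate on the training set: mean squared error between
// actual labels and predictions
val valuesAndPreds = parsedData.map { point =>
  (point.label, model.predict(point.features))
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println(s"training Mean Squared Error = $MSE")
```

Because the feature columns here are on very different scales (e.g. x_7 around 1400-1800 vs. the 0/1 indicator columns), a small step size or feature scaling is usually needed for SGD to converge.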

I hope it saves you some time.
