Convert dataframe to libsvm format

Question

I have a dataframe resulting from a SQL query:

df1 = sqlContext.sql("select * from table_test")

I need to convert this dataframe to libsvm format so that it can be provided as input to

pyspark.ml.classification.LogisticRegression

I tried the following. However, it failed with the error below, since I am using Spark 1.5.2:

df1.write.format("libsvm").save("data/foo")
Failed to load class for data source: libsvm

I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and can't pip install it directly, so I downloaded the file, copied it over with scp, and installed it manually. Everything seemed to work, but I still get the following error:

import org.apache.spark.mllib.util.MLUtils
No module named org.apache.spark.mllib.util.MLUtils

Question 1: Is my approach above to converting the dataframe to libsvm format heading in the right direction?
Question 2: If "yes" to question 1, how do I get MLUtils working? If "no", what is the best way to convert a dataframe to libsvm format?

Recommended answer

I would do it like this (this is just an example with an arbitrary dataframe; I don't know how your df1 is structured, so the focus is on the data transformations):

This is how I convert a dataframe to libsvm format:

# ... your previous imports

from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint

# A DATAFRAME
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  3|  6|  
|  4|  5| 20|
|  7|  8|  8|
+---+---+---+

# FROM DATAFRAME TO RDD
>>> c = df.rdd  # converts the dataframe into an RDD of Row objects
>>> print (c.take(3))
[Row(_1=1, _2=3, _3=6), Row(_1=4, _2=5, _3=20), Row(_1=7, _2=8, _3=8)]

# FROM RDD OF ROWS (TUPLES) TO AN RDD OF LABELEDPOINT
>>> d = c.map(lambda line: LabeledPoint(line[0], line[1:]))  # arbitrary mapping, just an example: first column is the label, the remaining columns are the features
>>> print (d.take(3))
[LabeledPoint(1.0, [3.0,6.0]), LabeledPoint(4.0, [5.0,20.0]), LabeledPoint(7.0, [8.0,8.0])]

# SAVE AS LIBSVM
>>> MLUtils.saveAsLibSVMFile(d, "/your/Path/nameFolder/")

What you will see in the "/your/Path/nameFolder/part-0000*" files is:

1.0 1:3.0 2:6.0
4.0 1:5.0 2:20.0
7.0 1:8.0 2:8.0
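
To make the layout of those lines explicit, here is a minimal plain-Python sketch of the per-row formatting: the label first, then "index:value" pairs with 1-based indices. The helper name to_libsvm_line is hypothetical and not part of Spark; it only mimics what the saved files contain for the example rows above, assuming dense features and the first column as the label.

```python
def to_libsvm_line(label, features):
    """Format one example as 'label index:value ...' (indices start at 1)."""
    parts = [str(float(label))]
    for i, value in enumerate(features, start=1):
        parts.append("%d:%s" % (i, float(value)))
    return " ".join(parts)

# Same rows as the example dataframe: first column is the label.
rows = [(1, 3, 6), (4, 5, 20), (7, 8, 8)]
lines = [to_libsvm_line(r[0], r[1:]) for r in rows]
print("\n".join(lines))
```

A helper like this could also be handy for checking a few rows by hand before writing the full RDD with MLUtils.saveAsLibSVMFile.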

See the pyspark.mllib.regression.LabeledPoint documentation for details.
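
As a quick sanity check on the saved files, the lines can be parsed back into a label and a feature list. The sketch below is plain Python and an assumption of mine, not the Spark implementation; in Spark, MLUtils.loadLibSVMFile plays this role. The helper name parse_libsvm_line and the num_features argument are hypothetical.

```python
def parse_libsvm_line(line, num_features):
    """Parse 'label index:value ...' into (label, dense feature list)."""
    tokens = line.split()
    label = float(tokens[0])
    features = [0.0] * num_features  # indices missing from the line stay 0.0
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index) - 1] = float(value)  # libsvm indices are 1-based
    return label, features

sample = ["1.0 1:3.0 2:6.0", "4.0 1:5.0 2:20.0", "7.0 1:8.0 2:8.0"]
parsed = [parse_libsvm_line(s, num_features=2) for s in sample]
print(parsed)
```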
