Issue encoding non-numeric features to numeric in Spark and IPython
Problem description
I am working on something where I have to make predictions for numeric data (monthly employee spending) using non-numeric features. I am using Spark MLlib's Random Forests algorithm. I have my features data in a dataframe which looks like this:
   _1      _2      _3             _4
0  Level1  Male    New York       New York
1  Level1  Male    San Fransisco  California
2  Level2  Male    New York       New York
3  Level1  Male    Columbus       Ohio
4  Level3  Male    New York       New York
5  Level4  Male    Columbus       Ohio
6  Level5  Female  Stamford       Connecticut
7  Level1  Female  San Fransisco  California
8  Level3  Male    Stamford       Connecticut
9  Level6  Female  Columbus       Ohio
Here the columns are employee level, gender, city, and state. These are the features I want to use to predict employee monthly spending (the label, in $).
The training label set looks like this:
3528
4958
4958
1652
4958
6528
4958
4958
5528
7000
Since the features are in non-numeric form, I need to encode them as numbers. I am following this link to encode categorical data into numbers. I wrote this code for it (following the process described in the linked article):
import numpy as np
from sklearn.feature_extraction import DictVectorizer as DV
import pandas as pd

def extract(line):
    # Keep only the columns of interest; the last one is the label
    return (line[1], line[2], line[3], line[7], line[9], line[10], line[22])

# Skip the header row (index-based lambdas work on both Python 2 and 3)
inputfile = sc.textFile('file1.csv').zipWithIndex().filter(lambda pair: pair[1] > 0).map(lambda pair: pair[0])
input_data = (inputfile
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) > 1)
    .map(extract))  # Map to tuples
(train_data, test_data) = input_data.randomSplit([0.8, 0.2])
# converting RDD to dataframe
train_dataframe = train_data.toDF()
# converting to pandas dataframe
train_pandas = train_dataframe.toPandas()
# filtering features
train_pandas_features = train_pandas.iloc[:,:6]
# filtering label
train_pandas_label = train_pandas.iloc[:,6]
train_pandas_features_dict = train_pandas_features.T.to_dict().values()
# encoding features to numeric
vectorizer = DV( sparse = False )
vec_train = vectorizer.fit_transform( train_pandas_features_dict )
When I print vec_train, all I see is 0. in every feature column. Something like this:
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
I think I am making a mistake somewhere, because this encoding is not producing the correct result. What am I doing wrong? And is there a better way to encode non-numeric features as numeric for the case I described at the top (predicting numeric monthly expenditure from non-numeric employee data)?
Recommended answer
Generally speaking, if you have data that can be processed using Pandas data frames and scikit-learn, using Spark seems to be serious overkill. Still, if you do use it, it probably makes more sense to use Spark tools all the way. Let's start by indexing your features:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import VectorAssembler
label_col = "x3" # For example
# I assume this comes from your previous question
df = (rdd.map(lambda row: [row[i] for i in columns_num])
    .toDF(("x0", "x1", "x2", "x3")))

# Indexers encode strings with doubles
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    # For classification problems
    # - if you want to use ML you should index label as well
    # - if you want to use MLlib it is not necessary
    # For regression problems you should omit label in the indexing
    # as shown below
    for x in df.columns if x not in {label_col}  # Exclude other columns if needed
]

# Assembles multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in df.columns if x != label_col],
    outputCol="features"
)

pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(df)
indexed = model.transform(df)
The pipeline defined above will create the following data frame:
indexed.printSchema()
## root
## |-- x0: string (nullable = true)
## |-- x1: string (nullable = true)
## |-- x2: string (nullable = true)
## |-- x3: string (nullable = true)
## |-- idx_x0: double (nullable = true)
## |-- idx_x1: double (nullable = true)
## |-- idx_x2: double (nullable = true)
## |-- features: vector (nullable = true)
where features should be a valid input for mllib.tree.DecisionTree (see SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?).
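To make the idx_ columns more concrete, here is a rough pure-Python sketch of the idea behind StringIndexer: each distinct string gets a double index, with the most frequent value receiving 0.0. The helper name fit_string_index and the alphabetical tie-breaking are assumptions of this sketch, not Spark's API, and the ordering Spark picks for equally frequent values may differ between versions. The sample values are the employee levels from the question.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically so the
    # result is deterministic (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Employee levels from the sample data in the question
levels = ["Level1", "Level1", "Level2", "Level1", "Level3",
          "Level4", "Level5", "Level1", "Level3", "Level6"]

level_index = fit_string_index(levels)
indexed_levels = [level_index[v] for v in levels]
print(indexed_levels)
```

The resulting numbers are category identifiers, not quantities, which is why a tree model has to be told that such columns are categorical.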
You can create labeled points from it as follows:
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

label_points = (indexed
    .select(col(label_col).alias("label"), col("features"))
    .rdd  # DataFrame.map was removed in Spark 2.0; .rdd.map also works on 1.x
    .map(lambda row: LabeledPoint(row.label, row.features)))
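When labeled points like these are passed to an MLlib tree model such as RandomForest.trainRegressor, the indexed columns are typically declared through the categoricalFeaturesInfo argument, a dict mapping each feature position to its number of categories. A minimal sketch of deriving that dict is below. The helper name and the sample rows are hypothetical, and it assumes the indices are contiguous from 0, as they are when StringIndexer is fitted on the same data.

```python
def categorical_features_info(rows, categorical_positions):
    """Return {feature position: number of categories}.

    Assumes indices run from 0 to k-1 without gaps, so the arity of a
    feature is its largest index plus one.
    """
    return {
        pos: int(max(row[pos] for row in rows)) + 1
        for pos in categorical_positions
    }


# Hypothetical indexed (level, gender, city, state) feature rows
rows = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 2.0],
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 3.0, 3.0],
]

info = categorical_features_info(rows, categorical_positions=[0, 1, 2, 3])
# MLlib trees need maxBins to be at least the largest category count
min_bins = max(info.values())
print(info, min_bins)
```

With a high-cardinality column such as city, min_bins can exceed the MLlib default of 32, in which case maxBins has to be raised when training.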