Issue in encoding non-numeric features to numeric in Spark and IPython

Question

I am working on something where I have to make predictions for numeric data (monthly employee spending) using non-numeric features. I am using Spark MLlib's Random Forest algorithm. I have my feature data in a dataframe which looks like this:

   _1      _2      _3             _4
0  Level1  Male    New York       New York
1  Level1  Male    San Fransisco  California
2  Level2  Male    New York       New York
3  Level1  Male    Columbus       Ohio
4  Level3  Male    New York       New York
5  Level4  Male    Columbus       Ohio
6  Level5  Female  Stamford       Connecticut
7  Level1  Female  San Fransisco  California
8  Level3  Male    Stamford       Connecticut
9  Level6  Female  Columbus       Ohio

Here the columns are employee level, gender, city, and state; these are the features I want to use to predict the employee's monthly spending (the label, in $).

The training label set looks like this:

3528
4958
4958
1652
4958
6528
4958
4958
5528
7000

Since the features are in non-numeric form, I need to encode them to numeric. So I am following this link to encode categorical data into numbers. I wrote this code for that (following the process mentioned in the linked article):

import numpy as np
from sklearn.feature_extraction import DictVectorizer as DV
import pandas as pd

def extract(line):
    # keep only the columns used as features and label
    return (line[1], line[2], line[3], line[7], line[9], line[10], line[22])

# read the CSV and drop the header row (row index 0)
inputfile = (sc.textFile('file1.csv')
    .zipWithIndex()
    .filter(lambda (line, rownum): rownum > 0)
    .map(lambda (line, rownum): line))


input_data = (inputfile
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) >1 )
    .map(extract)) # Map to tuples

(train_data, test_data) = input_data.randomSplit([0.8, 0.2])

# converting RDD to dataframe
train_dataframe = train_data.toDF()
# converting to pandas dataframe
train_pandas = train_dataframe.toPandas()
# filtering features
train_pandas_features = train_pandas.iloc[:,:6]
# filtering label
train_pandas_label = train_pandas.iloc[:,6]

train_pandas_features_dict = train_pandas_features.T.to_dict().values()

# encoding features to numeric
vectorizer = DV(sparse=False)
vec_train = vectorizer.fit_transform(train_pandas_features_dict)
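
For reference, on a small hand-made input (not my real data) DictVectorizer produces the expected one-hot columns, so the dicts themselves seem to be the thing to check:

# a minimal sanity check with toy dicts (not my actual columns)
check = DV(sparse=False)
toy = [
    {"level": "Level1", "gender": "Male"},
    {"level": "Level2", "gender": "Female"},
]
print check.fit_transform(toy)
# [[ 0.  1.  1.  0.]
#  [ 1.  0.  0.  1.]]
print check.get_feature_names()
# ['gender=Female', 'gender=Male', 'level=Level1', 'level=Level2']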

But when I do print vec_train on my real data, all I see is 0. in every feature column. Something like this:

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

I think I am making a mistake somewhere, because this encoding is not producing the correct result. What mistake am I making? And is there some other, better way to encode non-numeric features to numeric for the case I described at the top (predicting numeric monthly expenditure based on non-numeric employee data)?

Answer

Generally speaking, if your data can be processed with Pandas data frames and scikit-learn, using Spark seems like serious overkill. Still, if you do use it, it probably makes more sense to use Spark tools all the way. Let's start with indexing your features:

from pyspark.ml.feature import StringIndexer
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import VectorAssembler

label_col = "x3"  # For example

# I assume this comes from your previous question
df = (rdd.map(lambda row: [row[i] for i in columns_num])
    .toDF(("x0", "x1", "x2", "x3")))

# Indexers encode strings with doubles
string_indexers = [
   StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))

   # For classifications problems
   #   - if you want to use ML you should index label as well
   #   - if you want to use MLlib it is not necessary
   # For regression problems you should omit label in the indexing
   # as shown below
   for x in df.columns if x not in {label_col} # Exclude other columns if needed
]

# Assembles multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in df.columns if x != label_col],
    outputCol="features"
)


pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(df)
indexed = model.transform(df)

The pipeline defined above will create the following data frame:

indexed.printSchema()
## root
##  |-- x0: string (nullable = true)
##  |-- x1: string (nullable = true)
##  |-- x2: string (nullable = true)
##  |-- x3: string (nullable = true)
##  |-- idx_x0: double (nullable = true)
##  |-- idx_x1: double (nullable = true)
##  |-- idx_x2: double (nullable = true)
##  |-- features: vector (nullable = true)

where features should be a valid input for mllib.tree.DecisionTree (see SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?).
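
If you also need categoricalFeaturesInfo for the MLlib tree algorithms, one way to build it is from the fitted indexers. A rough sketch (assuming the indexer stages precede the assembler in the same order as the assembled columns, and that your Spark version exposes StringIndexerModel.labels):

# sketch: map each assembled feature position to its category count;
# model.stages[:-1] are the fitted StringIndexerModels, the last stage
# is the VectorAssembler
categorical_features_info = {
    i: len(indexer.labels)
    for i, indexer in enumerate(model.stages[:-1])
}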

You can create labeled points out of it as follows:

from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

label_points = (indexed
    .select(col(label_col).alias("label"), col("features"))
    .map(lambda row: LabeledPoint(row.label, row.features)))
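
From there, training an MLlib regression forest could look roughly like this (a sketch only: the parameter values are placeholders, and categorical_features_info comes from the sketch above):

from pyspark.mllib.tree import RandomForest

# maxBins must be at least as large as the largest category count
rf_model = RandomForest.trainRegressor(
    label_points,
    categoricalFeaturesInfo=categorical_features_info,
    numTrees=50,
    featureSubsetStrategy="auto",
    impurity="variance",
    maxDepth=5,
    maxBins=64)

predictions = rf_model.predict(label_points.map(lambda lp: lp.features))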
