Issue in encoding non-numeric features to numeric in Spark and IPython

Question

I am working on something where I have to make predictions for numeric data (monthly employee spending) using non-numeric features. I am using Spark MLlib's Random Forest algorithm. I have my feature data in a dataframe which looks like this:

   _1      _2      _3             _4
0  Level1  Male    New York       New York
1  Level1  Male    San Fransisco  California
2  Level2  Male    New York       New York
3  Level1  Male    Columbus       Ohio
4  Level3  Male    New York       New York
5  Level4  Male    Columbus       Ohio
6  Level5  Female  Stamford       Connecticut
7  Level1  Female  San Fransisco  California
8  Level3  Male    Stamford       Connecticut
9  Level6  Female  Columbus       Ohio

Here the columns are employee level, gender, city, and state; these are the features I want to use to predict the employee's monthly spending (the label, in $).

The training label set looks like this:

3528
4958
4958
1652
4958
6528
4958
4958
5528
7000

Since the features are in non-numeric form, I need to encode them to numeric. So I am following this link to encode categorical data into numbers. I wrote this code for that (following the process mentioned in the linked article):

import numpy as np
from sklearn.feature_extraction import DictVectorizer as DV
import pandas as pd

def extract(line):
    # keep only the columns used as features and label
    return (line[1], line[2], line[3], line[7], line[9], line[10], line[22])

# read the CSV and drop the header row (row index 0)
inputfile = (sc.textFile('file1.csv')
    .zipWithIndex()
    .filter(lambda (line, rownum): rownum > 0)
    .map(lambda (line, rownum): line))


input_data = (inputfile
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) >1 )
    .map(extract)) # Map to tuples

(train_data, test_data) = input_data.randomSplit([0.8, 0.2])

# converting RDD to dataframe
train_dataframe = train_data.toDF()
# converting to pandas dataframe
train_pandas = train_dataframe.toPandas()
# filtering features
train_pandas_features = train_pandas.iloc[:,:6]
# filtering label
train_pandas_label = train_pandas.iloc[:,6]

train_pandas_features_dict = train_pandas_features.T.to_dict().values()

# encoding features to numeric
vectorizer = DV(sparse=False)
vec_train = vectorizer.fit_transform(train_pandas_features_dict)
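
For reference, on a small hand-made input (not my real data) DictVectorizer produces the expected one-hot columns, so the dicts themselves seem to be the thing to check:

# a minimal sanity check with toy dicts (not my actual columns)
check = DV(sparse=False)
toy = [
    {"level": "Level1", "gender": "Male"},
    {"level": "Level2", "gender": "Female"},
]
print check.fit_transform(toy)
# [[ 0.  1.  1.  0.]
#  [ 1.  0.  0.  1.]]
print check.get_feature_names()
# ['gender=Female', 'gender=Male', 'level=Level1', 'level=Level2']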

But when I do print vec_train on my real data, all I see is 0. in every feature column. Something like this:

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

I think I am making a mistake somewhere, because this encoding is not producing the correct result. What mistake am I making? And is there some other, better way to encode non-numeric features to numeric for the case I described at the top (predicting numeric monthly expenditure based on non-numeric employee data)?

Answer

Generally speaking, if your data can be processed with Pandas data frames and scikit-learn, using Spark seems like serious overkill. Still, if you do use it, it probably makes more sense to use Spark tools all the way. Let's start with indexing your features:

from pyspark.ml.feature import StringIndexer
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import VectorAssembler

label_col = "x3"  # For example

# I assume this comes from your previous question
df = (rdd.map(lambda row: [row[i] for i in columns_num])
    .toDF(("x0", "x1", "x2", "x3")))

# Indexers encode strings with doubles
string_indexers = [
   StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))

   # For classifications problems
   #   - if you want to use ML you should index label as well
   #   - if you want to use MLlib it is not necessary
   # For regression problems you should omit label in the indexing
   # as shown below
   for x in df.columns if x not in {label_col} # Exclude other columns if needed
]

# Assembles multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in df.columns if x != label_col],
    outputCol="features"
)


pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(df)
indexed = model.transform(df)

The pipeline defined above will create the following data frame:

indexed.printSchema()
## root
##  |-- x0: string (nullable = true)
##  |-- x1: string (nullable = true)
##  |-- x2: string (nullable = true)
##  |-- x3: string (nullable = true)
##  |-- idx_x0: double (nullable = true)
##  |-- idx_x1: double (nullable = true)
##  |-- idx_x2: double (nullable = true)
##  |-- features: vector (nullable = true)

where features should be a valid input for mllib.tree.DecisionTree (see SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?).
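
If you also need categoricalFeaturesInfo for the MLlib tree algorithms, one way to build it is from the fitted indexers. A rough sketch (assuming the indexer stages precede the assembler in the same order as the assembled columns, and that your Spark version exposes StringIndexerModel.labels):

# sketch: map each assembled feature position to its category count;
# model.stages[:-1] are the fitted StringIndexerModels, the last stage
# is the VectorAssembler
categorical_features_info = {
    i: len(indexer.labels)
    for i, indexer in enumerate(model.stages[:-1])
}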

You can create labeled points out of it as follows:

from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

label_points = (indexed
    .select(col(label_col).alias("label"), col("features"))
    .map(lambda row: LabeledPoint(row.label, row.features)))
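
From there, training an MLlib regression forest could look roughly like this (a sketch only: the parameter values are placeholders, and categorical_features_info comes from the sketch above):

from pyspark.mllib.tree import RandomForest

# maxBins must be at least as large as the largest category count
rf_model = RandomForest.trainRegressor(
    label_points,
    categoricalFeaturesInfo=categorical_features_info,
    numTrees=50,
    featureSubsetStrategy="auto",
    impurity="variance",
    maxDepth=5,
    maxBins=64)

predictions = rf_model.predict(label_points.map(lambda lp: lp.features))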
