Could not convert string to float error from the Titanic competition
Question
I'm trying to solve the Titanic survival competition from Kaggle. It's my first step in actually learning machine learning. I have a problem where the gender column causes an error. The stack trace says: could not convert string to float: 'female'. How did you deal with this issue? I don't want a full solution, just a practical approach to this problem, because I do need the gender column to build my model.
Here is my code:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)
x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)
Recommended answer
There are a couple of ways to deal with this, and which one to pick depends on what you're looking for:
- You could encode your categories to numeric values, i.e. transform each level of your category to a distinct number, or
- dummy code your category, i.e. turn each level of your category into a separate column, which gets a value of 0 or 1.
In many machine learning applications, factors are better handled as dummy codes, since integer codes impose an arbitrary ordering on the levels.
Note that in the case of a 2-level category, encoding to numeric according to the methods outlined below is essentially equivalent to dummy coding: all the values that are not level 0 are necessarily level 1. In fact, in the dummy code example I've given below, there is redundant information, as I've given each of the 2 classes its own column. It's just to illustrate the concept. Typically, one would only create n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. make a column for Female, and all the 0 values are implied to be Male).
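To make the n-1 rule concrete, here is a minimal sketch with a three-level category. The column name port and its values are made up for illustration. dtype=int is passed so the columns show 0/1 regardless of pandas version, since newer versions default to booleans.

```python
import pandas as pd

# Hypothetical three-level category (name and values are illustrative only)
df = pd.DataFrame({"port": ["S", "C", "Q", "S", "C"]})

# n = 3 levels -> n - 1 = 2 dummy columns; the first level ("C") is dropped
dummies = pd.get_dummies(df["port"], drop_first=True, dtype=int)
print(dummies)

# Rows where every dummy column is 0 belong to the omitted level "C"
implied_c = dummies.sum(axis=1) == 0
print(df.loc[implied_c, "port"].tolist())
```

The two remaining columns still identify all three levels, which is why the dropped column carries no extra information.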
Method 1: pd.factorize
pd.factorize is a simple, fast way of encoding to numeric.
For example, if your column gender looks like this:
>>> df
gender
0 Female
1 Male
2 Male
3 Male
4 Female
5 Female
6 Male
7 Female
8 Female
9 Female
df['gender_factor'] = pd.factorize(df.gender)[0]
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
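The [0] above is needed because pd.factorize returns a pair: the integer codes and the unique levels. The sketch below, on a made-up column with one missing entry, shows both parts and how the levels can be used to map codes back to labels.

```python
import pandas as pd

# Hypothetical column with one missing entry
gender = pd.Series(["Female", "Male", "Male", None, "Female"])

# factorize returns a (codes, uniques) pair
codes, uniques = pd.factorize(gender)
print(codes)          # integer code per row; missing values get -1
print(list(uniques))  # level for each code, in order of first appearance

# Map the codes back to labels (missing values stay missing)
restored = [uniques[c] if c >= 0 else None for c in codes]
print(restored)
```

Keeping uniques around is worthwhile if you later need to interpret the codes, for instance when reading model output.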
Method 2: categorical dtype
Another way would be to use the category dtype:
df['gender_factor'] = df['gender'].astype('category').cat.codes
This results in the same output here, because the categories are sorted and Female sorts before Male.
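The two methods do not always agree, though. As a small illustration on made-up data: pd.factorize numbers the levels in order of first appearance, while the category dtype sorts them, so the codes differ when Male happens to come first.

```python
import pandas as pd

# Hypothetical column where "Male" appears first
gender = pd.Series(["Male", "Female", "Male"])

# Order of first appearance: Male -> 0, Female -> 1
by_appearance = pd.factorize(gender)[0]

# Sorted categories: Female -> 0, Male -> 1
by_sorted = gender.astype("category").cat.codes

print(by_appearance.tolist())
print(by_sorted.tolist())
```

Either is fine for a model, as long as the same mapping is applied to the training and test data.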
Method 3: sklearn.preprocessing.LabelEncoder()
This method comes with some bonuses, such as easy back transforming:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
# Easy to back transform:
df['gender_factor'] = le.inverse_transform(df.gender_factor)
>>> df
gender gender_factor
0 Female Female
1 Male Male
2 Male Male
3 Male Male
4 Female Female
5 Female Female
6 Male Male
7 Female Female
8 Female Female
9 Female Female
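Another practical benefit is that a fitted encoder can be reused, so new data gets the same mapping as the training data. The sketch below uses made-up lists in place of real columns. Note that scikit-learn's documentation describes LabelEncoder as intended for target labels; for feature columns it also provides OrdinalEncoder and OneHotEncoder.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and validation values
train_gender = ["Female", "Male", "Male", "Female"]
val_gender = ["Male", "Female"]

le = LabelEncoder()
train_codes = le.fit_transform(train_gender)  # learn the mapping on training data
val_codes = le.transform(val_gender)          # reuse the same mapping

print(list(le.classes_))      # learned levels, sorted
print(train_codes.tolist())
print(val_codes.tolist())
print(list(le.inverse_transform(val_codes)))  # back to the original labels
```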
Dummy coding:
Method 1: pd.get_dummies
df.join(pd.get_dummies(df.gender))
gender Female Male
0 Female 1 0
1 Male 0 1
2 Male 0 1
3 Male 0 1
4 Female 1 0
5 Female 1 0
6 Male 0 1
7 Female 1 0
8 Female 1 0
9 Female 1 0
Note, if you want to omit one column to get a non-redundant dummy code (see my note at the beginning of this answer), you can use:
df.join(pd.get_dummies(df.gender, drop_first=True))
gender Male
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
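Applied to the question's setup, the idea might look like the sketch below. Several things here are assumptions rather than part of the original answer: the tiny DataFrame is a made-up stand-in for train.csv, Survived is left out of the feature columns, missing values are dropped only in the columns actually used, and DecisionTreeClassifier is swapped in for the question's DecisionTreeRegressor because Survived is a 0/1 label.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up stand-in for train.csv (values are illustrative, not real data)
train_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 1, 2],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0, None, 54.0],
})

features = ["Pclass", "Sex", "Age"]  # target column left out of the features

# Drop rows with missing values only in the columns actually used
data = train_data.dropna(subset=features + ["Survived"])

# Dummy code Sex into a single 0/1 column named Sex_male
x = pd.get_dummies(data[features], columns=["Sex"], drop_first=True, dtype=int)
y = data["Survived"]

model = DecisionTreeClassifier(random_state=0)
model.fit(x, y)  # no string-to-float error: every column is numeric
print(list(x.columns))
print(model.predict(x).tolist())
```

The same get_dummies call would need to be applied to the validation and test data, so that the model sees the same columns at prediction time.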