SciKit-Learn Label Encoder resulting in error 'argument must be a string or number'

Problem description

I'm a bit confused while creating an ML model here.

I'm at the step where I'm trying to take categorical features from a "large" dataframe (180 columns) and one-hot encode them so that I can find the correlation between the features and select the "best" features.

Here is my code:

# import labelencoder
from sklearn.preprocessing import LabelEncoder

# instantiate labelencoder object
le = LabelEncoder()

# apply le on categorical feature columns
df = df.apply(lambda col: le.fit_transform(col))
df.head(10)

When running this, I get the following error:

TypeError: ('argument must be a string or number', 'occurred at index LockTenor')

So I head over to the LockTenor field and look at all the distinct values:

df.LockTenor.unique()

This gives the following result:

array([60.0, 45.0, 'z', 90.0, 75.0, 30.0], dtype=object)

That looks like all strings and numbers to me. Is the error caused because the values are floats and not necessarily ints?

Recommended answer

You get this error because the column does indeed contain a combination of floats and strings. Take a look at this example:

# Preliminaries
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create DataFrames

# df1 has all floats
d1 = {'LockTenor':[60.0, 45.0, 15.0, 90.0, 75.0, 30.0]}
df1 = pd.DataFrame(data=d1)
print("DataFrame 1")
print(df1)

# df2 has a string in the mix
d2 = {'LockTenor':[60.0, 45.0, 'z', 90.0, 75.0, 30.0]}
df2 = pd.DataFrame(data=d2)
print("DataFrame 2")
print(df2)

# Create encoder
le = LabelEncoder()

# Encode DataFrame 1 (where all values are floats)
df1 = df1.apply(lambda col: le.fit_transform(col), axis=0, result_type='expand')
print("DataFrame 1 encoded")
print(df1)

# Encode DataFrame 2 (where there is a combination of floats and strings); this call raises the TypeError
df2 = df2.apply(lambda col: le.fit_transform(col), axis=0, result_type='expand')
print("DataFrame 2 encoded")
print(df2)

If you run this code, you will see that df1 is encoded with no problem, since all its values are floats. For df2, however, you will get the error that you are reporting.
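To see where the error plausibly comes from, here is a minimal sketch in plain Python. It assumes that the encoder has to sort the distinct values of a column to assign labels, which is how the error surfaces in the versions I am aware of. The variable names are illustrative only.

```python
# The distinct values from the question: five floats and one string.
values = [60.0, 45.0, 'z', 90.0, 75.0, 30.0]

# Sorting needs to compare every kind of value with every other kind,
# and Python 3 refuses to compare a float with a str.
try:
    sorted(set(values))
    sort_failed = False
except TypeError as exc:
    sort_failed = True
    print(f"Sorting mixed types failed: {exc}")

# Once everything is a string, the values can be ordered again.
as_strings = sorted(str(v) for v in set(values))
print(as_strings)
```

Note that the string ordering is lexicographic, not numeric.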

An easy fix is to cast the column to a string. You can do this in the corresponding lambda function:

df2 = df2.apply(lambda col: le.fit_transform(col.astype(str)), axis=0, result_type='expand')
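A possible variation on the same fix, sketched under the assumption that you want to decode the labels later: fit one LabelEncoder per column and keep them in a dictionary. A single shared encoder is refit on every column, so it only remembers the classes of the last one. The `encoders` dictionary and the `encoded` frame are names made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df2 = pd.DataFrame({'LockTenor': [60.0, 45.0, 'z', 90.0, 75.0, 30.0]})

# Keep one fitted encoder per column so each mapping can be inverted later.
encoders = {}
encoded = pd.DataFrame(index=df2.index)
for name in df2.columns:
    enc = LabelEncoder()
    # Cast to str first so the column holds a single type.
    encoded[name] = enc.fit_transform(df2[name].astype(str))
    encoders[name] = enc

print(encoded['LockTenor'].tolist())
print(list(encoders['LockTenor'].classes_))

# Map a label back to the original (string) value.
print(encoders['LockTenor'].inverse_transform([5]))
```

Because the values are now strings, the labels follow lexicographic order, so a value such as '100.0' would be ordered before '30.0'. That is harmless for purely nominal categories but worth knowing if the labels are read as ordered.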

As an additional suggestion, I would recommend that you take a look at your data and check whether they are correct. To me, it is a bit odd to have a mix of floats and strings in the same column.
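One way to carry out that check is sketched below. It counts the Python types present in the column and lists the entries that cannot be parsed as numbers. The DataFrame is the small example from above, not your real data, and the variable names are hypothetical.

```python
import pandas as pd

df2 = pd.DataFrame({'LockTenor': [60.0, 45.0, 'z', 90.0, 75.0, 30.0]})
col = df2['LockTenor']

# Count how many values of each Python type the column holds.
type_counts = col.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Values that fail numeric conversion become NaN with errors='coerce';
# excluding values that were already missing leaves the offending entries.
as_numbers = pd.to_numeric(col, errors='coerce')
non_numeric = col[as_numbers.isna() & col.notna()].tolist()
print(non_numeric)
```

If the stray entries turn out to be placeholders for missing data, it may be cleaner to replace them with a proper missing value before encoding, but that depends on what they mean in your dataset.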

Finally, I would just like to point out that scikit-learn's LabelEncoder performs a simple integer encoding of variables; it does not perform one-hot encoding. If you want one-hot encoding, I recommend you take a look at OneHotEncoder.
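For completeness, a minimal sketch of what that could look like with OneHotEncoder on the same toy column. The cast to string is carried over from the fix above and is an assumption about how you want to treat the mixed values. The encoder is left at its defaults, and `.toarray()` converts its sparse output to a dense array for printing.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df2 = pd.DataFrame({'LockTenor': [60.0, 45.0, 'z', 90.0, 75.0, 30.0]})

# OneHotEncoder expects 2-D input, so select the column as a DataFrame.
X = df2[['LockTenor']].astype(str)

ohe = OneHotEncoder()
one_hot = ohe.fit_transform(X).toarray()

# One output column per distinct category, one 1 per row.
print(ohe.categories_[0])
print(one_hot.shape)
print(one_hot)
```

With 180 columns, one-hot encoding can widen the frame considerably, since every distinct category becomes its own column.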
