How to handle string data in ML classification

Problem description

Hello, I am a beginner in machine learning. I have previously worked on some binary ML tasks where the data was numerical. Now I have to find the probability of a particular combination. I cannot disclose the dataset or the code at this point. My data is a dataframe of 10 columns. I have to train my model on 8 columns and predict the probability of the last 2 columns; that is, my labels are a combination of the last 2 columns. The problem is that these column values are not numerical. I have tried everything I came across but can't find a suitable way of converting them to numerical values. I have tried LabelEncoder from sklearn, which works with the labels but throws a memory error if I use it again. I have tried to_numeric from pandas, which reads all the values as NaN. The values are in the form '2be74fad-4d4'. Any suggestions on how to handle this would be highly appreciated.

Recommended answer

To convert categorical data to numerical, you can try these approaches in sklearn:

  1. Label encoding (LabelEncoder)
  2. Label binarization (LabelBinarizer)
  3. One-hot encoding (OneHotEncoder)
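
As a rough illustration of how these three encoders differ, the sketch below applies each one to a small column of ID-like strings. The values are invented for illustration and are not taken from the question's dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

# Hypothetical ID-like strings, similar in shape to '2be74fad-4d4'
values = np.array(["2be74fad-4d4", "9c1d03aa-7f0", "2be74fad-4d4", "51e0b6c2-a19"])

# LabelEncoder: one integer per distinct value (expects 1-D input)
label_codes = LabelEncoder().fit_transform(values)

# LabelBinarizer: one indicator column per distinct value (dense array)
binarized = LabelBinarizer().fit_transform(values)

# OneHotEncoder: the same idea for feature columns; it expects 2-D input and
# returns a sparse matrix by default, which helps when there are many categories
one_hot = OneHotEncoder().fit_transform(values.reshape(-1, 1))

print(label_codes)                    # integer codes, assigned in sorted order
print(binarized.shape, one_hot.shape)
```

With many distinct ID strings, the dense indicator matrix from LabelBinarizer can grow large, so the sparse output of OneHotEncoder or the single integer column from LabelEncoder may be easier on memory.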

Now, for your problem, you can use LabelEncoder. But there is a catch. Most other sklearn transformers can be declared once and then fitted and applied to several columns at once; LabelEncoder handles only one column at a time.

With LabelEncoder, you have to fit_transform the encoder on one column of the train data and then transform the same column of the test data. Then repeat the process for the next categorical column.

You can iterate over a list of categorical columns to make it simple. Consider the snippet below:

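from sklearn.preprocessing import LabelEncoder

# train and test are pandas DataFrames that share these categorical columns
cat_cols = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
            'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Type_Combined']
encoders = {}  # one fitted encoder per column, kept for inverse_transform later

for col in cat_cols:
    enc = LabelEncoder()
    # Cast to str so missing values and mixed types are encoded consistently
    train[col] = enc.fit_transform(train[col].astype(str))
    # Reuse the mapping learned on train; raises ValueError on unseen values
    test[col] = enc.transform(test[col].astype(str))
    encoders[col] = enc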
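
One possible variation, sketched here as an assumption and not as part of the original answer: OrdinalEncoder can encode all feature columns in a single fit and, in scikit-learn 0.24 or later, map categories unseen during training to a placeholder value instead of raising an error. The two target columns can be joined into one string so that each combination becomes a single class. The column names and data below are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical stand-in data: two feature columns and two target columns
train = pd.DataFrame({
    "f1": ["a1", "b2", "a1", "c3"],
    "f2": ["x", "y", "y", "x"],
    "t1": ["p", "q", "p", "q"],
    "t2": ["m", "m", "n", "m"],
})
test = pd.DataFrame({"f1": ["b2", "zz"], "f2": ["x", "y"]})  # "zz" is unseen

feature_cols = ["f1", "f2"]

# OrdinalEncoder handles all feature columns in one fit; unseen test
# categories are mapped to -1 instead of raising an error
feat_enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train = feat_enc.fit_transform(train[feature_cols].astype(str))
X_test = feat_enc.transform(test[feature_cols].astype(str))

# Join the two target columns into one string so each combination is a class
combined = train["t1"].astype(str) + "|" + train["t2"].astype(str)
target_enc = LabelEncoder()
y_train = target_enc.fit_transform(combined)

print(X_test)
print(list(target_enc.classes_))
```

A classifier trained on X_train and y_train that exposes predict_proba would then return one probability per combined class, and target_enc.inverse_transform maps predicted codes back to the original string pairs.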
