How to use pandas DataFrames with sklearn?


Problem description

The goal of my project is to predict the accuracy level of some textual descriptions.

I made the vectors with FASTTEXT.

TSV output:

0  1:0.0033524514 2:-0.021896651 3:0.05087798 4:0.0072637126 ...
1  1:0.003118149 2:-0.015105667 3:0.040879637 4:0.000539902 ...

Resources are labelled as Good (1) or Bad (0).

To check the accuracy I used scikit-learn and SVM.

Following this tutorial I made this script:


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt

r_filenameTSV = 'TSV/A19784.tsv'

tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])

df = pd.DataFrame(tsv_read)

df = pd.DataFrame(df.vector.str.split(' ',1).tolist(),
                                   columns = ['label','vector'])


print ("Features:" , df.vector)

print ("Labels:" , df.label)

X_train, X_test, y_train, y_test = train_test_split(df.vector, df.label, test_size=0.2,random_state=0)

#Create a svm Classifier
clf = svm.SVC(kernel='linear') 

#Train the model using the training sets
clf.fit (str((X_train, y_train)))

#Predict the response for test dataset
y_pred = clf.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

The first time I tried to run the script I got this error on line 28:

ValueError: could not convert string to float:

So I changed from

clf.fit (X_train, y_train)

to


clf.fit (str((X_train, y_train)))

Then, on the same line, I got this error

TypeError: fit() missing 1 required positional argument: 'y'

Any suggestions on how to solve this issue?

Kind regards, and thanks for your time.

Solution

As mentioned in the comments below your question, your features and your labels are presumably strings. However, sklearn requires them to be numeric (sklearn is normally used with numpy arrays). If this is the case, you have to convert the elements of your dataframe from strings to numeric values.
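To make the two error messages from the question easier to place, here is a minimal sketch that reproduces both on made-up data. The arrays `X`, `y` and `X_as_strings` are invented for illustration and are not taken from your file. The first call fails because the features are strings; the second fails because `str((X, y))` collapses both arguments into one single string, so `fit()` never receives `y`.

```python
import numpy as np
from sklearn.svm import SVC

# Invented numeric data: 4 samples, 2 features, 2 classes.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]])
y = np.array([0, 1, 0, 1])

# The same samples kept as raw "index:value" strings, as in the question.
X_as_strings = [["1:0.0 2:1.0"], ["1:1.0 2:0.0"], ["1:0.2 2:0.9"], ["1:0.9 2:0.1"]]

clf = SVC(kernel="linear")
errors = {}

# Case 1: string features cannot be converted to floats.
try:
    clf.fit(X_as_strings, y)
except ValueError as exc:
    errors["string features"] = type(exc).__name__

# Case 2: str((X, y)) is ONE positional argument, so y is missing.
try:
    clf.fit(str((X, y)))
except TypeError as exc:
    errors["str() wrapper"] = type(exc).__name__

# Numeric X and y passed as two separate arguments work.
clf.fit(X, y)
print(errors)
print(clf.predict(X))
```

So wrapping the arguments in `str()` does not address the first error; it only replaces it with a different one. The conversion to numeric values is still needed.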

Looking at your code, I assume that each element of your feature column is a list of strings and each element of your label column is a single string. Here is an example of how such a dataframe can be converted to contain numeric values.

import numpy as np
import pandas as pd

df = pd.DataFrame({'features': [['5', '4.2'], ['3', '7.9'], ['2', '9']],
                   'label': ['1', '0', '0']})
print(type(df.features[0][0]))
print(type(df.label[0]))


def convert_to_float(collection):
    floats = [float(el) for el in collection]
    return np.array(floats)


df_numeric = pd.concat([df["features"].apply(convert_to_float),
                pd.to_numeric(df["label"])],
               axis=1)
print(type(df_numeric.features[0][0]))
print(type(df_numeric.label[0]))

However, the dataframe format described above is not the format sklearn models expect pandas dataframes to have. As far as I know, sklearn models expect each feature to be stored in a separate column, as is the case here:

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

feature_df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["feature_1", "feature_2"])
label_df = pd.DataFrame(np.array([[1], [0], [0]]), columns=["label"])
df = pd.concat([feature_df, label_df], axis=1)

X_train, X_test, y_train, y_test = train_test_split(df.drop(["label"], axis=1), df["label"], test_size=1 / 3)
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
clf.predict(X_test)

That is, after converting your dataframe so that it only contains numeric values, you'd have to create a separate column for each element in the lists of your feature column. You could do so like this:

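# Stack the per-row arrays into one 2-D array of shape (n_samples, n_features).
# Reshaping to df_numeric.shape only works by coincidence when n_features == 2.
arr = np.vstack(df_numeric["features"].tolist())
feature_names = [f"feature_{i + 1}" for i in range(arr.shape[1])]
# Use the numeric labels from df_numeric, not the original string labels.
df_sklearn_compatible = pd.concat([pd.DataFrame(arr, columns=feature_names, index=df_numeric.index),
                                   df_numeric["label"]],
                                  axis=1)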
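
Putting the pieces together for the layout shown in the question, the following is a rough sketch and not a tested solution for your file. It assumes every line looks like `label index:value index:value ...`, that all lines carry the same number of values in the same order, and that the `index:` prefixes can simply be dropped. The in-memory `sample_tsv` and the helper `parse_line` are hypothetical stand-ins; the numbers are invented.

```python
import io

import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in for the real TSV file (invented values).
sample_tsv = io.StringIO(
    "0\t1:0.10 2:-0.20 3:0.05\n"
    "0\t1:0.12 2:-0.25 3:0.02\n"
    "0\t1:0.08 2:-0.18 3:0.07\n"
    "0\t1:0.15 2:-0.22 3:0.04\n"
    "1\t1:0.90 2:0.80 3:0.70\n"
    "1\t1:0.85 2:0.75 3:0.65\n"
    "1\t1:0.95 2:0.82 3:0.72\n"
    "1\t1:0.88 2:0.79 3:0.68\n"
)


def parse_line(line):
    """Split 'label idx:value idx:value ...' into (int label, list of floats)."""
    label, *pairs = line.split()
    values = [float(pair.split(":", 1)[1]) for pair in pairs]
    return int(label), values


rows = [parse_line(line) for line in sample_tsv if line.strip()]

# One numeric column per feature, plus a numeric label series.
features = pd.DataFrame([values for _, values in rows])
features.columns = [f"feature_{i + 1}" for i in range(features.shape[1])]
labels = pd.Series([label for label, _ in rows], name="label")

# stratify keeps both classes present in the training set.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

For the real data you would replace `sample_tsv` with an open file handle. Since the `label index:value` layout resembles the svmlight/libsvm format, `sklearn.datasets.load_svmlight_file` may also be worth trying, though I have not checked it against your file.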
