Getting 'ValueError: shapes not aligned' on SciKit Linear Regression

Question

Quite new to SciKit and linear algebra/machine learning with Python in general, so I can't seem to solve the following:

I have a training set and a test set of data, containing both continuous and discrete/categorical values. The CSV files are loaded into Pandas DataFrames and match in column count, with shapes (1460, 81) and (1459, 81). However, after using Pandas' get_dummies, the shapes of the DataFrames change to (1460, 306) and (1459, 294). So, when I do linear regression with the SciKit Linear Regression module, it builds a model on 306 features and then tries to predict from data that has only 294. This then, naturally, leads to the following error:

ValueError: shapes (1459,294) and (306,1) not aligned: 294 (dim 1) != 306 (dim 0)

How could I tackle such a problem? Could I somehow reshape the (1459, 294) to match the other one?

Thanks and I hope I've made myself clear :)

Recommended answer

This is an extremely common problem when dealing with categorical data. There are differing opinions on how to best handle this.

One possible approach is to apply a function to categorical features that limits the set of possible options. For example, if your feature contained the letters of the alphabet, you could encode features for A, B, C, D, and 'Other/Unknown'. In this way, you could apply the same function at test time and abstract from the issue. A clear downside, of course, is that by reducing the feature space you may lose meaningful information.
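As a rough sketch of that first approach: the column values, the `ALLOWED` list, and the `encode_letter` helper below are all made up for illustration. The idea is to bucket anything outside a fixed vocabulary into 'Other' and encode against an explicit category list, so training and test data always produce the same dummy columns.

```python
import pandas as pd

# Hypothetical fixed vocabulary; every other value is bucketed as "Other".
ALLOWED = ["A", "B", "C", "D"]


def encode_letter(series: pd.Series) -> pd.DataFrame:
    """One-hot encode a column against a fixed set of categories."""
    # Keep values in the allowed set, replace everything else with "Other".
    limited = series.where(series.isin(ALLOWED), "Other")
    # Declaring the categories explicitly makes get_dummies emit a column
    # for every category, including ones absent from this particular data.
    limited = pd.Series(
        pd.Categorical(limited, categories=ALLOWED + ["Other"]),
        index=series.index,
    )
    return pd.get_dummies(limited, prefix="letter")


train = pd.Series(["A", "B", "C", "D", "Z"])
test = pd.Series(["A", "AA"])

train_encoded = encode_letter(train)
test_encoded = encode_letter(test)

# Both frames have the same five columns, in the same order.
print(list(train_encoded.columns) == list(test_encoded.columns))
```

Because the same function runs at training and test time, the column layout no longer depends on which values happen to appear in each file.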

Another approach is to build a model on your training data, with whichever dummies are naturally created, and treat that as the baseline for your model. When you predict with the model at test time, you transform your test data in the same way your training data is transformed. For example, if your training set had the letters of the alphabet in a feature, and the same feature in the test set contained a value of 'AA', you would ignore that in making a prediction. This is the reverse of your current situation, but the premise is the same. You need to create the missing features on the fly. This approach also has downsides, of course.

The second approach is what you mention in your question, so I'll go through it with pandas.

By using get_dummies you're encoding the categorical features into multiple one-hot encoded features. What you could do is force your test data to match your training data by using reindex, like this:

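import pandas as pd
# Encode the test data the same way the training data was encoded
test_encoded = pd.get_dummies(test_data, columns=['your columns'])
# Align to the training columns; dummies missing from the test data become 0
test_encoded_for_model = test_encoded.reindex(columns=training_encoded.columns,
                                              fill_value=0)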

This will encode the test data in the same way as your training data, filling in 0 for dummy features that weren't created by encoding the test data but were created during the training process. Dummy columns that exist only in the test data are dropped, since reindex keeps just the columns you ask for.
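To see the whole thing end to end, here is a small self-contained sketch. The toy data and the `area`, `zone`, and `price` columns are invented and stand in for your real CSV files. I also pass `dtype=float` to get_dummies so every column is numeric; that is my own choice, not something the approach requires.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data standing in for the real CSV files (all names are made up).
training_data = pd.DataFrame({
    "area": [50.0, 80.0, 120.0, 65.0],
    "zone": ["A", "B", "C", "A"],
    "price": [100.0, 160.0, 250.0, 125.0],
})
test_data = pd.DataFrame({
    "area": [70.0, 90.0],
    "zone": ["A", "D"],  # "B" and "C" never occur here, "D" is unseen
})

# Encode the training features and fit the model.
training_encoded = pd.get_dummies(
    training_data.drop(columns="price"), columns=["zone"], dtype=float
)
model = LinearRegression().fit(training_encoded, training_data["price"])

# Encode the test data, then align it to the training columns.
test_encoded = pd.get_dummies(test_data, columns=["zone"], dtype=float)
test_encoded_for_model = test_encoded.reindex(
    columns=training_encoded.columns, fill_value=0
)

# Shapes: (4, 4) for training, (2, 3) before alignment, (2, 4) after.
print(training_encoded.shape, test_encoded.shape, test_encoded_for_model.shape)

# The column counts now match, so predict no longer raises the ValueError.
predictions = model.predict(test_encoded_for_model)
```

Note that the row with the unseen zone 'D' ends up with all zone dummies set to 0, which is the "ignore it" behaviour described above.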

You could just wrap this into a function, and apply it to your test data on the fly. You don't need the encoded training data in memory (which I access with training_encoded.columns) if you create an array or list of the column names.
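Something along these lines would do it. The function name `encode_like_training` and the example column names are hypothetical, and the column list is assumed to have been saved once after encoding the training data.

```python
import pandas as pd


def encode_like_training(data, categorical_columns, training_columns):
    """One-hot encode `data` and align it to the saved training columns."""
    encoded = pd.get_dummies(data, columns=categorical_columns)
    # Add dummies the new data lacks (filled with 0) and drop unseen ones.
    return encoded.reindex(columns=training_columns, fill_value=0)


# Saved once at training time, e.g. list(training_encoded.columns).
training_columns = ["area", "zone_A", "zone_B", "zone_C"]

new_rows = pd.DataFrame({"area": [70.0], "zone": ["B"]})
aligned = encode_like_training(new_rows, ["zone"], training_columns)

print(list(aligned.columns))
```

Only the plain list of names has to be kept around (or written to disk next to the model), not the encoded training DataFrame itself.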
