Patsy: New levels in categorical fields in test data

Question

I am trying to use Patsy (together with sklearn and pandas) to build a simple regression model. The R-style formula syntax is a major draw.

My data contains a field called 'ship_city', which can hold any city in India. Since I am partitioning the data into train and test sets, several cities appear in only one of the two sets. A code snippet is given below:

from patsy import dmatrices, build_design_matrices

df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
# Reuse the training design on the test data (this is the call that fails).
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder], df_test, return_type='dataframe')

The last line throws the following error:

patsy.PatsyError: Error converting data to categorical: observation with value 'Kolkata' does not match any of the expected levels

I believe this is a very common use case: the training data will not contain every level of every categorical field. Sklearn's DictVectorizer handles this quite well.

Is there any way I can make this work with Patsy?

Recommended answer

The problem, of course, is that if you just give patsy a raw list of values, it has no way to know that other values could occur as well. You have to tell it somehow what the complete set of possible values is.

One way is to use the levels= argument to C(...), like:

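from patsy import dmatrices

# If you have a data frame with all the data before splitting:
all_cities = sorted(df_all["Cities"].unique())
# Alternative approach, if you only have the two splits:
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))

# all_cities is looked up in the calling scope when the formula is evaluated.
dmatrices("y ~ C(Cities, levels=all_cities)", data=df_train)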
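To see the levels= approach end to end, here is a minimal sketch on made-up data. The toy frames, the values in them, and the variable names train_X / test_X are assumptions for illustration, not from the question. It also assumes a patsy version in which build_design_matrices accepts the design_info objects directly.

```python
import pandas as pd
from patsy import dmatrices, build_design_matrices

# Hypothetical toy data: 'Kolkata' appears only in the test split.
df_train = pd.DataFrame({"y": [1.0, 2.0, 3.0],
                         "Cities": ["Delhi", "Mumbai", "Delhi"]})
df_test = pd.DataFrame({"y": [4.0, 5.0],
                        "Cities": ["Kolkata", "Mumbai"]})

# Full set of levels across both splits.
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))

# Build the training design, declaring every level up front.
train_y, train_X = dmatrices("y ~ C(Cities, levels=all_cities)",
                             data=df_train, return_type="dataframe")

# Reuse the training design for the test data; 'Kolkata' is now a known level.
test_y, test_X = build_design_matrices(
    [train_y.design_info, train_X.design_info], df_test,
    return_type="dataframe")

print(train_X.columns.tolist())
print(test_X)
```

Note that the column for a level that never occurs in the training split is all zeros there, so an unregularized regression cannot estimate a meaningful coefficient for it.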

Another option, if you're using pandas's built-in categorical support, is to record the set of possible values when you set up your data frame. If patsy detects that the object you've passed it is a pandas categorical, it automatically uses the pandas categories attribute instead of trying to guess the possible categories by looking at the data.
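A minimal sketch of the pandas categorical route, again on made-up data (df_all, its values, and the split points are assumptions for illustration). The column is converted before splitting so that each slice carries the full category list.

```python
import pandas as pd
from patsy import dmatrices, build_design_matrices

# Hypothetical toy data covering every city before the split.
df_all = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Cities": ["Delhi", "Mumbai", "Delhi", "Kolkata", "Mumbai"],
})

# Convert before splitting: slices keep all categories, even unused ones.
df_all["Cities"] = df_all["Cities"].astype("category")

df_train = df_all.iloc[:3]   # contains Delhi and Mumbai only
df_test = df_all.iloc[3:]    # contains Kolkata

# No C(..., levels=...) needed; patsy reads the categories from pandas.
train_y, train_X = dmatrices("y ~ Cities", data=df_train,
                             return_type="dataframe")
test_y, test_X = build_design_matrices(
    [train_y.design_info, train_X.design_info], df_test,
    return_type="dataframe")

print(train_X.columns.tolist())
```

If the splits are loaded separately instead, the same effect should be achievable by building each column with pd.Categorical(values, categories=all_cities), using one shared list of categories.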
