Vowpal Wabbit: how to represent categorical features

Question

I have the following data, in which every variable is categorical:

    class  education  income  social_standing
    1      basic      low     good
    0      low        high    v_good
    1      high       low     not_good
    0      v_high     high    good

Here education has four levels (basic, low, high and v_high), income has two levels (low and high), and social_standing has three levels (good, v_good and not_good).

As far as I understand, converting the above data to VW format gives something like this:

    1 |person education_basic income_low social_standing_good
    0 |person education_low income_high social_standing_v_good
    1 |person education_high income_low social_standing_not_good
    0 |person education_v_high income_high social_standing_good

Here 'person' is the namespace and everything else is a feature value, prefixed by its feature name. Am I correct? I find this representation of feature values quite perplexing. Is there any other way to represent features? I would be grateful for any help.

Recommended answer

Yes, you are correct.

This representation will definitely work with vowpal wabbit, but depending on the data it may not be optimal.

To represent non-ordered categorical variables (with discrete values), the standard vowpal wabbit trick is to use a logical/boolean feature for each possible (name, value) combination (e.g. person_is_good, color_blue, color_red). This works because vw implicitly assumes a value of 1 wherever a value is missing. There is no practical difference between color_red, color=red, color_is_red, or even (color,red) and color_red:1, except for their hash locations in memory. The only characters you cannot use in a variable name are the special separators (: and |) and white-space.

Terminology note: this trick of turning each (feature + value) pair into a separate feature is sometimes called "one-hot encoding".
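The one-hot trick described above can be sketched in a few lines of Python. This is only an illustration: the helper name to_vw_onehot, the "_" joiner and the default namespace "person" are assumptions made for this example and are not part of vw itself.

```python
# Hypothetical helper for illustration; not part of vowpal wabbit.
RESERVED = (":", "|", " ", "\t")  # characters that cannot appear in a feature name


def to_vw_onehot(label, row, namespace="person"):
    """Build one VW input line with one boolean feature per (name, value) pair."""
    feats = []
    for name, value in row.items():
        token = f"{name}_{value}"  # e.g. education_basic; the value 1 is implied
        if any(ch in token for ch in RESERVED):
            raise ValueError(f"reserved character in feature name: {token!r}")
        feats.append(token)
    return f"{label} |{namespace} " + " ".join(feats)


rows = [
    (1, {"education": "basic", "income": "low", "social_standing": "good"}),
    (0, {"education": "low", "income": "high", "social_standing": "v_good"}),
]
lines = [to_vw_onehot(label, row) for label, row in rows]
for line in lines:
    print(line)
```

Run on the first two rows of the question's table, this reproduces the first two lines of the VW input shown in the question.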

But in this case the variable values may not be "strictly categorical". They may be:

  • Strictly ordered, e.g. (low < basic < high < v_high)
  • Presumably monotonically related to the label you are trying to predict

So by making them "strictly categorical" (my term for a variable with a discrete range that does not have the two properties above), you may lose information that could help learning.

In your particular case, you may get better results by converting the values to numbers, e.g. (1, 2, 3, 4) for education. That is, you could use something like:

    1 |person education:2 income:1 social_standing:2
    0 |person education:1 income:2 social_standing:3
    1 |person education:3 income:1 social_standing:1
    0 |person education:4 income:2 social_standing:2
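A minimal Python sketch of this ordinal encoding follows. The LEVELS orderings are read off the numeric example above (low=1, basic=2, and so on) and are assumptions about the data. The helper name to_vw_ordinal is likewise hypothetical.

```python
# Assumed orderings, inferred from the numeric example; adjust to your data.
LEVELS = {
    "education": ["low", "basic", "high", "v_high"],
    "income": ["low", "high"],
    "social_standing": ["not_good", "good", "v_good"],
}


def to_vw_ordinal(label, row, namespace="person"):
    """Build one VW input line where each level is replaced by its 1-based rank."""
    feats = []
    for name, value in row.items():
        rank = LEVELS[name].index(value) + 1  # raises ValueError on unknown levels
        feats.append(f"{name}:{rank}")
    return f"{label} |{namespace} " + " ".join(feats)


line = to_vw_ordinal(
    1, {"education": "basic", "income": "low", "social_standing": "good"}
)
print(line)
```

Whether evenly spaced ranks are a good approximation depends on the data. The ranks only need to reflect the ordering and, roughly, the distances between levels.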

The training set in the question should work fine, because even when you convert all your discrete variables into boolean variables as you did, vw should discover both the ordering and the monotonicity with the label from the data itself, as long as the two properties above hold and there is enough data to deduce them.

Here is a short cheat sheet for encoding variables in vowpal wabbit:

Variable type       How to encode                readable example
-------------       -------------                ----------------
boolean             only encode the true case    is_alive
categorical         append value to name         color=green
ordinal+monotonic   :approx_value                education:2
numeric             :actual_value                height:1.85

Final notes:

  • In vw all variables are numeric. The encoding tricks are just practical ways to make things appear categorical or boolean. Boolean variables are simply numeric 0 or 1; categorical variables can be encoded as booleans: name+value:1.
  • Any variable whose value is not monotonic with the label may be less useful when numerically encoded.
  • Any variable that is not linearly related to the label may benefit from a non-linear transformation before training.
  • Any variable with a zero value makes no difference to the model (exception: when the --initial_weight <value> option is used), so it can be dropped from the training set.
  • When parsing a feature, only : is treated as a special separator (between the variable name and its numeric value); everything else is considered part of the name, and the whole name string is hashed to a location in memory. A missing :<value> part implies :1.
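The parsing rule in the last note, together with the note about zero values, can be mimicked in Python as below. This is a sketch of the rule as stated here, not vw's actual parser, and the function names split_feature and nonzero_features are made up for the example.

```python
def split_feature(token):
    """Split 'name:value' into (name, value); a missing ':value' part means 1."""
    name, sep, raw = token.partition(":")
    if not sep:
        return token, 1.0
    return name, float(raw)


def nonzero_features(tokens):
    """Drop features whose value is zero, since they do not affect the model."""
    return [t for t in tokens if split_feature(t)[1] != 0.0]


print(split_feature("color=green"))
print(split_feature("height:1.85"))
print(nonzero_features(["is_alive", "height:0", "education:2"]))
```

Note that "=" in color=green is just part of the name; only ":" separates a name from its value.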

What about name spaces?

Name spaces are prepended to feature names with a special-character separator, so identical feature names in different name spaces map to different hash locations. Example:

    |E low |I low

is essentially equivalent to the flat example without name spaces:

    |  E^low:1 I^low:1
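The equivalence above can be illustrated with a small Python function that rewrites a line with name spaces into the flat form. It follows the "^" notation used in this answer purely for illustration. How vw actually combines a name space with a feature name when hashing is an internal detail that this sketch does not claim to reproduce. The function name flatten_namespaces is hypothetical, and name-space weights (such as |E:2) are not handled.

```python
def flatten_namespaces(line):
    """Rewrite '|ns feat ...' sections as flat 'ns^feat:value' tokens."""
    head, *sections = line.split("|")  # head holds the label part, if any
    flat = []
    for section in sections:
        tokens = section.split()
        if not tokens:
            continue
        # A section that starts with white-space has no name space.
        if section[0].isspace():
            prefix, feats = "", tokens
        else:
            prefix, feats = tokens[0] + "^", tokens[1:]
        for feat in feats:
            name, sep, value = feat.partition(":")
            flat.append(f"{prefix}{name}:{value if sep else '1'}")
    return f"{head}| " + " ".join(flat)


flat_line = flatten_namespaces("|E low |I low")
print(flat_line)
```

The two "low" features end up with different full names (E^low and I^low), which is why they land in different hash locations.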

The main use of name spaces is to easily redefine all members of a name space to something else, ignore a whole name space of features, cross the features of one name space with another, etc. (see the -q, --cubic, --redefine, --ignore and --keep options).
