Neural network with categorical variables (enum) as inputs


Problem description

I'm trying to solve some machine-learning problems using neural networks, mostly with NEAT (NeuroEvolution of Augmenting Topologies).

Some of my input variables are continuous, but some of them are of a categorical nature, like:

  • Species: {Lion, Leopard, Tiger, Jaguar}
  • Branches of trade: {Health care, Insurance, Finance, IT, Advertising}

At first I wanted to model such a variable by mapping the categories to discrete numbers, like:

{Lion:1, Leopard:2, Tiger:3, Jaguar:4}

But I'm afraid this imposes an arbitrary topology (an ordering and a distance) on the variable. A Tiger is not the sum of a Lion and a Leopard.
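To make the concern concrete, here is a minimal sketch of what the integer coding implies. The mapping is the one from the question; the helper `code_distance` is a hypothetical name used only for illustration.

```python
# Integer coding taken from the question.
SPECIES_CODE = {"Lion": 1, "Leopard": 2, "Tiger": 3, "Jaguar": 4}


def code_distance(a: str, b: str) -> int:
    """Absolute difference between the integer codes of two species."""
    return abs(SPECIES_CODE[a] - SPECIES_CODE[b])


# The coding makes a Lion "closer" to a Leopard than to a Jaguar,
# although nothing in the data says so.
lion_leopard = code_distance("Lion", "Leopard")
lion_jaguar = code_distance("Lion", "Jaguar")

# Arithmetic on the codes is meaningless, but the network sees it as valid:
# code(Lion) + code(Leopard) == code(Tiger).
lion_plus_leopard_is_tiger = (
    SPECIES_CODE["Lion"] + SPECIES_CODE["Leopard"] == SPECIES_CODE["Tiger"]
)
```

Any reordering of the dictionary would give different distances, which is exactly why the induced topology is arbitrary.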

What approaches to this problem are usually employed?

Solution

Unfortunately there is no perfect solution; each approach leads to some kind of problem:

  • Your solution adds a topology, as you mentioned. This may not be that bad, since a NN can fit arbitrary functions and represent "ifs", but in many cases it will hurt, because NNs often get stuck in local minima.
  • You can encode the data as indicator features of the form is_categorical_feature_i_equal_j (one-hot encoding). This does not induce any additional topology, but it adds one input per category value, so the number of features can grow considerably. Instead of "species" you get the features "is_lion", "is_leopard", etc., and exactly one of them equals 1 at a time.
  • If you have a large amount of data compared to the number of possible categorical values (for example, 10000 data points and only 10 possible values), you can also split the problem into 10 independent ones, each trained on the data for one particular value (so we have a "neural network for lions", a "neural network for jaguars", etc.).
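The indicator-feature encoding from the second bullet can be sketched in plain Python as follows. The names `one_hot` and `encode_row`, and the continuous feature `weight_kg`, are assumptions made for this example and do not come from the original answer.

```python
from typing import List, Sequence

# Category values from the question; the order only fixes the column layout.
SPECIES = ["Lion", "Leopard", "Tiger", "Jaguar"]


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """Return one indicator per category; exactly one entry is 1.0."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode_row(weight_kg: float, species: str) -> List[float]:
    """Concatenate a (hypothetical) continuous input with the indicators."""
    return [weight_kg] + one_hot(species, SPECIES)


# One continuous input plus is_lion, is_leopard, is_tiger, is_jaguar.
row = encode_row(190.0, "Tiger")
```

Each categorical variable contributes as many network inputs as it has values, so a variable with k values costs k inputs instead of one.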

The first two approaches are the two "extreme" cases: the first is computationally very cheap but can lead to high bias, while the second introduces a lot of complexity but should not influence the classification process itself. The last one is rarely usable (it assumes a small number of categorical values), yet it is quite reasonable in terms of machine learning.
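The third approach, one model per category value, could look roughly like the sketch below. The function `fit_placeholder` is a stand-in that only averages the targets; in practice it would be replaced by whatever training routine is used (for example a NEAT run), which is not shown here. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (category value, continuous inputs, target)
Sample = Tuple[str, List[float], float]
Row = Tuple[List[float], float]


def split_by_category(samples: List[Sample]) -> Dict[str, List[Row]]:
    """Group samples by category value, dropping the category column."""
    groups: Dict[str, List[Row]] = defaultdict(list)
    for category, inputs, target in samples:
        groups[category].append((inputs, target))
    return dict(groups)


def fit_placeholder(rows: List[Row]) -> float:
    """Stand-in for real training: returns the mean target of the group."""
    return sum(target for _, target in rows) / len(rows)


samples = [
    ("Lion", [190.0], 1.0),
    ("Jaguar", [95.0], 0.0),
    ("Lion", [170.0], 0.0),
]
groups = split_by_category(samples)
# One independent "model" per category value.
models = {category: fit_placeholder(rows) for category, rows in groups.items()}
```

Each sub-model sees only its own group's data, which is why this only works when every category value has enough samples.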
