How does TensorFlow handle categorical features with multiple inputs within one column?


Question

For example, I have data in the following CSV format:

col0  col1  col2   col3
1     A     E|A|C  3
0     B     D|F    2
2     C     |      2

Each comma-separated column represents one feature. Normally a feature is one-hot (e.g. col0, col1, col3), but in this case the feature in col2 has multiple inputs (separated by |).

I'm sure TensorFlow can handle a one-hot feature with a sparse tensor, but I'm not sure whether it can handle features with multiple inputs like col2.

How should it be represented in TensorFlow's sparse tensor?

I am using the code below (but I don't know how to feed in col2):

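import tensorflow as tf  # TensorFlow 1.x-style feature column and estimator APIs

col0 = tf.feature_column.numeric_column('ID')
col1 = tf.feature_column.categorical_column_with_hash_bucket('Title', hash_bucket_size=1000)
col3 = tf.feature_column.numeric_column('Score')

# Note: DNNClassifier accepts only dense columns, so col1 would still need to be
# wrapped (e.g. in an indicator or embedding column) before training.
columns = [col0, col1, col3]

tf.estimator.DNNClassifier(
    model_dir=None,
    feature_columns=columns,
    hidden_units=[10, 10],
    n_classes=4
)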

Thanks for your help.

Recommended answer

OK, it looks like writing a custom feature column worked for me on the same task.

I took HashedCategoricalColumn as a base and cleaned it up to work with strings only. It should still get type checks added, though.

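import collections

import tensorflow as tf
# Private TensorFlow 1.x modules (assumed import paths; they may differ between versions).
from tensorflow.python.feature_column.feature_column import _CategoricalColumn
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import sparse_tensor as sparse_tensor_lib
from tensorflow.python.framework import tensor_shape
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import parsing_ops
from tensorflow.python.ops import string_ops


class _SparseArrayCategoricalColumn(
    _CategoricalColumn,
    collections.namedtuple('_SparseArrayCategoricalColumn',
                           ['key', 'num_buckets', 'category_delimiter'])):
  """Categorical column whose string cells hold several delimited categories."""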

  @property
  def name(self):
    return self.key

  @property
  def _parse_example_spec(self):
    return {self.key: parsing_ops.VarLenFeature(dtypes.string)}

  def _transform_feature(self, inputs):
    input_tensor = inputs.get(self.key)
    # Flatten to 1-D, then split each string on the delimiter; the result is a
    # SparseTensor with one row per example and one entry per category.
    flat_input = array_ops.reshape(input_tensor, (-1,))
    input_tensor = tf.string_split(flat_input, self.category_delimiter)

    if not isinstance(input_tensor, sparse_tensor_lib.SparseTensor):
      raise ValueError('SparseColumn input must be a SparseTensor.')

    sparse_values = input_tensor.values
    # tf.summary.text(self.key, flat_input)
    # Hash every category string into one of num_buckets integer ids.
    sparse_id_values = string_ops.string_to_hash_bucket_fast(
        sparse_values, self.num_buckets, name='lookup')

    return sparse_tensor_lib.SparseTensor(
        input_tensor.indices, sparse_id_values, input_tensor.dense_shape)


  @property
  def _variable_shape(self):
    if not hasattr(self, '_shape'):
      self._shape = tensor_shape.vector(self.num_buckets)
    return self._shape

  @property
  def _num_buckets(self):
    """Returns number of buckets in this sparse feature."""
    return self.num_buckets

  def _get_sparse_tensors(self, inputs, weight_collections=None,
                          trainable=None):
    return _CategoricalColumn.IdWeightPair(inputs.get(self), None)


def categorical_column_with_array_input(key,
                                        num_buckets, category_delimiter="|"):
  if (num_buckets is None) or (num_buckets < 1):
    raise ValueError('Invalid num_buckets {}.'.format(num_buckets))

  return _SparseArrayCategoricalColumn(key, num_buckets, category_delimiter)
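
To make the split-and-hash step concrete, here is a minimal pure-Python sketch of the kind of result `_transform_feature` produces for the col2 values from the question. It is not TensorFlow code: `split_to_coo` is a hypothetical helper, and CRC32 is only a stand-in for the fingerprint hash behind `string_to_hash_bucket_fast`, so real bucket ids will differ. It also assumes empty tokens are skipped, which is the default behavior of `tf.string_split`. The three return values mirror the `indices`, `values`, and `dense_shape` fields of a `tf.SparseTensor`.

```python
import zlib


def split_to_coo(rows, delimiter="|", num_buckets=100):
    """Mimic split-then-hash: return (indices, ids, dense_shape) in COO form."""
    indices, ids = [], []
    max_len = 0
    for row_idx, cell in enumerate(rows):
        # Drop empty tokens, so a cell like "|" contributes no entries.
        tokens = [t for t in cell.split(delimiter) if t]
        max_len = max(max_len, len(tokens))
        for col_idx, token in enumerate(tokens):
            indices.append([row_idx, col_idx])
            # Stand-in hash (CRC32); TensorFlow uses a different fingerprint.
            ids.append(zlib.crc32(token.encode("utf-8")) % num_buckets)
    return indices, ids, [len(rows), max_len]


indices, ids, dense_shape = split_to_coo(["E|A|C", "D|F", "|"])
print(indices)      # [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]]
print(dense_shape)  # [3, 3]
```

The third row ("|") has no entries at all, which is exactly what a sparse representation is good at: only the positions that actually hold a category are stored.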

It can then be wrapped in an embedding or indicator column, which seems to be what you need. For me this was only a first step: I need to handle columns with values like "str:float|str:float...".
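
For the "str:float|str:float" case mentioned above, one possible starting point is to separate each cell into a token list and a parallel weight list before any hashing happens. The sketch below is plain Python under stated assumptions: `parse_weighted_cell` is a hypothetical helper, the weight is assumed to follow the last ":" in each pair, and malformed pairs are rejected. In TensorFlow 1.x, `tf.feature_column.weighted_categorical_column` pairs a categorical column with a separate weight feature and may be worth a look for the next step.

```python
def parse_weighted_cell(cell, pair_delimiter="|", weight_delimiter=":"):
    """Split a cell like 'a:0.5|b:2' into (['a', 'b'], [0.5, 2.0])."""
    tokens, weights = [], []
    for pair in cell.split(pair_delimiter):
        if not pair:
            continue  # skip empty pieces such as those from a bare "|"
        # rpartition splits on the last ':', so earlier ':' stay in the token.
        token, sep, weight = pair.rpartition(weight_delimiter)
        if not sep:
            raise ValueError("expected token:weight, got {!r}".format(pair))
        tokens.append(token)
        weights.append(float(weight))
    return tokens, weights


tokens, weights = parse_weighted_cell("E:0.5|A:2|C:1.25")
print(tokens)   # ['E', 'A', 'C']
print(weights)  # [0.5, 2.0, 1.25]
```

Keeping tokens and weights in two aligned lists means both can share the same sparse indices, with the hashed token ids as one set of values and the floats as the other.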
