Vectorize 2D character array column-wise


Question

I have a 2D numpy array like the following:

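import numpy as np

# dtype=object keeps the third column as integers; without it NumPy casts every entry to a string
a=np.array([["Science", "Blue", 3],
            ["Math", "Red", 4],
            ["Math", "Red", 5],
            ["Science", "Red", 3]], dtype=object)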

I need to convert it into numeric values column-wise, like the following (desired output):

out=np.array([[0, 0, 0],
              [1, 1, 1],
              [1, 1, 2], 
              [0, 1, 0]])

However, for downstream interpretability, I also need an output that lets me trace back from the numeric values to the original values. I was thinking of something like this:

trace_back_dict = {0: {0: "Science", 1: "Math"}, 
                   1: {0: "Blue", 1: "Red"}, 
                   2: {0: 3, 1: 4, 2: 5}}

Here the outer keys are the column indices of the original array, and each inner dict maps a numeric code to its original value.

Is there an easy way of doing this, preferably something in sklearn style, where I can do a fit_transform and then a transform (for train and test set purposes)?

I was looking at sklearn's LabelEncoder, and essentially what I need is to apply a different one to each column. Any suggestions on how to do this efficiently?

Thanks!

Jack

Recommended answer

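You can use OrdinalEncoder. It encodes each column independently and assigns codes in sorted order of the categories, so "Math" becomes 0 and "Science" becomes 1 rather than following the order of first appearance: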

In [24]: import sklearn.preprocessing

In [25]: a = [['Science', 'Blue', 3], ['Math', 'Red', 4], ['Math', 'Red', 5], ['Science', 'Red', 3]]

In [26]: enc = sklearn.preprocessing.OrdinalEncoder()

In [27]: enc.fit(a)
Out[27]: OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)

In [28]: enc.transform(a)
Out[28]: 
array([[1., 0., 0.],
       [0., 1., 1.],
       [0., 1., 2.],
       [1., 1., 0.]])

In [29]: enc.categories_
Out[29]: 
[array(['Math', 'Science'], dtype=object),
 array(['Blue', 'Red'], dtype=object),
 array([3, 4, 5], dtype=object)]

In [30]: trace_back_dict = {i: dict(enumerate(v)) for i, v in enumerate(enc.categories_)}

In [31]: trace_back_dict
Out[31]: {0: {0: 'Math', 1: 'Science'}, 1: {0: 'Blue', 1: 'Red'}, 2: {0: 3, 1: 4, 2: 5}}
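
For the train/test use case mentioned in the question, a minimal sketch might look like the following. The `train` and `test` arrays are hypothetical splits made up for illustration (they are not from the original post), and the code assumes a scikit-learn version that provides `OrdinalEncoder`. It fits on the training rows, reuses the learned categories on the test rows, and uses `inverse_transform` to recover the original values.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test data based on the question's example.
# dtype=object keeps the third column as integers instead of strings.
train = np.array([["Science", "Blue", 3],
                  ["Math", "Red", 4],
                  ["Math", "Red", 5],
                  ["Science", "Red", 3]], dtype=object)
test = np.array([["Math", "Blue", 5],
                 ["Science", "Red", 4]], dtype=object)

enc = OrdinalEncoder()

# Learn the categories of each column from the training data, then encode it.
train_codes = enc.fit_transform(train).astype(int)

# Encode the test data with the categories learned from the training data.
test_codes = enc.transform(test).astype(int)

# Map the codes back to the original values, column by column.
restored = enc.inverse_transform(test_codes)

# Same lookup structure as the trace-back dict requested in the question.
trace_back_dict = {i: dict(enumerate(cats)) for i, cats in enumerate(enc.categories_)}

print(test_codes)
print(restored)
```

By default, `transform` raises an error when the test data contains a value that was not seen during `fit`. Newer scikit-learn releases accept `handle_unknown="use_encoded_value"` together with `unknown_value=-1` in the `OrdinalEncoder` constructor to encode such values as -1 instead; whether that is appropriate depends on the downstream model.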
