RuntimeError during one-hot encoding


Problem description

I have a dataset where class values go from -2 to 2 in steps of 1 (i.e., -2, -1, 0, 1, 2) and where 9 identifies the unlabelled data. When I one-hot encode the labels with

self._one_hot_encode(labels)

I get the following error: RuntimeError: index 1 is out of bounds for dimension 1 with size 1

It is raised by this line:

self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)

The error seems to come from a label tensor such as [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1], where 9 is the value my mapping assigns to unlabelled rows. It is unclear to me how to fix it, even after going through past questions and answers on similar problems (e.g., "index 1 is out of bounds for dimension 0 with size 1"). The part of the code involved in the error is the following:

def _one_hot_encode(self, labels):
    # Get the number of classes
    classes = torch.unique(labels)
    classes = classes[classes != 9] # unlabelled 
    self.n_classes = classes.size(0)

    # One-hot encode labeled data instances and zero rows corresponding to unlabeled instances
    unlabeled_mask = (labels == 9)
    labels = labels.clone()  # defensive copying
    labels[unlabeled_mask] = 0
    self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
    self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
    self.one_hot_labels[unlabeled_mask, 0] = 0

    self.labeled_mask = ~unlabeled_mask

def fit(self, labels, max_iter, tol):
    
    self._one_hot_encode(labels)

    self.predictions = self.one_hot_labels.clone()
    prev_predictions = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)

    for i in range(max_iter):
        # Stop iterations if the system is considered at a steady state
        variation = torch.abs(self.predictions - prev_predictions).sum().item()
        

        prev_predictions = self.predictions
        self._propagate()

Example of the dataset:

    ID  Target  Weight  Label  Score  Scale_Cat  Scale_num
0   A   D       65.1    1      87     Up         1
1   A   X       35.8    1      87     Up         1
2   B   C       34.7    1      37.5   Down       -2
3   B   P       33.4    1      37.5   Down       -2
4   C   B       33.1    1      37.5   Down       -2
5   S   X       21.4    0      12.5   NA         9

The source code I am using as a reference is here: https://mybinder.org/v2/gh/thibaudmartinez/label-propagation/master?filepath=notebook.ipynb

Full error traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-126-792a234f63dd> in <module>
      4 label_propagation = LabelPropagation(adj_matrix_t)
----> 6 label_propagation.fit(labels_t) # causing error
      7 label_propagation_output_labels = label_propagation.predict_classes()
      8 

<ipython-input-115-54a7dbc30bd1> in fit(self, labels, max_iter, tol)
    100 
    101     def fit(self, labels, max_iter=1000, tol=1e-3):
--> 102         super().fit(labels, max_iter, tol)
    103 
    104 ## Label spreading

<ipython-input-115-54a7dbc30bd1> in fit(self, labels, max_iter, tol)
     58             Convergence tolerance: threshold to consider the system at steady state.
     59         """
---> 60         self._one_hot_encode(labels)
     61 
     62         self.predictions = self.one_hot_labels.clone()

<ipython-input-115-54a7dbc30bd1> in _one_hot_encode(self, labels)
     42         labels[unlabeled_mask] = 0
     43         self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
---> 44         self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
     45         self.one_hot_labels[unlabeled_mask, 0] = 0
     46 

RuntimeError: index 1 is out of bounds for dimension 1 with size 1

Recommended answer

I ran through your notebook (I think you changed the 9 to -1 to get things to run) and saw that this part of the code:

# Learn with Label Propagation
label_propagation = LabelPropagation(adj_matrix_t)
print("Label Propagation: ", end="")
label_propagation.fit(labels_t)
label_propagation_output_labels = label_propagation.predict_classes()

ultimately calls:

self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)

which is where things go wrong.

Take a brief moment to read the PyTorch documentation on scatter (torch Scatter). It shows that, to use scatter, you need to understand the dim, index, src and self arguments. For one-hot encoding, either dim=0 or dim=1 can be used as long as the index and self matrices are laid out to match, and src is simply the value 1 (we'll look at this a little more later). You are currently calling scatter on dimension 1 with an index matrix of shape [40, 1] and a result (self) matrix of shape [40, 5].

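I see two problems here:

  1. You are using the literal class values (-2, -1, 0, 1, 2) as the encoding indices in your index matrix. scatter treats each of those values as a column position in the result (self) matrix, and any value outside the valid column range fails. This is where the index out of bounds error comes from.
  2. You mention 6 classes (-2, -1, 0, 1, 2, plus 9 for unlabelled), but you are one-hot encoding into 5 classes. (Yes, I know you want the unlabelled class to be all zeros, but that is a little harder to achieve with scatter. I'll explain later.)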
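The first problem can be reproduced in a few lines. This is a minimal sketch that assumes a tiny label tensor containing only the class 1 and the unlabelled marker 9, mirroring the labels in the question; the variable names are illustrative and not taken from the notebook. With a single labelled class, the result matrix has one column, so the label value 1 is not a valid column index.

```python
import torch

# Assumed toy labels: only class 1 is present, 9 marks an unlabelled row
labels = torch.tensor([1, 1, 9, 1])
unlabeled_mask = labels == 9

# Only one distinct labelled class, so the one-hot matrix gets one column
classes = torch.unique(labels[~unlabeled_mask])
index = labels.clone()
index[unlabeled_mask] = 0
one_hot = torch.zeros(labels.size(0), classes.size(0))  # shape [4, 1]

# The raw label value 1 is used as a column index, but only column 0 exists
try:
    one_hot.scatter(1, index.unsqueeze(1), 1)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(error_message)
```

On CPU this should print an "out of bounds" message like the one in the question, because the index values are class values rather than column positions.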

So how do we fix this?

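import torch
src = torch.tensor([[1],[1],[1],[1],[1],[0]])  # assumed values, chosen to match the output shown below (0 in the last row)
index = torch.tensor([[5],[0],[3],[5],[1],[4]]); print(index.shape); print(index)
result = torch.zeros(6, 6, dtype=src.dtype).scatter_(1, index, src); print(result.shape); print(result)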

This gives us:

torch.Size([6, 1])
tensor([[5],
        [0],
        [3],
        [5],
        [1],
        [4]])
torch.Size([6, 6])
tensor([[0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]])

The index matrix holds 6 observations with 1 observed value (the category) each. The self matrix holds 6 observations, each with a 6-category one-hot encoding vector. With scatter(dim=1), torch goes through each row (observation) and sets the entry of self in that row, at the column given by the value stored in index, to the value stored in src at the same position as that index entry:

self[i][index[i][j][k]][k] = src[i][j][k]
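For a 2-D tensor and dim=1, this rule reduces to self[i][index[i][j]] = src[i][j]. The pure-Python sketch below spells that out with nested lists. The helper name scatter_dim1 is hypothetical, and the src values are an assumption chosen so that the result matches the 6 x 6 output shown earlier (a 0 for the last observation, 1 everywhere else).

```python
def scatter_dim1(result, index, src):
    """Illustrate the 2-D, dim=1 rule: result[i][index[i][j]] = src[i][j]."""
    for i, row in enumerate(index):
        for j, col in enumerate(row):
            # Every index value must be a valid column of the result row
            if not 0 <= col < len(result[i]):
                raise IndexError(
                    f"index {col} is out of bounds for dimension 1 "
                    f"with size {len(result[i])}"
                )
            result[i][col] = src[i][j]
    return result


index = [[5], [0], [3], [5], [1], [4]]
src = [[1], [1], [1], [1], [1], [0]]  # assumed values, see the note above
result = scatter_dim1([[0] * 6 for _ in range(6)], index, src)

for row in result:
    print(row)
```

Passing a class value such as -2 or 9 as an index to this helper fails the bounds check for the same reason that scatter does.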

So in your case you were trying to write the value 1 into a row of self, which has shape [40, 1], at the column given by index[0] (which is equal to 1). That gives you the error in the question. When I checked your notebook, though, the error was "index -1 is out of bounds for dimension 1 with size 5". Both have the same root cause.

In this case it is simply easier to do a complete one-hot encoding than a one-hot encoding with "cold" (all-zero) rows. The reason is that for one-hot with cold encodings you need to put a 0 in the src matrix for every unlabelled observation, which is much more painful than just using 1 for src. Also, after reading this link (Is it valid to have full zeros for OHE?), I think it makes more sense to use a one-hot vector for every category.
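For comparison, here is what the "cold" variant might look like. This is only a sketch under stated assumptions: the six example labels are made up, the shift by 2 assumes the classes are exactly -2 to 2, and unlabelled rows are pointed at column 0 as a placeholder. The key point is that src must carry a 0 for every unlabelled row.

```python
import torch

# Assumed toy labels: classes -2..2, with 9 marking unlabelled rows
labels = torch.tensor([-2, 9, 1, 2, 9, 0])
unlabeled_mask = labels == 9

# Shift -2..2 to columns 0..4; unlabelled rows get placeholder column 0
index = torch.where(unlabeled_mask, torch.zeros_like(labels), labels + 2)
index = index.unsqueeze(1)  # shape [6, 1]

# src holds 1 for labelled rows and 0 for unlabelled rows
src = (~unlabeled_mask).long().unsqueeze(1)  # shape [6, 1]

cold_one_hot = torch.zeros(6, 5, dtype=torch.long).scatter_(1, index, src)
print(cold_one_hot)
```

Rows 1 and 4 come out as all zeros because their src entry is 0, at the cost of building both a mask and a per-row src tensor.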

So, for the second problem, we simply need to map the categories to indices of the result (self) matrix. Since we have 6 categories, we just map them to 0, 1, 2, 3, 4, 5. A simple lambda function does the trick. I used a random sampler to draw my data labels from a class list, as shown below (I randomly created 40 observations from the 6 classes):

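import random
import torch

classes = [-2, -1, 0, 1, 2, 9]

# Map each class to a column index: -2..2 -> 0..4, and 9 (unlabelled) -> 5
to_index = lambda x: x + 2 if x != 9 else 5

# Randomly draw 40 observations from the 6 classes; each row holds one index
labels = [[to_index(random.choice(classes))] for _ in range(40)]

index_aka_labels = torch.tensor(labels)
print(index_aka_labels)
print(index_aka_labels.shape)
print(torch.zeros(40, 6, dtype=torch.long).scatter_(1, index_aka_labels, 1))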

Finally, we have achieved our desired OHE result:

tensor([[0, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1, 0],
        ... (40 observations)
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1],
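The hard-coded x + 2 mapping can also be derived from the data itself. The sketch below is one possible alternative, not the method used in the notebook: it assumes the labels are a 1-D integer tensor, uses torch.unique with return_inverse=True to turn each label into its position among the sorted distinct values, and then calls torch.nn.functional.one_hot instead of scatter. Because 9 is the largest value here, the unlabelled class lands in the last column.

```python
import torch
import torch.nn.functional as F

# Assumed toy labels: classes -2..2, with 9 marking unlabelled rows
labels = torch.tensor([-2, 1, 9, 2, -1, 0])

# classes: sorted distinct values; inverse: position of each label in classes
classes, inverse = torch.unique(labels, return_inverse=True)

# One column per distinct value, including the unlabelled marker
one_hot = F.one_hot(inverse, num_classes=classes.numel())

print(classes)
print(one_hot)
```

This keeps every index inside the valid column range by construction, whatever the raw label values are.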
