Why do we call .detach() before calling .numpy() on a PyTorch Tensor?

Problem description

It has been firmly established that my_tensor.detach().numpy() is the correct way to get a numpy array from a torch tensor.

I'm trying to get a better understanding of why.

In the accepted answer to the question just linked, Blupon states that:

You need to convert your tensor to another tensor that isn't requiring a gradient in addition to its actual value definition.

In the first discussion he links to, albanD states:

This is expected behavior because moving to numpy will break the graph and so no gradient will be computed.

If you don’t actually need gradients, then you can explicitly .detach() the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array.

In the second discussion he links to, apaszke writes:

Variable's can’t be transformed to numpy, because they’re wrappers around tensors that save the operation history, and numpy doesn’t have such objects. You can retrieve a tensor held by the Variable, using the .data attribute. Then, this should work: var.data.numpy().

I have studied the internal workings of PyTorch's autodifferentiation library, and I'm still confused by these answers. Why does moving to numpy break the graph? Is it because any operations on the numpy array will not be tracked in the autodiff graph?

What is a Variable? How does it relate to a tensor?

I feel that a thorough high-quality Stack-Overflow answer that explains the reason for this to new users of PyTorch who don't yet understand autodifferentiation is called for here.

In particular, I think it would be helpful to illustrate the graph through a figure and show how the disconnection occurs in this example:

import torch

tensor1 = torch.tensor([1.0,2.0],requires_grad=True)

print(tensor1)
print(type(tensor1))

tensor1 = tensor1.numpy()  # raises RuntimeError: Can't call numpy() on Tensor that requires grad

print(tensor1)
print(type(tensor1))

Solution

I think the most crucial point to understand here is the difference between a torch.tensor and an np.ndarray:
While both objects are used to store n-dimensional matrices (aka "Tensors"), torch.tensors have an additional "layer", which stores the computational graph leading to the associated n-dimensional matrix.

So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, np.ndarray and torch.tensor can be used interchangeably.

However, torch.tensors are designed to be used in the context of gradient descent optimization, and therefore they hold not only a tensor with numeric values, but (and more importantly) the computational graph leading to these values. This computational graph is then used (using the chain rule of derivatives) to compute the derivative of the loss function w.r.t each of the independent variables used to compute the loss.

As mentioned before, an np.ndarray object does not have this extra "computational graph" layer. Therefore, when converting a torch.tensor to an np.ndarray, you must explicitly remove the computational graph of the tensor using the detach() command.
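To make this concrete, here is a minimal sketch of the two cases, assuming a reasonably recent PyTorch. The variable names (t, detached, arr, direct_call_failed) are illustrative only and do not come from the question.

```python
import numpy as np
import torch

t = torch.tensor([1.0, 2.0], requires_grad=True)

# Calling .numpy() directly on a tensor that requires grad raises RuntimeError.
try:
    t.numpy()
    direct_call_failed = False
except RuntimeError:
    direct_call_failed = True

# detach() returns a tensor with the same values that is cut off from the graph,
# and that tensor can be converted to a NumPy array.
detached = t.detach()
arr = detached.numpy()

print(direct_call_failed)      # True
print(detached.requires_grad)  # False
print(type(arr))               # <class 'numpy.ndarray'>
```

The original tensor t is left untouched: it still requires grad and can still take part in later computations that need gradients.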


Computational Graph
From your comments it seems like this concept is a bit vague. I'll try and illustrate it with a simple example.
Consider a simple function of two (vector) variables, x and w:

x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)

y = x @ w  # inner-product of x and w
z = y ** 2  # square the inner product

If we are only interested in the value of z, we need not worry about any graphs: we simply move forward from the inputs, x and w, to compute y and then z.

However, what would happen if we do not care so much about the value of z, but rather want to ask the question "what is w that minimizes z for a given x"?
To answer that question, we need to compute the derivative of z w.r.t w.
How can we do that?
Using the chain rule we know that dz/dw = dz/dy * dy/dw. That is, to compute the gradient of z w.r.t w we need to move backward from z back to w computing the gradient of the operation at each step as we trace back our steps from z to w. This "path" we trace back is the computational graph of z and it tells us how to compute the derivative of z w.r.t the inputs leading to z:

z.backward()  # ask pytorch to trace back the computation of z

We can now inspect the gradient of z w.r.t w:

w.grad  # the resulting gradient of z w.r.t w
tensor([0.8010, 1.9746, 1.5904, 1.0408])

Note that this is exactly equal to

2*y*x
tensor([0.8010, 1.9746, 1.5904, 1.0408], grad_fn=<MulBackward0>)

since dz/dy = 2*y and dy/dw = x.
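The equality can also be checked programmatically instead of by eye. The sketch below repeats the computation and compares the autograd result with the hand-derived chain-rule expression; the seed and the names manual_grad_w and manual_grad_x are arbitrary choices made for this illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable

x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)

y = x @ w    # inner product of x and w
z = y ** 2   # square the inner product
z.backward() # fills in x.grad and w.grad

# Chain rule by hand: dz/dw = dz/dy * dy/dw = 2*y * x
manual_grad_w = (2 * y * x).detach()
# By the same argument with the roles swapped: dz/dx = 2*y * w
manual_grad_x = (2 * y * w).detach()

print(torch.allclose(w.grad, manual_grad_w))
print(torch.allclose(x.grad, manual_grad_x))
```

The detach() calls on the hand-computed values are there only so that the comparison tensors do not themselves carry a graph.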

Each tensor along the path stores its "contribution" to the computation:

z
tensor(1.4061, grad_fn=<PowBackward0>)

And

y
tensor(1.1858, grad_fn=<DotBackward>)

As you can see, y and z store not only the "forward" value of <x, w> or y**2 but also the computational graph: the grad_fn that is needed to compute the derivatives (using the chain rule) when tracing back the gradients from z (output) to w (inputs).

These grad_fn are essential components of torch.tensors, and without them one cannot compute derivatives of complicated functions. However, np.ndarrays do not have this capability at all, and they do not hold this information.

Please see this answer for more information on tracing back the derivative using the backward() function.
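As a rough illustration of what detach() discards, the following sketch inspects the grad_fn attributes and follows one link of the graph through next_functions. The exact node class names printed depend on the PyTorch version, so they are shown without expected values; parent_fn and z_detached are names invented for this example.

```python
import torch

x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)
y = x @ w
z = y ** 2

# Tensors created directly by the user (leaves) have no grad_fn;
# tensors produced by operations do.
print(x.grad_fn)                 # None
print(type(z.grad_fn).__name__)  # node for the power operation
print(type(y.grad_fn).__name__)  # node for the inner product

# z's node links back to the node that produced y.
parent_fn = z.grad_fn.next_functions[0][0]
print(type(parent_fn).__name__)  # same node type as y.grad_fn

# detach() keeps the value but drops the link to the graph.
z_detached = z.detach()
print(z_detached.grad_fn)        # None
print(z_detached.requires_grad)  # False
```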


Since both np.ndarray and torch.tensor have a common "layer" storing an n-d array of numbers, pytorch uses the same storage to save memory:

numpy() → numpy.ndarray
Returns self tensor as a NumPy ndarray. This tensor and the returned ndarray share the same underlying storage. Changes to self tensor will be reflected in the ndarray and vice versa.

The other direction works in the same way as well:

torch.from_numpy(ndarray) → Tensor
Creates a Tensor from a numpy.ndarray.
The returned tensor and ndarray share the same memory. Modifications to the tensor will be reflected in the ndarray and vice versa.

Thus, when creating an np.ndarray from a torch.tensor or vice versa, both objects reference the same underlying storage in memory. Since np.ndarray does not store or represent the computational graph associated with the array, this graph should be explicitly removed using detach() when numpy and torch are to reference the same tensor data.
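The shared storage can be observed directly. This is a small sketch for CPU tensors; the names t, arr and back are illustrative. Note that writes made through the NumPy array are not tracked by autograd, so this is shown only to demonstrate the sharing, not as a recommended way to update parameters.

```python
import numpy as np
import torch

t = torch.ones(3, requires_grad=True)
arr = t.detach().numpy()      # arr views the same memory as t

arr[0] = 5.0                  # write through the NumPy array
print(t)                      # the first element of t is now 5

back = torch.from_numpy(arr)  # a tensor over the same memory again
back[1] = 7.0                 # write through the new tensor
print(arr)                    # the second element of arr is now 7
print(back.requires_grad)     # False: nothing of the graph came back
```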


Note that if you wish, for some reason, to use pytorch only for mathematical operations without back-propagation, you can use the with torch.no_grad() context manager, in which case computational graphs are not created and torch.tensors and np.ndarrays can be used interchangeably.

import numpy as np
import torch

with torch.no_grad():
  x_t = torch.rand(3,4)
  y_np = np.ones((4, 2), dtype=np.float32)
  x_t @ torch.from_numpy(y_np)  # dot product in torch
  np.dot(x_t.numpy(), y_np)  # the same dot product in numpy
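The effect of no_grad() is easiest to see when an input does require grad. In the sketch below (names w, out, arr and tracked are made up for the illustration), the result computed inside the context carries no graph and converts to NumPy without detach(), while the same operation outside the context is tracked as usual.

```python
import torch

w = torch.rand(4, requires_grad=True)

with torch.no_grad():
    out = w * 2        # no graph is recorded for this operation
    arr = out.numpy()  # works without detach(): out does not require grad

tracked = w * 2        # same operation outside the context is tracked

print(out.requires_grad, out.grad_fn)  # False None
print(tracked.requires_grad)           # True
```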
