Gradient descent as applied to a feature vector bag of words classification task


Question

I've watched the Andrew Ng videos over and over and still I don't understand how to apply gradient descent to my problem.

He deals pretty much exclusively in the realm of high level conceptual explanations but what I need are ground level tactical insights.

My inputs are feature vectors of the form:

Example:

Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]

The dictionary is:

["I", "am", "awesome", "great"]

So the documents as vectors would look like:

Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]
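
For concreteness, here is a minimal Python sketch of how such count vectors could be built from the tokenized documents; the vocabulary list and the vectorize helper are illustrative names only, not anything from the original post:

# Bag-of-words: count vocabulary words in each tokenized document.
documents = [
    ["I", "am", "awesome"],
    ["I", "am", "great", "great"],
]
vocabulary = ["I", "am", "awesome", "great"]

def vectorize(tokens, vocab):
    # One count per vocabulary word, in vocabulary order.
    return [tokens.count(word) for word in vocab]

vectors = [vectorize(doc, vocabulary) for doc in documents]
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 2]]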

According to what I've seen the algorithm for gradient descent looks like this:
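
The formula was posted as an image and is not reproduced here; for reference, the two-parameter gradient descent update from Andrew Ng's lectures, which the answer below discusses, is roughly:

repeat until convergence {
    Θ0 := Θ0 - α * (1/m) * Σ_{i=1..m} (h(x(i)) - y(i))
    Θ1 := Θ1 - α * (1/m) * Σ_{i=1..m} (h(x(i)) - y(i)) * x(i)
}

where h(x(i)) = Θ0 + Θ1 * x(i) is the model's prediction for sample i, and both updates are applied simultaneously.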

It is my current understanding that α is the learning rate, x(i) is a feature, in the above example for Document 2, x(3)=2.

y(i) is the label, in my case I'm trying to predict the Document associated with a particular feature vector so for instance y(0) would be associated with Document 1, & y(1) would represent Document 2.

There will potentially be many documents, let's say 10, so I could have 5 documents associated with y(0) and 5 documents associated with y(1), in which case m = 10.

The first thing I don't really understand is, what is the role of Θ0 & Θ1?

I suppose that they are the weight values, as with the perceptron algorithm; I apply them to the value of the feature in an effort to coax that feature, regardless of its inherent value, to output the value of the label with which it is associated. Is that correct? So I've been equating the Θ values with the weight values of the perceptron, is this accurate?

Moreover I don't understand what we're taking the gradient of. I really don't care to hear another high level explanation about walking on hills and whatnot; practically speaking, for the situation I've just detailed above, what are we taking the gradient of? Weights in two subsequent iterations? The value of a feature and its true label?

Thank you for your consideration, any insight would be greatly appreciated.

Answer

He deals pretty much exclusively in the realm of high level conceptual explanations but what I need are ground level tactical insights.

I found his videos the most practical and "ground level", especially since there is also code you can look at. Have you looked at it?

It is my current understanding that α is the learning rate, x(i) is a feature, in the above example for Document 2, x(3)=2.

Correct about α, wrong about x(i): x(i) is an instance or a sample. In your example, you have:

Document 1 = [1, 1, 1, 0] = x(1)
Document 2 = [1, 1, 0, 2] = x(2)

A feature would be, for example, x(1, 2) = 1 (the second feature of the first sample).
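
A rough Python illustration of that indexing (0-based lists here, so the x(1, 2) above corresponds to X[0][1]; the labels in y are arbitrary and only for illustration):

X = [
    [1, 1, 1, 0],  # x(1): Document 1, one whole sample / instance
    [1, 1, 0, 2],  # x(2): Document 2
]
y = [0, 1]         # y(1) and y(2): one label per sample

sample = X[0]      # x(1), an entire feature vector
feature = X[0][1]  # x(1, 2) == 1, a single feature of that sample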

y(i) is the label, in my case I'm trying to predict the Document associated with a particular feature vector so for instance y(0) would be associated with Document 1, & y(1) would represent Document 2.

Correct. Although I believe Andrew Ng's lectures use 1-based indexing, so that would be y(1) and y(2).

There will potentially be many documents, let's say 10, so I could have 5 documents associated with y(0) and 5 documents associated with y(1), in which case m = 10.

That's not how you should look at it. Each document will have its own label (a y value). Whether or not the labels are equal among them is another story. Document 1 will have label y(1) and document 5 will have label y(5). Whether or not y(1) == y(5) is irrelevant so far.

The first thing I don't really understand is, what is the role of Θ0 & Θ1?

Theta0 and Theta1 represent your model, which is the thing you use to predict your labels:

prediction = Theta * input
           = Theta0 * input(0) + Theta1 * input(1)

Where input(i) is the value of a feature, and input(0) is usually defined as always being equal to 1.

Of course, since you have more than one feature, you will need more than two Theta values. Andrew Ng goes on to generalize this process for more features in the lectures following the one where he presents the formula you posted.
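
A minimal sketch of that generalized prediction in Python, assuming one weight per vocabulary word plus the bias Theta0 (the predict helper and the example theta values are purely illustrative):

def predict(theta, features):
    # Theta0 * 1 + Theta1 * x1 + ... + Thetan * xn
    inputs = [1] + list(features)  # input(0) is always 1, as noted above
    return sum(t * x for t, x in zip(theta, inputs))

theta = [0.0, 0.1, 0.1, -0.2, 0.3]   # Theta0..Theta4, arbitrary values
print(predict(theta, [1, 1, 0, 2]))  # prediction for Document 2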

I suppose that they are the weight values, as with the perceptron algorithm; I apply them to the value of the feature in an effort to coax that feature, regardless of its inherent value, to output the value of the label with which it is associated. Is that correct? So I've been equating the Θ values with the weight values of the perceptron, is this accurate?

Yes, that is correct.

Moreover I don't understand what we're taking the gradient of. I really don't care to hear another high level explanation about walking on hills and whatnot; practically speaking, for the situation I've just detailed above, what are we taking the gradient of? Weights in two subsequent iterations? The value of a feature and its true label?

First of all, do you know what a gradient is? It's basically an array of partial derivatives, so it's easier to explain what we're taking the partial derivative of and with respect to what.

We are taking the partial derivative of the cost function (defined in Andrew Ng's lecture as the difference squared) with respect to each Theta value. All of these partial derivatives make up the gradient.

I really don't know how to explain it more practically. The closest from what you listed would be "the value of a feature and its true label", because the cost function tells us how good our model is, and its partial derivatives with respect to the weight of each feature kinda tell us how bad each weight is so far.
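
To make that concrete, here is a minimal sketch of one batch gradient descent step for that squared-error cost, under the same assumptions as the earlier sketch (the predict helper, the labels in y, and the value of alpha are all illustrative, not taken from the original post):

def predict(theta, features):
    # Theta0 * 1 + Theta1 * x1 + ... + Thetan * xn
    inputs = [1] + list(features)
    return sum(t * x for t, x in zip(theta, inputs))

def gradient_step(theta, X, y, alpha):
    # Update every theta using the partial derivative of the squared-error
    # cost with respect to that theta: (1/m) * sum((prediction - label) * input_j)
    m = len(X)
    new_theta = list(theta)
    for j in range(len(theta)):
        grad_j = 0.0
        for features, label in zip(X, y):
            inputs = [1] + list(features)
            grad_j += (predict(theta, features) - label) * inputs[j]
        new_theta[j] = theta[j] - alpha * grad_j / m
    return new_theta

X = [[1, 1, 1, 0], [1, 1, 0, 2]]   # the two document vectors from the question
y = [0, 1]                         # arbitrary labels, one per document
theta = [0.0] * (len(X[0]) + 1)    # Theta0 plus one weight per feature
for _ in range(100):
    theta = gradient_step(theta, X, y, alpha=0.1)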

You seem to be confusing features and samples again. A feature does not have labels. Samples or instances have labels. Samples or instances also consist of features.
