Normalizing by max value or by total value?


Question

I'm doing some work that involves document comparison. To do this, I'm analyzing each document and counting the number of times certain key words appear in each of them. For instance:

Document 1:                          Document 2:
    Book   -> 3                          Book   -> 9
    Work   -> 0                          Work   -> 2
    Dollar -> 5                          Dollar -> 1
    City   -> 18                         City   -> 6

So after the counting process, I store each sequence of counts in a vector. This sequence of numbers represents the feature vector for each document.

Document 1: [ 3,  0,  5, 18]
Document 2: [ 9,  2,  1,  6]

The final step would be to normalize the data to the range [0, 1]. But here is where I realized this could be done following two different approaches:

  1. Dividing each count by the total number of occurrences in the document
  2. Dividing each count by the maximum count in the document

Following the first approach (dividing by the total), the result of the normalization would be:

Document 1: [ 0.11538,  0.00000,  0.19231, 0.69231]   (divided by 26)
Document 2: [ 0.50000,  0.11111,  0.05556, 0.33333]   (divided by 18)

While following the second approach (dividing by the maximum), the result would be:

Document 1: [ 0.16667,  0.00000,  0.27778, 1.00000]   (divided by 18)
Document 2: [ 1.00000,  0.22222,  0.11111, 0.66667]   (divided by  9)
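Both normalizations can be reproduced with a short Python sketch (the helper names are mine; the counts are the ones from the example above):

```python
def normalize_by_total(counts):
    """Divide each count by the sum of all counts in the document."""
    total = sum(counts)
    return [c / total for c in counts]

def normalize_by_max(counts):
    """Divide each count by the largest count in the document."""
    peak = max(counts)
    return [c / peak for c in counts]

doc1 = [3, 0, 5, 18]
doc2 = [9, 2, 1, 6]

print([round(v, 5) for v in normalize_by_total(doc1)])  # [0.11538, 0.0, 0.19231, 0.69231]
print([round(v, 5) for v in normalize_by_max(doc1)])    # [0.16667, 0.0, 0.27778, 1.0]
```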

For this particular case:

  • Which of the two approaches enhances the representation and comparison of the feature vectors?
  • Would the results be the same?
  • Would either of these approaches be more effective if a particular similarity measure (Euclidean, cosine) is used?

Answer

Notation

Suppose you have two vectors A and B, and you use x as the normalization constant for A and y as the normalization constant for B. Since you are counting word occurrences, we can assume x > 0 and y > 0.

For the cosine distance, the normalization constants cancel out: you end up with a factor of 1/(xy) in the numerator and the identical factor 1/(xy) in the denominator, so 1/(xy) cancels:

    cos(A', B') = (A/x · B/y) / (||A/x|| ||B/y||)
                = [1/(xy)] (A · B) / ([1/(xy)] ||A|| ||B||)
                = cos(A, B)
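This cancellation can be checked numerically with a small sketch (using the counts from the question, and deliberately normalizing the two documents by different constants):

```python
import math

def cosine_similarity(a, b):
    """Plain cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

doc1 = [3, 0, 5, 18]
doc2 = [9, 2, 1, 6]

raw = cosine_similarity(doc1, doc2)
# Scale doc1 by its sum and doc2 by its max -- two different positive constants:
scaled = cosine_similarity([v / sum(doc1) for v in doc1],
                           [v / max(doc2) for v in doc2])
print(abs(raw - scaled) < 1e-12)  # True: the constants cancel out
```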

For Euclidean distance, that is not the case. Here is an example assuming A and B are 2-d vectors; an n-dimensional vector is just a simple extension of this. A' = A/x and B' = B/y are the normalized versions of A and B respectively. With A = (x1, x2) and B = (y1, y2):

    dist(A', B')^2 = (x1/x - y1/y)^2 + (x2/x - y2/y)^2
                   = (x1^2 + x2^2)/x^2 + (y1^2 + y2^2)/y^2 - 2(x1*y1 + x2*y2)/(xy)

Comparing the unnormalized dist(A, B) with the normalized dist(A', B'), you can see that the normalization constant you choose (max or sum) determines the weights on x1^2 + x2^2, on y1^2 + y2^2, and on the interaction term. As a result, different normalization constants give you different distances.
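A quick sketch makes the difference concrete (again using the counts from the question): the sum-normalized and max-normalized feature vectors are not the same distance apart.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

doc1 = [3, 0, 5, 18]
doc2 = [9, 2, 1, 6]

by_total = euclidean([v / sum(doc1) for v in doc1],
                     [v / sum(doc2) for v in doc2])
by_max = euclidean([v / max(doc1) for v in doc1],
                   [v / max(doc2) for v in doc2])
print(by_total, by_max)  # two different values: the choice of constant matters
```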

If this is for some information retrieval purpose or for topic extraction, have you tried TF-IDF? That might be a better measure than purely counting the occurrences of terms.
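As a rough illustration of the TF-IDF idea, here is a hand-rolled sketch (my own helper, not a substitute for a library implementation such as scikit-learn's TfidfVectorizer): term frequency is the count divided by the document total, and each term is down-weighted by how many documents it appears in.

```python
import math

def tf_idf(docs):
    """docs: list of dicts mapping term -> raw count.
    Returns one dict per document mapping term -> tf-idf weight."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term actually occur?
    df = {}
    for doc in docs:
        for term, count in doc.items():
            if count > 0:
                df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in docs:
        total = sum(doc.values())
        weighted.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in doc.items() if count > 0
        })
    return weighted

docs = [
    {"Book": 3, "Work": 0, "Dollar": 5, "City": 18},
    {"Book": 9, "Work": 2, "Dollar": 1, "City": 6},
]
print(tf_idf(docs))
```

With only these two documents, terms occurring in both get an IDF of log(1) = 0, while "Work" (which occurs only in the second document) keeps a positive weight — which is exactly the point: terms common to every document tell you nothing about which document you are looking at.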

