Speed up text comparisons (feature vectors) with spatial MySQL features

Problem description

I have a function which takes two arrays containing the tokens/words of two texts and gives out the cosine similarity value which shows the relationship between both texts.

The function takes an array $tokensA (0=>house, 1=>bike, 2=>man) and an array $tokensB (0=>bike, 1=>house, 2=>car) and calculates the similarity which is given back as a floating point value.

function cosineSimilarity($tokensA, $tokensB) {
    // $a = tokens found in both texts, $b / $c = distinct tokens in A / B
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();
    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));
    // Use the tokens as array keys so membership can be tested with isset()
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += $x;
        $c += $y;
    }
    // Cosine of the two binary token vectors; 0 if either text has no tokens
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

If I want to compare 75 texts with each other, I need to make 5,625 single comparisons to have all texts compared with each other.
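As a side note on the count: the similarity above is symmetric and comparing a text with itself is trivial, so each unordered pair only needs to be evaluated once. A minimal sketch of that enumeration follows; the function name `uniquePairs` is made up for illustration and is not part of the original code.

```php
<?php
// Enumerate each unordered pair of indices exactly once (i < j).
// Hypothetical helper, shown only to illustrate the pair count.
function uniquePairs(int $n): array {
    $pairs = array();
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            $pairs[] = array($i, $j);
        }
    }
    return $pairs;
}

$pairCount = count(uniquePairs(75)); // 75 * 74 / 2 = 2775
```

This halves the work without any index, but the number of comparisons still grows quadratically with the number of texts.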

Is it possible to use MySQL's spatial columns to reduce the number of comparisons?

I don't want to talk about my function or about ways to compare texts. Just about reducing the number of comparisons.

MySQL's spatial columns

  • You create spatial columns with: CREATE TABLE abc (clmnName TYPE)
  • possible types are listed here
  • here is how I select the data later [e.g. MultiPointFromText() or AsText()]
  • You insert values like this: INSERT INTO abc (clmnName) VALUES (GeomFromText('POINT(1 1)'))

But how do you use this for my problem?

PS: I'm looking for ways to reduce the number of comparisons with algorithms in this question. Vinko Vrsalovic told me that I should open another question for the spatial features.

Solution

While R-Trees in general can index data with an arbitrary number of dimensions, MySQL's spatial abilities are limited to Geometry types (2 dimensions).

If your vectors are 2-dimensional and you can normalize them, then do the following:

  • Split the circle into twice the number of angles which fit your differences
  • Find the MBR of vectors with given cosine difference from the center of each sector
  • Find all vectors within the MBR
  • Do the fine filtering for exact difference.
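The MBR step can be sketched as follows, assuming the normalized vectors lie on the unit circle and a sector is described by a center angle and a half-width, both in radians. The function name `arcMbr` and the array keys are illustrative choices, not MySQL features.

```php
<?php
// Minimum bounding rectangle of the unit-circle arc
// [$center - $delta, $center + $delta] (angles in radians).
// Hypothetical helper for illustration only.
function arcMbr(float $center, float $delta): array {
    $lo = $center - $delta;
    $hi = $center + $delta;
    // Start with the two end points of the arc.
    $xs = array(cos($lo), cos($hi));
    $ys = array(sin($lo), sin($hi));
    // cos and sin reach their extremes at multiples of pi/2;
    // include every such point that falls inside the arc.
    for ($k = ceil($lo / M_PI_2); $k * M_PI_2 <= $hi; $k++) {
        $xs[] = cos($k * M_PI_2);
        $ys[] = sin($k * M_PI_2);
    }
    return array(
        'minX' => min($xs), 'maxX' => max($xs),
        'minY' => min($ys), 'maxY' => max($ys),
    );
}

// Sector centered on the x axis, wide enough for cosine similarity >= 0.9.
$mbr = arcMbr(0.0, acos(0.9));
```

The resulting rectangle would serve as the coarse filter: points stored in a spatial column that fall inside it are candidates, and the exact cosine is then checked only for those.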

In this case, however, it will be better just to precalculate the angle of each vector and index it with a plain B-Tree index.
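The reasoning behind the B-Tree suggestion: for normalized 2-dimensional vectors the cosine similarity equals the cosine of the difference between their angles, so a similarity threshold turns into a range condition on one precomputed number. A minimal sketch, assuming the threshold lies in [-1, 1]; the names `vectorAngle`, `angleRanges` and `idsInRanges` are hypothetical, and the in-memory array only stands in for an indexed angle column.

```php
<?php
// Angle of a 2-D vector in radians, normalized to [0, 2*pi).
function vectorAngle(float $x, float $y): float {
    $angle = atan2($y, $x);
    return $angle < 0 ? $angle + 2 * M_PI : $angle;
}

// Angle ranges [lo, hi] containing every vector whose cosine similarity to
// the query is at least $minCosine. Two ranges are returned when the
// interval wraps around 0 / 2*pi. Assumes -1 <= $minCosine <= 1.
function angleRanges(float $queryAngle, float $minCosine): array {
    $delta = acos($minCosine); // largest allowed angular difference
    $lo = $queryAngle - $delta;
    $hi = $queryAngle + $delta;
    if ($lo < 0) {
        return array(array(0.0, $hi), array($lo + 2 * M_PI, 2 * M_PI));
    }
    if ($hi >= 2 * M_PI) {
        return array(array($lo, 2 * M_PI), array(0.0, $hi - 2 * M_PI));
    }
    return array(array($lo, $hi));
}

// Stand-in for the indexed lookup: keep ids whose angle is in any range.
function idsInRanges(array $angles, array $ranges): array {
    $ids = array();
    foreach ($angles as $id => $angle) {
        foreach ($ranges as $range) {
            if ($angle >= $range[0] && $angle <= $range[1]) {
                $ids[] = $id;
                break;
            }
        }
    }
    return $ids;
}

$angles = array(
    'a' => vectorAngle(1, 0),    // on the x axis
    'b' => vectorAngle(1, 1),    // 45 degrees
    'c' => vectorAngle(0, 1),    // 90 degrees
    'd' => vectorAngle(1, -0.1), // just below the x axis
);
$matches = idsInRanges($angles, angleRanges($angles['a'], 0.9));
```

With the angle stored in an ordinary indexed column, each range would map to one BETWEEN condition, so only the rows inside the range are read instead of every row being compared.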
