Libpuzzle: Indexing millions of pictures?


Question


This is about the libpuzzle library for PHP ( http://libpuzzle.pureftpd.org/project/libpuzzle ) from Mr. Frank Denis. I am trying to understand how to index and store the data in my MySQL database. Generating the vector is absolutely no problem.

Example:

# Compute signatures for two images
$cvec1 = puzzle_fill_cvec_from_file('img1.jpg');
$cvec2 = puzzle_fill_cvec_from_file('img2.jpg');

# Compute the distance between both signatures
$d = puzzle_vector_normalized_distance($cvec1, $cvec2);

# Are pictures similar?
if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
  echo "Pictures are looking similar\n";
} else {
  echo "Pictures are different, distance=$d\n";
}

That's all clear to me, but how do I work with a large number of pictures (> 1,000,000)? Do I calculate the vector and store it with the filename in the database? How do I find similar pictures then? If I store every vector in MySQL, I have to open each record and calculate the distance with the puzzle_vector_normalized_distance function. That procedure takes a lot of time (open each database entry, put it through the function, ...).

I read the README of the libpuzzle library and found the following:

Will it work with a database that has millions of pictures?

A typical image signature only requires 182 bytes, using the built-in compression/decompression functions.

Similar signatures share identical "words", ie. identical sequences of values at the same positions. By using compound indexes (word + position), the set of possible similar vectors is dramatically reduced, and in most cases, no vector distance actually requires to get computed.

Indexing through words and positions also makes it easy to split the data into multiple tables and servers.

So yes, the Puzzle library is certainly not incompatible with projects that need to index millions of pictures.

I also found this description of indexing:

------------------------ INDEXING ------------------------

How to quickly find similar pictures, if they are millions of records?

The original paper has a simple, yet efficient answer.

Cut the vector in fixed-length words. For instance, let's consider the following vector:

[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]

With a word length (K) of 10, you can get the following words:

[ a b c d e f g h i j ] found at position 0
[ b c d e f g h i j k ] found at position 1
[ c d e f g h i j k l ] found at position 2
etc. until position N-1

Then, index your vector with a compound index of (word + position).

Even with millions of images, K = 10 and N = 100 should be enough to have very little entries sharing the same index.

Here's a very basic sample database schema:

+-----------------------------+
| signatures |
+-----------------------------+
| sig_id | signature | pic_id |
+--------+-----------+--------+

+--------------------------+
| words |
+--------------------------+
| pos_and_word | fk_sig_id |
+--------------+-----------+

I'd recommend splitting at least the "words" table into multiple tables and/or servers.

By default (lambdas=9) signatures are 544 bytes long. In order to save storage space, they can be compressed to one third of their original size through the puzzle_compress_cvec() function. Before use, they must be uncompressed with puzzle_uncompress_cvec().

I think compressing is the wrong way, because then I have to uncompress every vector before comparing it.

My question now is: what is the way to handle millions of pictures, and how can I compare them in a fast and efficient way? I can't understand how the "cutting of the vector" should help me with my problem.

Many thanks. Maybe I can find someone here who is working with the libpuzzle library.

Cheers.

Solution

So, let's take a look at the example they give and try to expand.

Let's assume you have a table that stores information relating to each image (path, name, description, etc). In that table, you'll include a field for the compressed signature, calculated and stored when you initially populate the database. Let's define that table thus:

CREATE TABLE images (
    image_id INTEGER NOT NULL PRIMARY KEY,
    name TEXT,
    description TEXT,
    file_path TEXT NOT NULL,
    url_path TEXT NOT NULL,
    signature TEXT NOT NULL
);

When you initially compute the signature, you're also going to compute a number of words from the signature:

// this will be run once for each image:
$cvec = puzzle_fill_cvec_from_file('img1.jpg');
$words = array();
$wordlen = 10; // this is $k from the example
$wordcnt = 100; // this is $n from the example
for ($i=0; $i<min($wordcnt, strlen($cvec)-$wordlen+1); $i++) {
    $words[] = substr($cvec, $i, $wordlen);
}
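
To see what that loop produces, here is a small sketch that runs the same slicing on the README's alphabet vector instead of a real signature. The function name cut_into_words is made up for illustration and is not part of libpuzzle; the vector is written as a plain 26-character string.

```php
<?php
// Hypothetical helper (not part of libpuzzle): slice a signature string
// into at most $n overlapping words of length $k, keyed by start position.
function cut_into_words(string $vector, int $k, int $n): array
{
    $words = [];
    // Never read past the end of the vector, and never produce more than $n words.
    $limit = min($n, strlen($vector) - $k + 1);
    for ($pos = 0; $pos < $limit; $pos++) {
        $words[$pos] = substr($vector, $pos, $k);
    }
    return $words;
}

// The README's example vector, with K = 10 and N = 100.
$words = cut_into_words('abcdefghijklmnopqrstuvwxyz', 10, 100);

echo count($words), "\n";   // 17 words: 26 - 10 + 1
echo $words[0], "\n";       // abcdefghij
echo $words[2], "\n";       // cdefghijkl
```

With a real 544-byte signature the same call would stop at N = 100 words rather than running to the end of the vector.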

Now you can put those words into a table, defined thus:

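CREATE TABLE img_sig_words (
    image_id INTEGER NOT NULL,
    -- MySQL cannot index a TEXT column without a prefix length, so use a
    -- short binary type; position + '__' + a 10-byte word fits in 32 bytes
    sig_word VARBINARY(32) NOT NULL,
    FOREIGN KEY (image_id) REFERENCES images (image_id),
    -- sig_word comes first so that lookups by word can use the index
    INDEX (sig_word, image_id)
);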

Now you insert into that table, prepending the position index of where the word was found, so that you know when a word matches that it matched in the same place in the signature:

// the signature, along with all other data, has already been inserted into the images
// table, and $image_id has been populated with the resulting primary key
foreach ($words as $index => $word) {
    $sig_word = $index.'__'.$word;
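    // Assumes a suitably defined db abstraction layer. The words are raw bytes,
    // so a real implementation should bind them as parameters instead of
    // interpolating them into the SQL string.
    $dbobj->query("INSERT INTO img_sig_words (image_id, sig_word) VALUES ($image_id,
        '$sig_word')");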
}

Your data thus initialized, you can grab images with matching words relatively easily:

// $image_id is set to the base image that you are trying to find matches to
$dbobj->query("SELECT i.*, COUNT(isw.sig_word) AS strength
    FROM images i
    JOIN img_sig_words isw ON i.image_id = isw.image_id
    JOIN img_sig_words isw_search ON isw.sig_word = isw_search.sig_word
        AND isw.image_id != isw_search.image_id
    WHERE isw_search.image_id = $image_id
    GROUP BY i.image_id, i.name, i.description, i.file_path, i.url_path, i.signature
    ORDER BY strength DESC");

You could improve the query by adding a HAVING clause that requires a minimum strength, thus further reducing your matching set.
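
A rough sketch of that HAVING variant, written as a PHP function that returns the SQL string. The name build_match_query and the threshold used in the call are invented for illustration; a sensible minimum strength depends on your data and would need to be found by experiment. It selects only the image id and the strength to keep the GROUP BY short.

```php
<?php
// Hypothetical helper: builds the match query with a minimum number of
// shared words. Both parameters are typed as int, so interpolating them
// into the string cannot inject SQL.
function build_match_query(int $imageId, int $minStrength): string
{
    return "SELECT i.image_id, COUNT(isw.sig_word) AS strength
        FROM img_sig_words isw_search
        JOIN img_sig_words isw
          ON isw.sig_word = isw_search.sig_word
         AND isw.image_id != isw_search.image_id
        JOIN images i ON i.image_id = isw.image_id
        WHERE isw_search.image_id = $imageId
        GROUP BY i.image_id
        HAVING COUNT(isw.sig_word) >= $minStrength
        ORDER BY strength DESC";
}

// Example: matches for image 42 that share at least 10 positioned words.
echo build_match_query(42, 10), "\n";
```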

I make no guarantees that this is the most efficient setup, but it should be roughly functional to accomplish what you're looking for.

Basically, splitting and storing the words in this manner allows you to do a rough distance check without having to run a specialized function on the signatures.
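
To make the idea concrete, here is a minimal in-memory sketch of the same lookup, with a PHP array standing in for the img_sig_words table. Everything in it is an assumption for illustration: the function names are invented, the "signatures" are short made-up strings rather than real libpuzzle vectors, and K is 4 instead of 10 so the numbers stay small.

```php
<?php
// Position-prefixed words, in the same "position__word" format as above.
function positioned_words(string $sig, int $k, int $n): array
{
    $out = [];
    $limit = min($n, strlen($sig) - $k + 1);
    for ($pos = 0; $pos < $limit; $pos++) {
        $out[] = $pos . '__' . substr($sig, $pos, $k);
    }
    return $out;
}

// Build the index: positioned word => list of image ids containing it.
function build_index(array $signatures, int $k, int $n): array
{
    $index = [];
    foreach ($signatures as $id => $sig) {
        foreach (positioned_words($sig, $k, $n) as $word) {
            $index[$word][] = $id;
        }
    }
    return $index;
}

// Count shared positioned words per indexed image, strongest first.
function find_candidates(array $index, string $querySig, int $k, int $n): array
{
    $strength = [];
    foreach (positioned_words($querySig, $k, $n) as $word) {
        foreach ($index[$word] ?? [] as $id) {
            $strength[$id] = ($strength[$id] ?? 0) + 1;
        }
    }
    arsort($strength); // sort by strength, descending, keeping image ids as keys
    return $strength;
}

$index = build_index([
    1 => 'aabbccddeeff',
    2 => 'aabbccddeexx',  // differs from image 1 in the last two values
    3 => 'zzyyxxwwvvuu',  // unrelated
], 4, 100);

// A query signature that differs from image 1 only in its last value.
$candidates = find_candidates($index, 'aabbccddeefz', 4, 100);
print_r($candidates); // image 1 with strength 8, image 2 with strength 7
```

Image 3 never shows up because it shares no word at any position. In a real setup, only the few candidates that survive this step would need to be loaded and checked with puzzle_vector_normalized_distance.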
