MySQL query of inverted index data
Question
I have thousands of pages on a website, which I parsed and stored as an inverted index with the following tables:
document
- docid (PK,FK)
- url
- charactercount
- wordcount
charactercount and wordcount help me tell long documents from short ones, which I may use later.
word
- wordid (PK,FK)
- word
- doc_freq
- inverse_doc_freq
For the inverse_doc_freq calculation I use a fictional high number (100000000) as the total document count, so that I never have to recalculate it when documents are added.
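As a rough sketch of why the fixed total helps: with a constant standing in for the document count, a word's IDF depends only on its own doc_freq, so adding pages never touches other rows. The post does not give the exact formula, so the textbook form log(N / doc_freq) below is an assumption, and the names `FICTIONAL_TOTAL_DOCS` and `inverse_doc_freq()` are only illustrative.

```python
import math

# Fixed stand-in for the total number of documents, as described in the post.
FICTIONAL_TOTAL_DOCS = 100_000_000


def inverse_doc_freq(doc_freq: int, total_docs: int = FICTIONAL_TOTAL_DOCS) -> float:
    """Textbook IDF, log(N / df). Assumed formula; the post does not state one."""
    if doc_freq <= 0:
        raise ValueError("doc_freq must be positive")
    return math.log(total_docs / doc_freq)


# A word that appears on many pages gets a lower IDF than a rare word.
common = inverse_doc_freq(50_000)
rare = inverse_doc_freq(12)
print(f"common={common:.3f} rare={rare:.3f}")
```

Because the total is constant, only the row of a word whose doc_freq changed needs a new inverse_doc_freq value.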
loc
- wordid
- docid
- word_freq
- weight
(wordid and docid combined are unique)
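For reference, the three tables could be declared roughly as follows. This is only a sketch: it runs against an in-memory SQLite database through Python's `sqlite3` module as a stand-in for MySQL, and the column types, the `UNIQUE` constraint on word, and the sample rows are assumptions that the post does not specify.

```python
import sqlite3

# Hypothetical DDL matching the three tables described above.
SCHEMA = """
CREATE TABLE document (
    docid          INTEGER PRIMARY KEY,
    url            TEXT NOT NULL,
    charactercount INTEGER,
    wordcount      INTEGER
);
CREATE TABLE word (
    wordid           INTEGER PRIMARY KEY,
    word             TEXT NOT NULL UNIQUE,   -- assumed: one row per distinct word
    doc_freq         INTEGER,
    inverse_doc_freq REAL
);
CREATE TABLE loc (
    wordid    INTEGER NOT NULL REFERENCES word(wordid),
    docid     INTEGER NOT NULL REFERENCES document(docid),
    word_freq INTEGER,
    weight    REAL,
    PRIMARY KEY (wordid, docid)              -- one posting per (word, document) pair
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented sample rows, just to exercise the constraints.
conn.execute("INSERT INTO document VALUES (1, 'http://example.com/a', 1200, 200)")
conn.execute("INSERT INTO word VALUES (1, 'mysql', 1, 18.4)")
conn.execute("INSERT INTO loc VALUES (1, 1, 5, 7.0)")


def insert_duplicate_posting() -> bool:
    """Return True if the composite key rejects a second (wordid, docid) row."""
    try:
        conn.execute("INSERT INTO loc VALUES (1, 1, 9, 2.0)")
    except sqlite3.IntegrityError:
        return True
    return False


duplicate_rejected = insert_duplicate_posting()
```

The composite primary key on loc is what enforces the "combined unique" rule, and it also gives the joins on wordid an index to use.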
The weight is a score calculated on a simple basis, such as word in title + word in URL + word frequency, etc.
I am having a problem framing my SQL query for the search words. For a 3-word search, I do the following:
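- Break the query into individual words
- Check each word's inverse_doc_freq and drop the words with low IDF (stop word removal)
- Keep the remaining words (say 3 words are still left)
- Query for each word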
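The first three steps might look something like this in application code. It is a minimal sketch under stated assumptions: `IDF_BY_WORD` is a hypothetical dictionary standing in for a lookup against the word table, `MIN_IDF` is an invented cutoff (the post does not say what counts as "low"), and dropping words that are not in the index at all is also an assumption.

```python
# Hypothetical IDF values standing in for rows of the word table.
IDF_BY_WORD = {"the": 0.4, "mysql": 9.2, "inverted": 11.5, "index": 8.7}
MIN_IDF = 2.0  # assumed cutoff for treating a word as a stop word


def select_search_terms(query: str) -> list[str]:
    """Split the query into words and keep those with a high enough IDF."""
    terms = []
    for token in query.lower().split():          # step 1: break into words
        idf = IDF_BY_WORD.get(token)             # step 2: look up each word's IDF
        if idf is not None and idf >= MIN_IDF:   # drop stop words and unknown words
            terms.append(token)                  # step 3: keep the rest
    return terms


terms = select_search_terms("the MySQL inverted index")
print(terms)
```

The surviving terms are what would be bound into the `IN (...)` list of the search query in the last step.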
It is at the fourth step that I get stuck. My SQL query looks like this:
SELECT d.docid,url,inverse_doc_freq,word_freq,weight from document d,word w,loc l WHERE d.docid=l.docid AND w.wordid=l.wordid AND (word='word1' OR word='word2' OR word='word3') ORDER BY weight DESC
The returned documents are not correct, though. I suspect I have to search three times, once per word, and then find the documents common to all three results, but how? Is it possible to do this with only one MySQL query? Also, is it possible to use TF-IDF here, and how?
Recommended answer
You need to aggregate at the document level, so that each document appears once with its weights summed across the matching words.
-- One row per document, scored by the summed weight of the matching words
SELECT d.docid, d.url, SUM(l.weight) AS total_weight
FROM document d
JOIN loc l ON d.docid = l.docid
JOIN word w ON w.wordid = l.wordid
WHERE w.word IN ('word1', 'word2', 'word3')
GROUP BY d.docid, d.url
ORDER BY total_weight DESC;
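The aggregated query returns documents that match any of the words. For the other two parts of the question (only documents containing every word, and TF-IDF ranking), here is one possible sketch, not something stated in the answer. It uses Python's `sqlite3` with an in-memory database as a stand-in for MySQL, invented sample rows, and word_freq * inverse_doc_freq as an assumed TF-IDF form. The SQL itself should carry over to MySQL, though a MySQL driver would typically use `%s` placeholders instead of `?`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (docid INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE word (wordid INTEGER PRIMARY KEY, word TEXT, inverse_doc_freq REAL);
CREATE TABLE loc (wordid INTEGER, docid INTEGER, word_freq INTEGER, weight REAL,
                  PRIMARY KEY (wordid, docid));
INSERT INTO document VALUES (1, '/a'), (2, '/b'), (3, '/c');
INSERT INTO word VALUES (1, 'word1', 2.0), (2, 'word2', 3.0), (3, 'word3', 5.0);
-- Invented postings: /a contains all three words, /b two of them, /c only one.
INSERT INTO loc VALUES (1, 1, 1, 1.0), (2, 1, 1, 1.0), (3, 1, 1, 1.0),
                       (1, 2, 4, 6.0), (2, 2, 2, 5.0),
                       (3, 3, 3, 9.0);
""")

terms = ["word1", "word2", "word3"]
placeholders = ", ".join("?" for _ in terms)

# Variant 1: keep only documents that contain every search term.
all_terms_sql = f"""
SELECT d.docid, d.url, SUM(l.weight) AS total_weight
FROM document d
JOIN loc l ON l.docid = d.docid
JOIN word w ON w.wordid = l.wordid
WHERE w.word IN ({placeholders})
GROUP BY d.docid, d.url
HAVING COUNT(DISTINCT w.wordid) = ?
ORDER BY total_weight DESC
"""
all_terms_rows = conn.execute(all_terms_sql, (*terms, len(terms))).fetchall()

# Variant 2: rank every matching document by its summed TF-IDF.
tfidf_sql = f"""
SELECT d.docid, d.url, SUM(l.word_freq * w.inverse_doc_freq) AS tfidf
FROM document d
JOIN loc l ON l.docid = d.docid
JOIN word w ON w.wordid = l.wordid
WHERE w.word IN ({placeholders})
GROUP BY d.docid, d.url
ORDER BY tfidf DESC
"""
tfidf_rows = conn.execute(tfidf_sql, terms).fetchall()

print(all_terms_rows)
print(tfidf_rows)
```

The `HAVING COUNT(DISTINCT w.wordid) = ?` clause is what replaces the "search three times and intersect" idea: a document survives only if it matched as many distinct words as were searched for.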