MySQL query of inverted index data

Problem description

I have thousands of pages on a website, which I parsed and stored as an inverted index, viz.:

document

  • docid (PK,FK)
  • url
  • charactercount
  • wordcount

charactercount and wordcount help me tell long documents from short ones, which I may use later.

word

  • wordid (PK,FK)
  • word
  • doc_freq
  • inverse_doc_freq

For the inverse_doc_freq calculation I use a fictional high number (100000000) in place of the total document count, so that the total never has to be recalculated.
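As a rough illustration of that calculation, here is a minimal Python sketch. It assumes the textbook form idf = log(N / doc_freq); the post does not say which IDF variant is actually used, and the function and constant names are hypothetical.

```python
import math

# Assumption: the "fictional high number" from the post stands in for the
# total document count N, so N never has to be recounted.
ASSUMED_TOTAL_DOCS = 100_000_000


def inverse_doc_freq(doc_freq, total_docs=ASSUMED_TOTAL_DOCS):
    """Textbook IDF, log(N / df). The variant used in the post is not stated."""
    if doc_freq <= 0:
        raise ValueError("doc_freq must be positive")
    return math.log(total_docs / doc_freq)


# A word that occurs in many documents scores lower than a rare one.
common = inverse_doc_freq(50_000)
rare = inverse_doc_freq(12)
```

With this form, replacing the true document count by a fixed constant only shifts every IDF by the same additive amount, so the ordering of words by IDF is unchanged.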

loc

  • wordid
  • docid
  • word_freq
  • weight

(wordid and docid combined are unique)

The weight is a score calculated on a simple basis, such as word in title + word in URL + word frequency, etc.

I am having a problem framing my SQL query for the search words. For a 3-word search I do the following:

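  1. Break the query into individual words
  2. Check each word's inverse_doc_freq and drop the words with a low IDF (stop-word removal)
  3. Keep the remaining words (assume 3 words are left)
  4. Query for each word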
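Steps 1 to 3 above can be sketched in Python as follows. This is only an illustration: the function name, the `min_idf` threshold, and the sample IDF values are hypothetical, and the dictionary stands in for a lookup against word.inverse_doc_freq.

```python
def select_search_words(query, idf_by_word, min_idf):
    """Split the query, then keep words whose IDF is at least min_idf."""
    kept = []
    for word in query.lower().split():
        idf = idf_by_word.get(word)  # None means the word is not in the index
        if idf is not None and idf >= min_idf and word not in kept:
            kept.append(word)
    return kept


# Hypothetical IDF values, as if read from the word table
idf_by_word = {"the": 0.2, "mysql": 6.1, "inverted": 8.4, "index": 5.3}
words = select_search_words("The MySQL inverted index", idf_by_word, min_idf=1.0)
```

Here words that are missing from the index are dropped along with the low-IDF ones; whether an unknown word should instead make the whole search return nothing is a design choice the post leaves open.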

It is at stage 4 that I am getting stuck! My SQL query looks like this:

SELECT d.docid, url, inverse_doc_freq, word_freq, weight
FROM document d, word w, loc l
WHERE d.docid = l.docid
  AND w.wordid = l.wordid
  AND (word = 'word1' OR word = 'word2' OR word = 'word3')
ORDER BY weight DESC

The returned documents are not correct, though. I suspect I might have to search three times, once per word, and then find the documents common to all three results, but how? Is it possible to do this with only one MySQL query? Also, is it possible to use TF-IDF, and how?

Recommended answer

You need to aggregate at the document level.

-- Sum the per-word weights so that each document appears only once
SELECT d.docid, d.url, SUM(l.weight) AS total_weight
FROM document d
JOIN loc l ON d.docid = l.docid
JOIN word w ON w.wordid = l.wordid
WHERE w.word IN ('word1', 'word2', 'word3')
GROUP BY d.docid, d.url
ORDER BY total_weight DESC;
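The question also asks how to keep only the documents common to all search words and how to bring in TF-IDF. The sketch below is one possible extension of the aggregate query, not part of the answer itself. It runs against an in-memory SQLite database as a stand-in for MySQL, the sample rows are made up, and the score SUM(word_freq * inverse_doc_freq) is just one simple TF-IDF form. The HAVING clause assumes each word is stored once in the word table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (docid INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE word (wordid INTEGER PRIMARY KEY, word TEXT, inverse_doc_freq REAL);
CREATE TABLE loc (wordid INTEGER, docid INTEGER, word_freq INTEGER, weight REAL,
                  PRIMARY KEY (wordid, docid));
-- Made-up sample rows, for illustration only
INSERT INTO document VALUES (1, '/a'), (2, '/b'), (3, '/c');
INSERT INTO word VALUES (1, 'word1', 2.0), (2, 'word2', 3.0), (3, 'word3', 5.0);
INSERT INTO loc VALUES
  (1, 1, 4, 1.0), (2, 1, 1, 2.0), (3, 1, 2, 3.0),  -- doc 1 has all three words
  (1, 2, 9, 9.0), (2, 2, 1, 1.0),                  -- doc 2 lacks word3
  (3, 3, 1, 0.5);                                  -- doc 3 has only word3
""")

SEARCH_WORDS = ["word1", "word2", "word3"]
placeholders = ", ".join("?" for _ in SEARCH_WORDS)

# Keep only documents matching every search word, ranked by a TF-IDF-style sum.
query = f"""
SELECT d.docid, d.url,
       SUM(l.weight) AS total_weight,
       SUM(l.word_freq * w.inverse_doc_freq) AS tfidf
FROM document d
JOIN loc l ON d.docid = l.docid
JOIN word w ON w.wordid = l.wordid
WHERE w.word IN ({placeholders})
GROUP BY d.docid, d.url
HAVING COUNT(DISTINCT w.wordid) = ?
ORDER BY tfidf DESC
"""
rows = conn.execute(query, SEARCH_WORDS + [len(SEARCH_WORDS)]).fetchall()
```

Dropping the HAVING clause gives back the "any word matches" behaviour of the aggregate query. The `?` placeholders are SQLite's parameter style; Python drivers for MySQL typically use `%s` instead.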
