Optimal data architecture for tagging, clouds, and searching (like StackOverflow)?

Question

I'd love to know how Stack Overflow's tagging and search is architected, because it seems to work pretty well.

What is a good database/search model if I want to do all of the following:

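  1. Store tags on various entities (how normalized? i.e. Entity, Tag, and Entity_Tag tables?)
  2. Search for items with a particular tag
  3. Build a tag cloud of all the tags that apply to a particular search result set
  4. Show a tag list for each item in a search result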

Perhaps it makes sense to store the tags in a normalized form, but also as a space-delimited string for the purposes of #2, #4, and perhaps #3. Thoughts?

I have heard it said that Stack Overflow uses Lucene for search. Is that true? I've heard a couple of podcasts discussing SQL optimization, but nothing about Lucene. If they do use Lucene, I'm wondering how much of the search result comes from Lucene, and whether the "drill-down" tag cloud comes from Lucene.

Recommended answer

Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.

So here I go again...

Regarding Stack Overflow, it turns out that they use SQL Server 2005 full-text search.

Regarding the open-source (OS) projects recommended by @Grant:

  • DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full-text search with a tag search.
  • Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
  • Both projects take the 3-table approach to tagging that everyone generally seems to recommend.
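To make the 3-table join concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not taken from Kigg or DotNetKicks: the column names, sample rows, and the `stories_tagged` helper are assumptions made for illustration, and only the Stories->StoryTags->Tags shape comes from the description above.

```python
import sqlite3

# In-memory database with an assumed, simplified version of the 3-table layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Stories   (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE Tags      (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE StoryTags (
        story_id INTEGER NOT NULL REFERENCES Stories(id),
        tag_id   INTEGER NOT NULL REFERENCES Tags(id),
        PRIMARY KEY (story_id, tag_id)
    );
    INSERT INTO Stories VALUES (1, 'Intro to Lucene'), (2, 'LINQ tips'), (3, 'Search in .NET');
    INSERT INTO Tags VALUES (1, 'lucene'), (2, 'linq'), (3, 'search');
    INSERT INTO StoryTags VALUES (1, 1), (1, 3), (2, 2), (3, 3);
""")


def stories_tagged(conn, tag_name):
    """Return the titles of all stories carrying the given tag (hypothetical helper)."""
    rows = conn.execute(
        """
        SELECT s.title
        FROM Stories s
        JOIN StoryTags st ON st.story_id = s.id
        JOIN Tags t       ON t.id = st.tag_id
        WHERE t.name = ?
        ORDER BY s.id
        """,
        (tag_name,),
    )
    return [title for (title,) in rows]


titles = stories_tagged(conn, "search")
print(titles)
```

The composite primary key on StoryTags prevents tagging the same story twice with the same tag, which is one reason the link table is usually kept separate.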

I also found some other questions on SO that I'd missed before:

What I'm currently doing for each of the items I mentioned:

  1. In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
    • Build site-wide tag clouds
    • Browse by tag (i.e. URLs like SO's /questions/tagged/ASP.NET)
  • Tags are concatenated into a TagString that is indexed by Lucene
    • So I have the full power of the Lucene query engine (AND/OR/NOT queries)
    • I can search text and filter by tags at the same time
    • The Lucene analyzer merges word variants for better tag searching (e.g. a tag search for "test" will also find items tagged "testing")
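As a rough illustration of the database side of this setup, the sketch below builds a site-wide tag cloud and a space-delimited TagString with Python's sqlite3 module. The table names Entity, Tag, and Entity_Tag and the TagString column come from the answer; the other column names, the sample data, and both helper functions are assumptions. Lucene itself is not involved here: the TagString is only the text that would be handed to the indexer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Entity (id INTEGER PRIMARY KEY, title TEXT, TagString TEXT NOT NULL DEFAULT '');
    CREATE TABLE Tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE Entity_Tag (
        entity_id INTEGER NOT NULL REFERENCES Entity(id),
        tag_id    INTEGER NOT NULL REFERENCES Tag(id),
        PRIMARY KEY (entity_id, tag_id)
    );
    INSERT INTO Entity (id, title) VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO Tag VALUES (1, 'asp.net'), (2, 'lucene'), (3, 'sql');
    INSERT INTO Entity_Tag VALUES (1, 1), (1, 2), (2, 1), (3, 1), (3, 3);
""")


def site_tag_cloud(conn):
    """Return (tag, usage count) pairs for the whole site, most used first."""
    return conn.execute(
        """
        SELECT t.name, COUNT(*) AS uses
        FROM Tag t
        JOIN Entity_Tag et ON et.tag_id = t.id
        GROUP BY t.id, t.name
        ORDER BY uses DESC, t.name
        """
    ).fetchall()


def build_tag_string(conn, entity_id):
    """Concatenate an entity's tags into one space-delimited string."""
    rows = conn.execute(
        """
        SELECT t.name
        FROM Tag t
        JOIN Entity_Tag et ON et.tag_id = t.id
        WHERE et.entity_id = ?
        ORDER BY t.name
        """,
        (entity_id,),
    )
    return " ".join(name for (name,) in rows)


cloud = site_tag_cloud(conn)
tag_string = build_tag_string(conn, 1)
print(cloud)
print(tag_string)
```

The counts from `site_tag_cloud` can be mapped to font sizes for display. A space-delimited TagString only works if individual tags never contain spaces, which is an assumption of this sketch.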

    This means that whenever an Entity's tags are modified, I have to:

    • insert any new Tags that don't already exist
    • insert/delete from the EntityTag table
    • update Entity.TagString
    • update the Lucene index for the entity
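The four steps above might look roughly like the following in Python with sqlite3. This is a sketch under several assumptions: the column names, the lowercase normalization of tag names, the choice to delete and re-insert the entity's link rows rather than diff them, and the `reindex` callback (a stand-in for whatever updates the Lucene index) are all hypothetical.

```python
import sqlite3


def set_tags(conn, entity_id, tag_names, reindex):
    """Replace an entity's tags, refresh its TagString, then request a re-index."""
    # Assumption: tags are normalized to lowercase and de-duplicated.
    wanted = sorted({t.strip().lower() for t in tag_names if t.strip()})
    with conn:  # one transaction: commits on success, rolls back on error
        # 1. Insert any new tags that don't already exist.
        conn.executemany(
            "INSERT OR IGNORE INTO Tag (name) VALUES (?)", [(t,) for t in wanted]
        )
        # 2. Insert/delete link rows (here: clear them, then re-insert the wanted set).
        conn.execute("DELETE FROM Entity_Tag WHERE entity_id = ?", (entity_id,))
        conn.executemany(
            "INSERT INTO Entity_Tag (entity_id, tag_id) "
            "SELECT ?, id FROM Tag WHERE name = ?",
            [(entity_id, t) for t in wanted],
        )
        # 3. Update the denormalized TagString.
        tag_string = " ".join(wanted)
        conn.execute(
            "UPDATE Entity SET TagString = ? WHERE id = ?", (tag_string, entity_id)
        )
    # 4. Update the search index only after the database change has committed.
    reindex(entity_id, tag_string)
    return tag_string


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Entity (id INTEGER PRIMARY KEY, title TEXT, TagString TEXT NOT NULL DEFAULT '');
    CREATE TABLE Tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE Entity_Tag (
        entity_id INTEGER NOT NULL,
        tag_id    INTEGER NOT NULL,
        PRIMARY KEY (entity_id, tag_id)
    );
    INSERT INTO Entity (id, title) VALUES (1, 'How do I page results?');
    INSERT INTO Tag (name) VALUES ('sql');
""")

reindexed = []  # records re-index requests in place of a real index
result = set_tags(conn, 1, ["Lucene", "sql"], lambda eid, ts: reindexed.append((eid, ts)))
print(result, reindexed)
```

Keeping the first three steps in one transaction means the TagString cannot drift from the link table if a write fails partway through.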

    Given that the ratio of reads to writes is very big in my application, I think I'm ok with this. The only really time-consuming part is Lucene indexing, because Lucene can only insert and delete from its index, so I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.
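One way to picture the background re-indexing idea is a worker thread fed by a queue. The sketch below uses only the Python standard library. `ToyIndex` is a hypothetical stand-in that mimics the delete-then-add constraint described above; it is not Lucene's API, and the field names are assumptions.

```python
import queue
import threading


class ToyIndex:
    """Stand-in for a search index that supports only add and delete (no in-place update)."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def delete(self, entity_id):
        with self._lock:
            self._docs.pop(entity_id, None)

    def add(self, entity_id, fields):
        with self._lock:
            self._docs[entity_id] = dict(fields)

    def get(self, entity_id):
        with self._lock:
            return self._docs.get(entity_id)


def reindex_worker(index, jobs):
    """Consume (entity_id, fields) jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            entity_id, fields = job
            # "Update" means removing the old document and adding the whole entity again.
            index.delete(entity_id)
            index.add(entity_id, fields)
        finally:
            jobs.task_done()


index = ToyIndex()
jobs = queue.Queue()
worker = threading.Thread(target=reindex_worker, args=(index, jobs), daemon=True)
worker.start()

# The request thread only enqueues work and returns immediately.
jobs.put((1, {"body": "How do I page results?", "TagString": "sql"}))
jobs.put((1, {"body": "How do I page results?", "TagString": "lucene sql"}))

jobs.join()     # wait for pending jobs (done here only so the example is deterministic)
jobs.put(None)  # ask the worker to stop
worker.join()
print(index.get(1))
```

A single worker processes jobs in order, so the most recent tag change for an entity is the one that ends up in the index. The trade-off is that search results can lag briefly behind the database.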

    Time will tell...
