Optimal data architecture for tagging, clouds, and searching (like StackOverflow)?


Question

I'd love to know how Stack Overflow's tagging and search is architected, because it seems to work pretty well.

What is a good database/search model if I want to do all of the following:

  1. Storing Tags on various entities (how normalized? i.e. Entity, Tag, and Entity_Tag tables?)
  2. Searching for items with particular tags
  3. Building a tag cloud of all tags that apply to a particular search result set
  4. How to show a tag list for each item in a search result?

Perhaps it makes sense to store the tags in a normalized form, but also as a space-delimited string for the purposes of #2, #4, and perhaps #3. Thoughts?
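
To make the space-delimited idea concrete, here is a minimal sketch in Python. It is not anything Stack Overflow is known to do: it simply assumes each item in a search result set carries its tags as one space-delimited string, and counts them to get the tag cloud for that result set. The `tag_cloud` function and the sample data are hypothetical.

```python
from collections import Counter


def tag_cloud(tag_strings, top_n=None):
    """Count how many items in a result set carry each tag.

    tag_strings: iterable of space-delimited tag strings, one per item.
    Returns (tag, count) pairs, most frequent first.
    """
    counts = Counter()
    for tag_string in tag_strings:
        # set() guards against a tag repeated within a single item
        counts.update(set(tag_string.split()))
    return counts.most_common(top_n)


# Hypothetical tag strings for three items in one search result set
results = ["asp.net c# lucene", "c# nhibernate", "c# lucene"]
cloud = tag_cloud(results)
```

The same string could be printed directly under each result for the per-item tag list, which avoids any join at display time.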

I have heard it said that Stack Overflow uses Lucene for search. Is that true? I've heard a couple of podcasts discussing SQL optimization, but nothing about Lucene. If they do use Lucene, I'm wondering how much of the search result comes from Lucene, and whether the "drill-down" tag cloud comes from Lucene.

Answer

Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.

So here I go again...

Regarding Stack Overflow, it turns out that they use SQL Server 2005 full-text search.

Regarding the open-source projects recommended by @Grant:

  • DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full-text search with a tag search.
  • Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
  • Both projects take the 3-table approach to tagging that everyone generally seems to recommend.
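
For reference, the 3-table approach and the join it implies can be sketched like this. It uses Python's built-in `sqlite3` module purely for illustration and is not the actual DotNetKicks or Kigg schema; the table and column names, the sample rows, and the two helper functions are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Entity (Id INTEGER PRIMARY KEY, Title TEXT NOT NULL);
CREATE TABLE Tag (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE Entity_Tag (
    EntityId INTEGER NOT NULL REFERENCES Entity(Id),
    TagId INTEGER NOT NULL REFERENCES Tag(Id),
    PRIMARY KEY (EntityId, TagId)
);
INSERT INTO Entity (Id, Title) VALUES (1, 'Paging in Lucene'), (2, 'NHibernate caching');
INSERT INTO Tag (Id, Name) VALUES (1, 'lucene'), (2, 'nhibernate'), (3, 'c#');
INSERT INTO Entity_Tag (EntityId, TagId) VALUES (1, 1), (1, 3), (2, 2), (2, 3);
""")


def entities_tagged(conn, tag_name):
    """Browse by tag: join Entity -> Entity_Tag -> Tag."""
    return conn.execute(
        """SELECT e.Id, e.Title
           FROM Entity e
           JOIN Entity_Tag et ON et.EntityId = e.Id
           JOIN Tag t ON t.Id = et.TagId
           WHERE t.Name = ?
           ORDER BY e.Id""",
        (tag_name,),
    ).fetchall()


def tag_counts(conn):
    """Usage count per tag, most used first (raw data for a tag cloud)."""
    return conn.execute(
        """SELECT t.Name, COUNT(*) AS Uses
           FROM Tag t
           JOIN Entity_Tag et ON et.TagId = t.Id
           GROUP BY t.Name
           ORDER BY Uses DESC, t.Name"""
    ).fetchall()
```

Here `entities_tagged(conn, 'lucene')` returns only the first entity, while `entities_tagged(conn, 'c#')` returns both.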

I also found some other questions on SO that I'd missed before:

What I'm currently doing for each of the items I mentioned:

  1. In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
    • Build site-wide tag clouds
    • Browse by tag (e.g. URLs like SO's /questions/tagged/ASP.NET)
  2. For search I use Lucene + NHibernate.Search
    • Tags are concatenated into a TagString that is indexed by Lucene
      • So I have the full power of the Lucene query engine (AND / OR / NOT queries)
      • I can search for text and filter by tags at the same time
      • The Lucene analyzer reduces words to a common stem for better tag searches (e.g. a tag search for "test" will also find stuff tagged "testing")
    • Lucene returns a potentially enormous result set, which I paginate to 20 results
    • Then NHibernate loads the result Entities by Id, either from the DB or the Entity cache
    • So it's entirely possible that a search results in 0 hits to the DB
  3. Not doing this yet, but I think I will probably try to find a way to build the tag cloud from the TagString in Lucene, rather than take another DB hit
  4. Haven't done this yet either, but I will probably store the TagString in the DB so that I can show an Entity's Tag list without having to make 2 more joins.
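
The paginate-then-load steps in item 2 can be sketched roughly as follows. This is plain Python rather than NHibernate or Lucene: `EntityLoader`, `page_of`, and the `fetch_from_db` callable are hypothetical stand-ins, and the list of ids stands in for the ordered hits a search index would return. It shows why a repeated search can be served without touching the DB.

```python
PAGE_SIZE = 20


def page_of(hit_ids, page, page_size=PAGE_SIZE):
    """Return the ids for one zero-based page of an ordered hit list."""
    start = page * page_size
    return hit_ids[start:start + page_size]


class EntityLoader:
    """Loads entities by id, going to the DB only for ids not yet cached."""

    def __init__(self, fetch_from_db):
        # fetch_from_db: hypothetical callable, list of ids -> {id: entity}
        self._fetch_from_db = fetch_from_db
        self._cache = {}
        self.db_hits = 0  # number of round trips made to the DB

    def load(self, ids):
        missing = [i for i in ids if i not in self._cache]
        if missing:
            self.db_hits += 1
            self._cache.update(self._fetch_from_db(missing))
        # Preserve the order given by the search index
        return [self._cache[i] for i in ids if i in self._cache]


# Usage with a fake DB of 100 entities
fake_db = {i: f"entity-{i}" for i in range(100)}
loader = EntityLoader(lambda ids: {i: fake_db[i] for i in ids})
hits = list(range(100))                   # stand-in for search hits
first = loader.load(page_of(hits, 0))     # one DB round trip
again = loader.load(page_of(hits, 0))     # served entirely from the cache
```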

This means that whenever an Entity's tags are modified, I have to:

  • Insert any new Tags that do not already exist
  • Insert/Delete from the EntityTag table
  • Update Entity.TagString
  • Update the Lucene index for the Entity
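
A minimal sketch of those four steps, using Python's built-in `sqlite3` module for illustration. The schema, the `set_entity_tags` function, and the `reindex` callback (a placeholder for whatever updates the search index) are assumptions. For brevity it replaces all of the entity's Entity_Tag rows instead of computing the exact inserts and deletes.

```python
import sqlite3


def set_entity_tags(conn, entity_id, tag_names, reindex):
    """Replace an entity's tags and keep the denormalized copies in sync."""
    names = sorted(set(tag_names))
    tag_string = " ".join(names)
    with conn:  # steps 1-3 commit together or roll back together
        # 1. Insert any new tags that do not already exist
        conn.executemany(
            "INSERT OR IGNORE INTO Tag (Name) VALUES (?)",
            [(n,) for n in names],
        )
        # 2. Insert/delete Entity_Tag rows (simplified: clear and re-add)
        conn.execute("DELETE FROM Entity_Tag WHERE EntityId = ?", (entity_id,))
        conn.executemany(
            "INSERT INTO Entity_Tag (EntityId, TagId) "
            "SELECT ?, Id FROM Tag WHERE Name = ?",
            [(entity_id, n) for n in names],
        )
        # 3. Update the denormalized TagString
        conn.execute(
            "UPDATE Entity SET TagString = ? WHERE Id = ?",
            (tag_string, entity_id),
        )
    # 4. Update the search index (hypothetical callback)
    reindex(entity_id, tag_string)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Entity (Id INTEGER PRIMARY KEY, Title TEXT NOT NULL,
                     TagString TEXT NOT NULL DEFAULT '');
CREATE TABLE Tag (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE Entity_Tag (EntityId INTEGER NOT NULL, TagId INTEGER NOT NULL,
                         PRIMARY KEY (EntityId, TagId));
INSERT INTO Entity (Id, Title) VALUES (1, 'Paging in Lucene');
""")

reindexed = []  # records what would have been sent to the index


def record_reindex(entity_id, tag_string):
    reindexed.append((entity_id, tag_string))


set_entity_tags(conn, 1, ["lucene", "c#"], record_reindex)
set_entity_tags(conn, 1, ["c#", "search"], record_reindex)
```

Running the index update after the transaction commits means a failed DB write never reaches the index, at the cost of the index briefly lagging behind the DB.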

Given that the ratio of reads to writes is very high in my application, I think I'm OK with this. The only really time-consuming part is Lucene indexing: because Lucene can only insert into and delete from its index, I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.
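
The background re-indexing idea might look something like the sketch below. It is plain Python, not Lucene: `ToyIndex` is a hypothetical stand-in for an index that only supports add and delete, so an update is expressed as a delete followed by an add of the whole document, and a single worker thread drains a queue of re-index jobs.

```python
import queue
import threading


class ToyIndex:
    """Stand-in for a search index that supports only add and delete."""

    def __init__(self):
        self.docs = {}

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def add(self, doc_id, fields):
        self.docs[doc_id] = dict(fields)


def reindex_worker(index, jobs):
    """Drain re-index jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            doc_id, fields = job
            # "Update" means delete the old document, then add the new one
            index.delete(doc_id)
            index.add(doc_id, fields)
        finally:
            jobs.task_done()


index = ToyIndex()
jobs = queue.Queue()
worker = threading.Thread(target=reindex_worker, args=(index, jobs), daemon=True)
worker.start()

# The request thread only enqueues work and returns immediately
jobs.put((1, {"title": "Paging in Lucene", "tags": "c# lucene"}))
jobs.put((1, {"title": "Paging in Lucene", "tags": "c# search"}))

jobs.join()      # wait for pending jobs (only needed in this demo)
jobs.put(None)   # ask the worker to stop
worker.join()
```

Using one worker keeps the jobs for a given entity in order, so the last write wins in the index as it does in the DB.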

Time will tell...
