Optimal data architecture for tagging, clouds, and searching (like StackOverflow)?

Views: 173 | Posted: 2017/3/21 23:42:15 | Tags: database-design, tags, full-text-search, tag-cloud


Question

I'd love to know how Stack Overflow's tagging and search is architected, because it seems to work pretty well.

What is a good database/search model if I want to do all of the following:

1. Storing tags on various entities (how normalized? i.e. Entity, Tag, and Entity_Tag tables?)
2. Searching for items with particular tags
3. Building a tag cloud of all tags that apply to a particular search result set
4. How to show a tag list for each item in a search result?

Perhaps it makes sense to store the tags in a normalized form, but also as a space-delimited string for the purposes of #2, #4, and perhaps #3. Thoughts?

I have heard it said that Stack Overflow uses Lucene for search. Is that true? I've heard a couple of podcasts discussing SQL optimization, but nothing about Lucene. If they do use Lucene, I'm wondering how much of the search result comes from Lucene, and whether the "drill-down" tag cloud comes from Lucene.

Solution

Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.

So here I go again...

Regarding Stack Overflow, it turns out that they use SQL Server 2005 full-text search.

Regarding the open-source projects recommended by @Grant:

- DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full-text search with a tag search.
- Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
- Both projects have a 3-table approach to tagging, as everyone generally seems to recommend.

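I also found some other questions on SO that I'd missed before:

- How do you recommend implementing tags or tagging?
- How to structure data for searchability?
- Database design for tagging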

What I'm currently doing for each of the items I mentioned:

In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
- Build site-wide tag clouds
- Browse by tag (e.g. URLs like SO's /questions/tagged/ASP.NET)
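
As a rough illustration of the 3-table approach, here is a minimal sketch using Python's built-in sqlite3 module. The column names (Id, Title, Name, EntityId, TagId), the sample rows, and the helper functions `tag_cloud` and `entities_tagged` are assumptions made for the example, not the schema any of the projects above actually use.

```python
import sqlite3

# Hypothetical schema: table names follow the answer, column names are assumed.
SCHEMA = """
CREATE TABLE Entity (Id INTEGER PRIMARY KEY, Title TEXT NOT NULL);
CREATE TABLE Tag (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE Entity_Tag (
    EntityId INTEGER NOT NULL REFERENCES Entity(Id),
    TagId INTEGER NOT NULL REFERENCES Tag(Id),
    PRIMARY KEY (EntityId, TagId)
);
"""

def tag_cloud(conn):
    """Site-wide tag cloud: (tag name, number of entities), most used first."""
    return conn.execute(
        """SELECT t.Name, COUNT(*) AS Uses
           FROM Tag t JOIN Entity_Tag et ON et.TagId = t.Id
           GROUP BY t.Id, t.Name
           ORDER BY Uses DESC, t.Name"""
    ).fetchall()

def entities_tagged(conn, tag_name):
    """Browse by tag: titles of all entities carrying the given tag."""
    rows = conn.execute(
        """SELECT e.Title
           FROM Entity e
           JOIN Entity_Tag et ON et.EntityId = e.Id
           JOIN Tag t ON t.Id = et.TagId
           WHERE t.Name = ?
           ORDER BY e.Id""",
        (tag_name,),
    ).fetchall()
    return [title for (title,) in rows]

# Sample data, invented for the example.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO Entity VALUES (?, ?)",
                 [(1, "Routing in MVC"), (2, "Lucene paging")])
conn.executemany("INSERT INTO Tag VALUES (?, ?)", [(1, "asp.net"), (2, "lucene")])
conn.executemany("INSERT INTO Entity_Tag VALUES (?, ?)", [(1, 1), (2, 1), (2, 2)])
```

Browsing by tag costs two joins per query, which is the overhead the denormalized TagString idea is meant to avoid.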

For search I use Lucene + NHibernate.Search:
- Tags are concatenated into a TagString that is indexed by Lucene
  - So I have the full power of the Lucene query engine (AND / OR / NOT queries)
  - I can search for text and filter by tags at the same time
  - The Lucene analyzer conflates word variants for better tag searches (e.g. a tag search for "test" will also find stuff tagged "testing")
- Lucene returns a potentially enormous result set, which I paginate to 20 results
- Then NHibernate loads the result Entities by Id, either from the DB or the Entity cache
- So it's entirely possible that a search never hits the DB at all
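
The flow above can be sketched in Python with a toy in-memory inverted index. This is a stand-in to show the shape of the idea, not Lucene or NHibernate.Search: `ToyIndex`, `load_entities`, and the field names "text" and "tags" are all hypothetical, and analyzer behavior such as matching "test" against "testing" is not modeled.

```python
from collections import defaultdict

PAGE_SIZE = 20  # the answer paginates to 20 results

class ToyIndex:
    """Stand-in for a full-text index: maps (field, term) -> set of entity ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, entity_id, text, tag_string):
        for term in text.lower().split():
            self.postings[("text", term)].add(entity_id)
        # The space-delimited TagString is indexed as its own field.
        for tag in tag_string.lower().split():
            self.postings[("tags", tag)].add(entity_id)

    def search(self, text_terms, required_tags=(), page=0):
        """AND all text terms and all tags; return one page of ids plus the total."""
        clauses = [("text", t.lower()) for t in text_terms]
        clauses += [("tags", t.lower()) for t in required_tags]
        if not clauses:
            return [], 0
        hits = set.intersection(*(self.postings.get(c, set()) for c in clauses))
        ordered = sorted(hits)
        start = page * PAGE_SIZE
        return ordered[start:start + PAGE_SIZE], len(ordered)

def load_entities(ids, cache, load_from_db):
    """Resolve ids via the cache first, falling back to the DB only on a miss."""
    result = []
    for entity_id in ids:
        if entity_id not in cache:
            cache[entity_id] = load_from_db(entity_id)
        result.append(cache[entity_id])
    return result

index = ToyIndex()
index.add(1, "paging large result sets", "lucene nhibernate")
index.add(2, "paging in sql", "sql-server")

# Text search and tag filter in a single query.
ids, total = index.search(["paging"], required_tags=["lucene"])

# Entity 1 is already cached, so the DB loader is never called.
db_calls = []
cache = {1: "cached entity 1"}
entities = load_entities(ids, cache, lambda i: db_calls.append(i))
```

When every id on the page is already in the cache, `load_from_db` is never invoked, which corresponds to the case where a search does not touch the DB.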

Not doing this yet, but I think I will probably try to find a way to build the tag cloud from the TagString in Lucene, rather than take another DB hit.

Haven't done this yet either, but I will probably store the TagString in the DB so that I can show an Entity's tag list without having to make 2 more joins.
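
One way this could look, assuming each search hit carries its space-delimited TagString: count the tags across the result set in application code. The function name `tag_cloud_for_results` and the sample strings are made up for the sketch, and it says nothing about how Lucene itself would expose the stored field.

```python
from collections import Counter

def tag_cloud_for_results(tag_strings):
    """Count tags across the TagStrings of every entity in a result set."""
    counts = Counter()
    for tag_string in tag_strings:
        # set() guards against a tag repeated within a single TagString.
        counts.update(set(tag_string.split()))
    return counts.most_common()

# TagStrings of three hypothetical search hits.
cloud = tag_cloud_for_results(["lucene search", "lucene nhibernate", "lucene lucene"])

# The same split gives the per-item tag list without any joins.
tags_for_first_hit = "lucene search".split()
```

This only covers the hits that were actually loaded, so a cloud for the full result set would need the TagString of every hit, not just the current page of 20.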

This means that whenever an Entity's tags are modified, I have to:

1. Insert any new Tags that do not already exist
2. Insert/Delete from the Entity_Tag table
3. Update Entity.TagString
4. Update the Lucene index for the Entity
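
The first three steps can be sketched as a single transaction, here with Python's sqlite3 module. The column names and the `set_tags` helper are assumptions for illustration. For brevity the sketch deletes and re-inserts all of the entity's Entity_Tag rows instead of diffing them, and it leaves the fourth step (re-indexing) to the caller.

```python
import sqlite3

def set_tags(conn, entity_id, tag_names):
    """Apply steps 1-3 for one entity; returns the id that needs re-indexing."""
    names = sorted({n.strip() for n in tag_names} - {""})
    with conn:  # one transaction: commit on success, roll back on error
        # 1. Insert any new tags that do not already exist (Name is UNIQUE).
        conn.executemany("INSERT OR IGNORE INTO Tag (Name) VALUES (?)",
                         [(n,) for n in names])
        # 2. Replace the entity's rows in Entity_Tag.
        conn.execute("DELETE FROM Entity_Tag WHERE EntityId = ?", (entity_id,))
        conn.executemany(
            "INSERT INTO Entity_Tag (EntityId, TagId) "
            "SELECT ?, Id FROM Tag WHERE Name = ?",
            [(entity_id, n) for n in names],
        )
        # 3. Update the denormalized, space-delimited Entity.TagString.
        conn.execute("UPDATE Entity SET TagString = ? WHERE Id = ?",
                     (" ".join(names), entity_id))
    # 4. Re-indexing happens outside the transaction; the caller queues this id.
    return entity_id

# Hypothetical schema and sample row for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Entity (Id INTEGER PRIMARY KEY, Title TEXT, TagString TEXT NOT NULL DEFAULT '');
CREATE TABLE Tag (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE Entity_Tag (EntityId INTEGER NOT NULL, TagId INTEGER NOT NULL,
                         PRIMARY KEY (EntityId, TagId));
INSERT INTO Entity (Id, Title) VALUES (1, 'Example');
""")

set_tags(conn, 1, ["lucene", "search"])
set_tags(conn, 1, ["search", "tags"])  # drops "lucene", adds "tags"
```

Keeping the three writes in one transaction means the normalized tables and the TagString cannot drift apart if one of the statements fails.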

Given that the ratio of reads to writes is very high in my application, I think I'm OK with this. The only really time-consuming part is Lucene indexing: Lucene can only insert into and delete from its index, so I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.
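
A minimal sketch of the background re-indexing idea, using only Python's standard library. The index here is a plain dict standing in for the real search index, and `ReindexWorker`, `load_document`, and the sample document are hypothetical. Error handling and retries are left out.

```python
import queue
import threading

class ReindexWorker:
    """Re-index entities on a background thread as delete-then-add."""

    def __init__(self, index, load_document):
        self.index = index                  # dict: entity id -> indexed document
        self.load_document = load_document  # callable: entity id -> document
        self.pending = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def enqueue(self, entity_id):
        """Called by the request thread after the DB write; returns immediately."""
        self.pending.put(entity_id)

    def _run(self):
        while True:
            entity_id = self.pending.get()
            try:
                # No in-place update: drop the old document, then add the
                # whole entity again with its current TagString.
                self.index.pop(entity_id, None)
                self.index[entity_id] = self.load_document(entity_id)
            finally:
                self.pending.task_done()

    def wait_until_idle(self):
        """Block until every queued entity has been re-indexed."""
        self.pending.join()

documents = {7: {"text": "paging results", "tags": "lucene"}}
index = {}
worker = ReindexWorker(index, lambda entity_id: dict(documents[entity_id]))

worker.enqueue(7)
worker.wait_until_idle()

documents[7]["tags"] = "lucene search"  # the entity's tags change
worker.enqueue(7)
worker.wait_until_idle()
```

The trade-off is that search results can lag slightly behind the DB until the queued entity has been re-indexed.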

Time will tell...
