How to structure data for searchability


Problem description

I am writing a search application specifically for music playlists.

Genre and file format differ from playlist to playlist, and sometimes from track to track within a playlist. There is also a concept of one-way "synonymous" tags (e.g. urban would cover both hiphop and r&b, but not the other way around).

Below is a list of search terms and my expected results.

gospel: should return all playlists with at least one gospel song. Playlists with all gospel songs would be shown first.
urban: should return all r&b and hiphop. Again, playlists with all urban tracks would come first.
hiphop: should return all hiphop but not r&b.
flac: should return all playlists that contain flac files, starting with the ones that are pure flac.
hiphop flac: should return hiphop flacs first, followed by other hiphop audio.
hiphop AND flac: should return hiphop flacs only.
hiphop audio: should return hiphop flacs, hiphop mp3s, etc.
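
One way to read the "pure playlists first" rule above is to score each playlist by the fraction of its tracks that carry the searched tag. The following is only a sketch under that assumption: the in-memory layout (playlist name mapped to a list of per-track tag sets), the sample data, and the function name `rank_playlists` are all hypothetical and do not come from the question.

```python
from typing import Dict, List, Set

# Hypothetical sample data: each playlist is a list of tracks,
# and each track is the set of tags attached to it.
PLAYLISTS: Dict[str, List[Set[str]]] = {
    "sunday": [{"gospel", "flac"}, {"gospel", "mp3"}],
    "mixed": [{"gospel", "mp3"}, {"hiphop", "flac"}],
    "club": [{"house", "mp3"}],
}


def rank_playlists(term: str, playlists: Dict[str, List[Set[str]]]) -> List[str]:
    """Return playlists with at least one matching track, purest first."""
    scored = []
    for name, tracks in playlists.items():
        if not tracks:
            continue  # an empty playlist can never match
        hits = sum(1 for tags in tracks if term in tags)
        if hits:
            scored.append((hits / len(tracks), name))
    # Sort by descending coverage, then by name so ties have a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


print(rank_playlists("gospel", PLAYLISTS))
```

With this data the all-gospel playlist sorts ahead of the half-gospel one, and a playlist with no gospel tracks is left out entirely.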

As I'm just starting this project, I'm thinking about the best way to index all this. Would a full-text search engine like Lucene be of any use here? Note that I don't have any text describing these playlists, but I could generate some.

I'm thinking of organising all these terms as "tags" and storing them in the database in a many-to-many relationship.

table: playlist ( pk(id), desc )
table: tag ( pk(id), desc )
table: playlist_has_tag ( pk(link_id, tag_id) )

To solve the urban == hiphop || rnb thing, I would maybe add a tag_synonyms table:

table: tag_synonyms ( pk(tag_id, synonym_tag_id) )

Then I'd have two records to indicate that urban encompasses hiphop and rnb:
(urban's tag id, hiphop's tag id)
(urban's tag id, rnb's tag id)

I have a feeling, though, that the query could become quite convoluted using this approach.
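
As a rough illustration of how the synonym lookup might look, here is a minimal sketch of the four tables and a one-directional query. It makes several assumptions: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL, it renames the `desc` column to `name` because `DESC` is a reserved word in SQL, and the sample rows and the `search` helper are invented for illustration.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE playlist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE playlist_has_tag (
    link_id INTEGER REFERENCES playlist(id),
    tag_id INTEGER REFERENCES tag(id),
    PRIMARY KEY (link_id, tag_id)
);
CREATE TABLE tag_synonyms (
    tag_id INTEGER REFERENCES tag(id),
    synonym_tag_id INTEGER REFERENCES tag(id),
    PRIMARY KEY (tag_id, synonym_tag_id)
);
-- Hypothetical sample rows: urban covers hiphop and rnb, not the reverse.
INSERT INTO tag VALUES (1, 'urban'), (2, 'hiphop'), (3, 'rnb'), (4, 'gospel');
INSERT INTO tag_synonyms VALUES (1, 2), (1, 3);
INSERT INTO playlist VALUES (10, 'street beats'), (11, 'slow jams'), (12, 'choir');
INSERT INTO playlist_has_tag VALUES (10, 2), (11, 3), (12, 4);
""")

# Match playlists tagged with the searched tag itself or with any tag it covers.
QUERY = """
SELECT DISTINCT p.name
FROM tag AS searched
LEFT JOIN tag_synonyms AS s ON s.tag_id = searched.id
JOIN playlist_has_tag AS pt
  ON pt.tag_id = searched.id OR pt.tag_id = s.synonym_tag_id
JOIN playlist AS p ON p.id = pt.link_id
WHERE searched.name = ?
ORDER BY p.name
"""


def search(term):
    """Return the names of playlists matching a tag or the tags it covers."""
    return [row[0] for row in conn.execute(QUERY, (term,))]


print(search("urban"))
print(search("hiphop"))
```

Because the synonym table is only followed from `tag_id` to `synonym_tag_id`, searching for urban picks up hiphop and rnb playlists, while searching for hiphop does not pick up rnb.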

Could CouchDB be of use here? I'm currently using PostgreSQL. Is there some software out there that will make this kind of thing easy?

I would like to be able to drill down and support complex search terms in the future like:

(hiphop OR house) AND filetype:mp3 AND artwork:no

And also incorporate things like duration, etc.
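
To make the shape of such a query concrete, the sketch below represents it as a nested tuple and evaluates it against one track at a time. This is an assumption-laden illustration rather than a parser: the tuple format, the `matches` function, and the track dictionaries with `tags`, `filetype` and `artwork` keys are all hypothetical. A range filter such as duration would need an extra comparison operator that is not shown.

```python
from typing import Any, Dict, Tuple

Track = Dict[str, Any]


def matches(query: Tuple, track: Track) -> bool:
    """Evaluate a nested (operator, operands...) query against one track."""
    op = query[0]
    if op == "and":
        return all(matches(q, track) for q in query[1:])
    if op == "or":
        return any(matches(q, track) for q in query[1:])
    if op == "not":
        return not matches(query[1], track)
    if op == "tag":
        return query[1] in track["tags"]
    if op == "field":
        _, name, value = query
        return track.get(name) == value
    raise ValueError(f"unknown operator: {op!r}")


# (hiphop OR house) AND filetype:mp3 AND artwork:no
QUERY = (
    "and",
    ("or", ("tag", "hiphop"), ("tag", "house")),
    ("field", "filetype", "mp3"),
    ("field", "artwork", "no"),
)

# Hypothetical tracks used only to exercise the evaluator.
TRACKS = [
    {"tags": {"hiphop"}, "filetype": "mp3", "artwork": "no"},
    {"tags": {"house"}, "filetype": "flac", "artwork": "no"},
    {"tags": {"rnb"}, "filetype": "mp3", "artwork": "no"},
]

print([matches(QUERY, track) for track in TRACKS])
```

Only the first track satisfies all three clauses; the second fails on file type and the third on genre.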

Solution

If you think too hard about how to structure your data for searching, there is a good chance you will miss an important kind of search that your app could really have used.

Alternatively (and this is from experience), you end up reinventing all sorts of indexing techniques.

I have some experience with Lucene (there are Java and .NET versions; there was a C port, but I am not sure how active it is these days), and it can do amazing things with data stored in any structure.
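
Lucene is built around an inverted index, which maps each term to the documents that contain it. The toy sketch below is not Lucene's API; it only illustrates that idea in plain Python, assuming each playlist is described by a generated string of tags as the question suggests. The function names and sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every whitespace-separated term to the ids of documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index: Dict[str, Set[int]], terms: List[str]) -> Set[int]:
    """AND semantics: documents that contain every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search_any(index: Dict[str, Set[int]], terms: List[str]) -> List[int]:
    """OR semantics, ranked by how many of the terms each document contains."""
    counts: Dict[int, int] = defaultdict(int)
    for term in terms:
        for doc_id in index.get(term, set()):
            counts[doc_id] += 1
    return sorted(counts, key=lambda doc_id: (-counts[doc_id], doc_id))


# Hypothetical generated descriptions, one per playlist id.
DOCS = {1: "hiphop flac", 2: "hiphop mp3", 3: "rnb flac"}
INDEX = build_index(DOCS)

print(search_all(INDEX, ["hiphop", "flac"]))
print(search_any(INDEX, ["hiphop", "flac"]))
```

Here "hiphop AND flac" reduces to a set intersection, while the unqualified "hiphop flac" ranks the playlist matching both terms first. Note that this plain OR ranking also returns the rnb flac playlist, so the behaviour described in the question would additionally need the first term to be required.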

I like the look of CouchDB. It just depends on how much you want to experiment with something new and powerful, as opposed to going with something that is (currently) fairly battle-hardened: Lucene.
