高效的数据结构网络化的标签? [英] Efficient Datastructure for tags?

查看:186
本文介绍了高效的数据结构网络化的标签?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

想象一下,你想序列化和反序列化计算器职位,包括他们的标签,高效的空间尽可能(二进制),也做标签查找时的性能。是否有一个良好的数据结构的那种情景?

Imagine you wanted to serialize and deserialize stackoverflow posts including their tags as space efficiently as possible (in binary), but also for performance when doing tag lookups. Is there a good datastructure for that kind of scenario?

#1大约有28532不同的标记,你可以创建一个表的所有标签并为其分配一个整数,此外,你还可以按频率进行排序,从而使最常见的标签有最低的数字。还是将它们存储只是像格式的字符串1 32 45似乎从搜索和存储角度看有点inefficent博思

Stackoverflow has about 28532 different tags, you could create a table with all tags and assign them an integer, Furthermore you could sort them by frequency so that the most common tags have the lowest numbers. Still storing them simply like a string in the format "1 32 45" seems a bit inefficent borth from a searching and storing perspective

另一个想法是保存标记为可变bitarray这是从查找和序列透视吸引力。由于最常见的标签是首先你可能会适合标记成少量内存。

Another idea would be to save tags as a variable bitarray which is attractive from a lookup and serializing perspective. Since the most common tags are first you potentially could fit tags into a small amount of memory.

问题是当然,这不常见的标签将产生巨大的bitarrays的。是否有压缩为bitarrays 0的大跨度任何标准?还是应该一整使用一些其他的结构?

The problem would be of course that uncommon tags would yield huge bitarrays. Is there any standard for "compressing" bitarrays for large spans of 0's? Or should one use some other structure completely?

修改

我不找一个数据库解决方案,或者我需要保持在内存中整个表的解决方案,但是对于过滤个别项目

I'm not looking for a DB solution or a solution where I need to keep entire tables in memory, but a structure for filtering individual items

推荐答案

不破坏你的问题,但28K的记录实在是没有那么多。你也许过早优化?
我先坚持用常规指数上一个数据库表。他们使用的harshing启发式通常是非常有效的,不平凡的打(或者如果可以的话是不是真的值得时间的努力和很大收益就够了吗?)。

Not to undermine your question but 28k records is really not all that many. Are you perhaps optimizing prematurely? I would first stick to using 'regular' indices on a DB table. The harshing heuristics they use are typically very efficient and not trivial to beat (or if you can is it really worth the effort in time and are the gains large enough?).

也取决于你实际做标记查询,确实注意到了200ms的时间用户获得您优化

Also depending on where you actually do the tag query, is the user really noticing the 200ms time gain you optimized for?

第一小节然后优化: - )

First measure then optimize :-)

修改

如果没有DB我想可能有一个主表保存所有标签与标识一起(如果可能持有它在记忆中)。保持ID的常规排序列表连同每一个职位。

Without a DB I would probably have a master table holding all tags together with an ID (if possible hold it in memory). Keep a regular sorted list of IDs together with each post.

不知道有多少基于通用存储会有所帮助。排序的列表中,你可以做一个定期的二进制搜索可能被证明不够快;措施: - )

Not sure how much storage based on commonality would help. A sorted list in which you can do a regular binary search may prove fast enough; measure :-)

在这里,你会需要遍历为每个变量查询所有帖子,虽然

Here you would need to iterate all posts for every tag query though.

如果这最终被慢,你可以求助于后存储标识符的口袋,每个标记。这个数据结构可能会变得虽然有点大,可能需要一个文件来寻找和阅读反对。

If this ends up being to slow you could resort to storing a pocket of post identifiers for each tag. This data structure may become somewhat large though and may require a file to seek and read against.

对于较小的表,你可以求助于基于散列值来构建一(有重复)。这样,您就可以用它来快速获取到的职位需要进一步检查,看看它们是否匹配与否的一个较小的候选名单。

For a smaller table you could resort to build one based on a hashed value (with duplicates). This way you could use it to quickly get down to a smaller candidate list of posts that need further checking to see if they match or not.

这篇关于高效的数据结构网络化的标签?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆