Redis sorted sets and best way to store uids


Question

I have data consisting of user_ids and tags of these user ids. The user_ids occur multiple times and have pre-specified number of tags (500) however that might change in the feature. What must be stored is the user_id, their tags and their count. I want later to easily find tags with top score.. etc. Every time a tag appears it is incremented

My implementation in redis is done using sorted sets

  • every user_id is a sorted set

  • key is user_id and is a hex number

It works like this:

zincrby user_id:x 1 "tag0"

zincrby user_id:x 1 "tag499"

zincrby user_id:y 1 "tag3"

and so on.

Having in mind that I want to get the tags with the highest score, is there a better way?

The second issue is that right now I'm using "keys *" to retrieve these keys for client-side manipulation, which I know is not aimed at production systems.

Also, to avoid memory problems it would be great to iterate through a specified number of keys at a time (in the range of 10000). I know the keys have to be stored in memory; however, they don't follow a specific pattern that would allow for partial retrieval, so that I could avoid the "zmalloc" error (4 GB, 64-bit Debian server). The keys amount to around 20 million. Any thoughts?

Answer

My first point would be to note that 4 GB is tight for storing 20M sorted sets. A quick try shows that 20M users, each of them with 20 tags, would take about 8 GB on a 64-bit box (and that accounts for the sorted set ziplist memory optimization provided with Redis 2.4; don't even try this with earlier versions).

Sorted sets are the ideal data structure to support your use case. I would use them exactly as you described.
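
To get the tags with the highest score back, a ZREVRANGE on the user's key is enough. A minimal sketch, where user_id:x and the tag names are just the placeholder values from the question:

zrevrange user_id:x 0 9 WITHSCORES

This returns the 10 highest-scored tags of user_id:x together with their counts.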

As you pointed out, KEYS cannot be used to iterate on keys. It is rather meant as a debugging command. To support key iteration, you need to add a data structure that provides this access path. The only structures in Redis which can support iteration are the list and the sorted set (through the range methods). However, they tend to transform O(n) iteration algorithms into O(n^2) (for the list) or O(n log n) (for the zset). A list is also a poor choice for storing keys, since it will be difficult to maintain as keys are added/removed.

A more efficient solution is to add an index composed of regular sets. You need to use a hash function to associate a specific user with a bucket, and add the user id to the set corresponding to this bucket. If the user ids are numeric values, a simple modulo function will be enough. If they are not, a simple string hashing function will do the trick.

So to support iteration on user:1000, user:2000 and user:1001, let's choose a modulo 1000 function. user:1000 and user:2000 will be put in bucket index:0 while user:1001 will be put in bucket index:1.

So on top of the zsets, we now have the following keys:

index:0 => set[ 1000, 2000 ]
index:1 => set[ 1001 ]

In the sets, the prefix of the keys is not needed; this allows Redis to optimize memory consumption by serializing the sets, provided they are kept small enough (the integer set optimization proposed by Sripathi Krishnan).
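
If you want to verify that this optimization kicks in, OBJECT ENCODING can be used as a quick sanity check (index:0 is the bucket key from the example above):

object encoding index:0

As long as a bucket contains only integers and stays below the set-max-intset-entries limit (512 by default), this should report the "intset" encoding rather than "hashtable".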

The global iteration consists of a simple loop over the buckets from 0 to 1000 (exclusive). For each bucket, the SMEMBERS command is applied to retrieve the corresponding set, and the client can then iterate over the individual items.

Here is an example in Python:

#!/usr/bin/env python3
# ----------------------------------------------------

import random

import redis

# decode_responses=True makes redis-py return str instead of bytes
POOL = redis.ConnectionPool(host='localhost', port=6379, db=0,
                            decode_responses=True)

NUSERS = 10000
NTAGS = 500
NBUCKETS = 1000

# ----------------------------------------------------
# Fill redis with some random data

def fill(r):
  p = r.pipeline()
  # Create only 10000 users for this example
  for uid in range(NUSERS):
    user = "user:%d" % uid
    # Add the user to the index: a simple modulo is used to hash the user id
    # and put it in the correct bucket
    p.sadd("index:%d" % (uid % NBUCKETS), uid)
    # Add 20 random tags to the user
    for x in range(20):
      tag = "tag:%d" % random.randrange(NTAGS)
      # Note: redis-py >= 3.0 takes (name, amount, value)
      p.zincrby(user, 1, tag)
    # Flush the pipeline every 1000 users
    if uid % 1000 == 0:
      p.execute()
      print(uid)
  # Flush one last time
  p.execute()

# ----------------------------------------------------
# Iterate on all the users and display their 5 highest ranked tags

def iterate(r):
  # Iterate on the buckets of the key index
  # The range depends on the function used to hash the user id
  for x in range(NBUCKETS):
    # Iterate on the users in this bucket
    for uid in r.smembers("index:%d" % x):
      user = "user:%d" % int(uid)
      print(user, r.zrevrangebyscore(user, "+inf", "-inf",
                                     start=0, num=5, withscores=True))

# ----------------------------------------------------
# Main function

def main():
  r = redis.Redis(connection_pool=POOL)
  r.flushall()
  m = r.info()["used_memory"]
  fill(r)
  info = r.info()
  print("Keys: ", info["db0"]["keys"])
  print("Memory: ", info["used_memory"] - m)
  iterate(r)

# ----------------------------------------------------

if __name__ == "__main__":
  main()

By tweaking the constants, you can also use this program to evaluate the global memory consumption of this data structure.

IMO this strategy is simple and efficient, because it offers O(1) complexity to add/remove users, and true O(n) complexity to iterate over all items. The only downside is that the key iteration order is random.
