Redis sorted sets and best way to store uids


Question

I have data consisting of user_ids and tags for these user ids. The user_ids occur multiple times and have a pre-specified number of tags (500), however that might change in the future. What must be stored is the user_id, their tags, and their counts. I want to be able to easily find the tags with the top scores later, etc. Every time a tag appears it is incremented.

My implementation in Redis is done using sorted sets:

  • every user_id is a sorted set

  • the key is the user_id and is a hex number

It works as follows:

zincrby user_id:x 1 "tag0"

zincrby user_id:x 1 "tag0"

zincrby user_id:x 1 "tag499"

zincrby user_id:x 1 "tag499"

zincrby user_id:y 1 "tag3"

zincrby user_id:y 1 "tag3"

and so on.

Having in mind that I want to get the tags with the highest score, is there a better way?

The second issue is that right now I'm using "keys *" to retrieve these keys for client-side manipulation, which I know is not aimed at production systems.

Plus, it would be great for memory reasons to iterate through a specified number of keys at a time (in the range of 10000). I know that keys have to be stored in memory, however they don't follow a specific pattern that would allow partial retrieval, so that I could avoid the "zmalloc" error (4 GB, 64-bit Debian server). The keys amount to a range of 20 million. Any thoughts?

Answer

My first point would be to note that 4 GB is tight for storing 20M sorted sets. A quick try shows that 20M users, each of them with 20 tags, would take about 8 GB on a 64-bit box (and that accounts for the sorted set ziplist memory optimizations provided with Redis 2.4 - don't even try this with earlier versions).

Sorted sets are the ideal data structure to support your use case. I would use them exactly as you described.
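For the "top score" queries, a single command per user is enough. Here is a minimal redis-py sketch (the key name user:1000 and the top-5 range are just assumptions for illustration):

import redis

r = redis.Redis()

# Highest-scored tags first; ZREVRANGE costs O(log(N)+M) for M returned items
top5 = r.zrevrange("user:1000", 0, 4, withscores=True)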

As you pointed out, KEYS cannot be used to iterate on keys. It is rather meant as a debug command. To support key iteration, you need to add a data structure to provide this access path. The only structures in Redis which can support iteration are the list and the sorted set (through the range methods). However, they tend to transform O(n) iteration algorithms into O(n²) (for the list) or O(n log n) (for the zset). A list is also a poor choice to store keys, since it is difficult to maintain as keys are added/removed.

A more efficient solution is to add an index composed of regular sets. You need to use a hash function to associate a specific user with a bucket, and add the user id to the set corresponding to this bucket. If the user ids are numeric values, a simple modulo function will be enough. If they are not, a simple string hashing function will do the trick.
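As a sketch of that hashing step (the CRC32 fallback for non-numeric ids is just one possible choice, not something prescribed by the answer):

import zlib

NBUCKETS = 1000

def bucket_for(user_id):
  # Numeric ids: a simple modulo; string ids (e.g. hex): any stable string hash
  if isinstance(user_id, int):
    return user_id % NBUCKETS
  return zlib.crc32(user_id.encode()) % NBUCKETS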

So, to support iteration on user:1000, user:2000 and user:1001, let's choose a modulo 1000 function. user:1000 and user:2000 will be put in bucket index:0, while user:1001 will be put in bucket index:1.

So, on top of the zsets, we now have the following keys:

index:0 => set[ 1000, 2000 ]
index:1 => set[ 1001 ]

In the sets, the prefix of the keys is not needed, which allows Redis to optimize memory consumption by serializing the sets, provided they are kept small enough (the integer set optimization proposed by Sripathi Krishnan).
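For reference, the size thresholds that trigger these compact encodings are tunable in redis.conf; the values below are the defaults as of Redis 2.4-2.6 (check your version's configuration before relying on them):

set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64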

The global iteration consists of a simple loop on the buckets from 0 to 1000 (excluded). For each bucket, the SMEMBERS command is applied to retrieve the corresponding set, and the client can then iterate over the individual items.

Here is an example in Python:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# ----------------------------------------------------

import random
import redis

POOL = redis.ConnectionPool(host='localhost', port=6379, db=0)

NUSERS = 10000
NTAGS = 500
NBUCKETS = 1000

# ----------------------------------------------------
# Fill redis with some random data

def fill(r):
  p = r.pipeline()
  # Create only 10000 users for this example
  for id in range(0,NUSERS):
    user = "user:%d" % id
    # Add the user in the index: a simple modulo is used to hash the user id
    # and put it in the correct bucket
    p.sadd( "index:%d" % (id%NBUCKETS), id )
    # Add random tags to the user
    for x in range(0,20):
      # random.randrange keeps tag ids in 0..NTAGS-1 (500 distinct tags)
      tag = "tag:%d" % random.randrange(NTAGS)
      # redis-py 3.x argument order: zincrby(name, amount, value)
      p.zincrby(user, 1, tag)
    # Flush the pipeline every 1000 users
    if id % 1000 == 0:
      p.execute()
      print(id)
  # Flush one last time
  p.execute()

# ----------------------------------------------------
# Iterate on all the users and display their 5 highest ranked tags

def iterate(r):
  # Iterate on the buckets of the key index
  # The range depends on the function used to hash the user id
  for x in range(0,NBUCKETS):
    # Iterate on the users in this bucket
    for id in r.smembers( "index:%d"%(x) ):
      user = "user:%d" % int(id)
      print(user, r.zrevrangebyscore(user, "+inf", "-inf", start=0, num=5, withscores=True))

# ----------------------------------------------------
# Main function

def main():
  r = redis.Redis(connection_pool=POOL)
  r.flushall()
  m = r.info()["used_memory"]
  fill(r)
  info = r.info()
  print "Keys: ",info["db0"]["keys"]
  print "Memory: ",info["used_memory"]-m
  iterate(r)

# ----------------------------------------------------

if __name__ == "__main__":
  main()

By tweaking the constants, you can also use this program to evaluate the global memory consumption of this data structure.

IMO this strategy is simple and efficient, because it offers O(1) complexity to add/remove users, and true O(n) complexity to iterate over all items. The only downside is that the key iteration order is random.
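To make the O(1) add/remove concrete, here is a minimal removal sketch under the same key scheme (the helper name remove_user is my own, not from the original answer; it reuses NBUCKETS and a redis-py connection r from the script above):

def remove_user(r, id):
  # O(1): drop the user's tag zset and unregister the id from its bucket set
  p = r.pipeline()
  p.delete("user:%d" % id)
  p.srem("index:%d" % (id % NBUCKETS), id)
  p.execute()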
