Indexing using Redis sorted sets

Question

I would like to get some feedback and suggestions regarding two approaches I'm considering to implementing searchable indexes using Redis sorted sets.

Situation and goals

We currently have some key-value tables we're storing in Cassandra, and which we would like to have indexes for. For example, one table would contain records of people, and the Cassandra table would have id as its primary key, and the serialized object as the value. The object would have fields such as first_name, last_name, last_updated, and others.

What we want is to be able to do searches such as "last_name = 'Smith' AND first_name > 'Joel'" , "last_name < 'Aaronson'" , "last_name = 'Smith' AND first_name = 'Winston'" and so on. The searches should yield the ids of matches so we can then retrieve the objects from Cassandra. I'm thinking the above searches could be done with a single index, sorted lexicographically by last_name, first_name, and last_updated. If we need some searches using a different order (e.g. "first_name = 'Zeus'") we can have a similar index that would allow those (e.g. first_name, last_updated).

We are looking at using Redis for this, because we need to be able to handle a large number of writes per minute. I've read up on some common ways Redis sorted sets are used, and come up with two possible implementations:

Option 1: A single sorted set per index

For our index by last_name, first_name, last_updated, we would have a sorted set in Redis under the key indexes:people:last_name:first_name:last_updated , which would contain strings with the format last_name:first_name:last_updated:id . For example:

smith:joel:1372761839.444:0azbjZRHTQ6U8enBw6BJBw

(For the separator I might use '::' rather than ':' or something else to work better with the lexicographic ordering, but let's ignore that for now)

The items would all be given score 0 so that the sorted set will just be sorted lexicographically by the strings themselves. If I then want to do a query like "last_name = 'smith' AND first_name < 'bob'", I would need to get all the items in the list that come before 'smith:bob'.
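To make the ordering concrete, here is a minimal sketch of Option 1 that uses a plain sorted Python list as a stand-in for a sorted set whose members all have score 0. The helper names (`index_member`, `last_name_with_first_name_before`) and the sample records other than the Joel Smith entry are hypothetical, and the sketch assumes field values never contain the separator.

```python
import bisect

SEP = ":"  # separator from the post; assumes field values never contain it


def index_member(last_name, first_name, last_updated, record_id):
    """Build a member string of the form last_name:first_name:last_updated:id."""
    return SEP.join([last_name, first_name, str(last_updated), record_id])


# A sorted list stands in for a sorted set in which every member has score 0,
# so the order is determined purely by the member strings.
members = sorted([
    index_member("smith", "alice", 1372761000.0, "id-alice"),
    index_member("smith", "joel", 1372761839.444, "0azbjZRHTQ6U8enBw6BJBw"),
    index_member("smithers", "al", 1372762000.0, "id-al"),
    index_member("jones", "zed", 1372760000.0, "id-zed"),
])


def last_name_with_first_name_before(members, last_name, first_name):
    """Answer: last_name = X AND first_name < Y, as a contiguous slice."""
    lo = bisect.bisect_left(members, last_name + SEP)               # first "X:..." entry
    hi = bisect.bisect_left(members, last_name + SEP + first_name)  # first entry >= "X:Y"
    return members[lo:hi]


matches = last_name_with_first_name_before(members, "smith", "bob")
ids = [m.rsplit(SEP, 1)[1] for m in matches]  # ids to look up in Cassandra
```

Because the whole query maps to one contiguous slice, a single composite index can serve any query that fixes a prefix of the fields and bounds the next one.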

As far as I can tell, there are the following drawbacks to this approach:


  1. There is no Redis function to select a range based on the string value. This feature, called ZRANGEBYLEX, has been proposed by Salvatore Sanfilippo at https://github.com/antirez/redis/issues/324 , but is not implemented, so I would have to find the endpoints using binary searches and get the range myself (perhaps using Lua, or at the application-level with Python which is the language we're using to access Redis).
  2. If we want to include a time-to-live for index entries, it seems the simplest way to do it would be having a regularly scheduled task which goes through the whole index and removes expired items.
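The binary search mentioned in the first drawback could look roughly like the sketch below. `FakeSortedSet` is a hypothetical in-memory stand-in that only exposes rank-based reads modeled on ZCARD and ZRANGE; against a real server each probe would be a round trip unless the loop runs inside a Lua script. Note that later Redis releases (2.8.9 onward) did add ZRANGEBYLEX, so this workaround mainly applies to the situation described in the post.

```python
class FakeSortedSet:
    """Hypothetical stand-in for a score-0 sorted set with rank-based reads."""

    def __init__(self, members):
        # Python compares str by code point; for ASCII this matches byte order.
        self._members = sorted(members)

    def zcard(self):
        return len(self._members)

    def zrange(self, start, stop):
        # Like Redis ZRANGE, stop is inclusive.
        return self._members[start:stop + 1]


def lex_lower_bound(zset, target):
    """Rank of the first member >= target, found by binary search on ranks."""
    lo, hi = 0, zset.zcard()
    while lo < hi:
        mid = (lo + hi) // 2
        if zset.zrange(mid, mid)[0] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo


def lex_range(zset, start, end):
    """All members m with start <= m < end."""
    first = lex_lower_bound(zset, start)
    last = lex_lower_bound(zset, end)
    if first >= last:
        return []
    return zset.zrange(first, last - 1)


zset = FakeSortedSet([
    "jones:zed:1:a",
    "smith:alice:2:b",
    "smith:joel:3:c",
    "smithers:al:4:d",
])
result = lex_range(zset, "smith:", "smith:bob")
```

Each endpoint costs about log2(N) single-member reads, followed by one range read for the results.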

Option 2: Small sorted sets, sorted by last_updated

This approach would be similar, except we would have many, smaller, sorted sets, with each having a time-like value such as last_updated for the scores. For example, for the same last_name, first_name, last_updated index, we would have a sorted set for each last_name, first_name combination. For example, the key might be indexes:people:last_name=smith:first_name=joel , and it would have an entry for each person we have called Joel Smith. Each entry would have as its name the id and its score the last_updated value. E.g.:

value: 0azbjZRHTQ6U8enBw6BJBw ; score: 1372761839.444

The main advantages to this are (a) searches where we know all the fields except last_updated would be very easy, and (b) implementing a time-to-live would be very easy, using ZREMRANGEBYSCORE.
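As a rough illustration of Option 2, the sketch below models each small sorted set as a Python dict of id to score and mimics the effect of ZREMRANGEBYSCORE for the time-to-live case. The helper names, the `older-id` record, and the one-hour TTL are assumptions made for the example; nothing here talks to a real Redis server.

```python
def index_key(last_name, first_name):
    """Key layout from the post: one sorted set per (last_name, first_name)."""
    return f"indexes:people:last_name={last_name}:first_name={first_name}"


# Each key maps to {id: score}, standing in for one small sorted set.
index = {}


def add_entry(last_name, first_name, record_id, last_updated):
    """Mimic ZADD: the member is the id, the score is last_updated."""
    index.setdefault(index_key(last_name, first_name), {})[record_id] = last_updated


def remove_by_score(key, min_score, max_score):
    """Mimic ZREMRANGEBYSCORE: drop members with min_score <= score <= max_score."""
    zset = index.get(key, {})
    expired = [member for member, score in zset.items()
               if min_score <= score <= max_score]
    for member in expired:
        del zset[member]
    return len(expired)


def expire_older_than(key, ttl_seconds, now):
    """Time-to-live: remove every entry last updated before now - ttl_seconds."""
    return remove_by_score(key, float("-inf"), now - ttl_seconds)


add_entry("smith", "joel", "0azbjZRHTQ6U8enBw6BJBw", 1372761839.444)
add_entry("smith", "joel", "older-id", 1372700000.0)

key = index_key("smith", "joel")
removed = expire_older_than(key, ttl_seconds=3600, now=1372762000.0)
remaining = list(index[key])
```

The expiry itself is a single range removal per key, but it still has to be issued once for every key, which ties into the key-tracking problem described next.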

The drawback, which seems very large to me, is:


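  1. There seems to be a lot of complexity in managing and searching this way. For example, we would need the index to keep track of all of its keys (in case, for example, we want to clean up at some point) and to do so in a hierarchical manner. A search such as "last_name < 'smith'" would require first looking at the list of all last names to find those that come before smith, then for each of those looking up all the first names it contains, then for each of those getting all the items from the sorted set. In other words, a lot of components to build and worry about.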

Wrapping up

So it seems to me the first option would be better, in spite of its drawbacks. I would very much appreciate any feedback regarding these two or other possible solutions (even if they're that we should use something other than Redis).

Recommended answer


  1. I strongly discourage the use of Redis for this. You'll be storing a ton of extra pointer data, and if you ever decide you want to do more complicated queries like, SELECT WHERE first_name LIKE 'jon%' you're going to run into trouble. You'll also need to engineer extra, very big indexes that cross multiple columns, in case you want to search for two fields at the same time. You'll essentially need to keep hacking away and reengineering a search framework. You'd be much better off using Elastic Search or Solr, or any of the other frameworks already built to do what you're trying to do. Redis is awesome and has lots of good uses. This is not one of them.

Warning aside, to answer your actual question: I think you'd be best served using a variant of your first solution. Use a single sorted set per index, but just convert your letters to numbers. Convert your letters to some decimal value. You can use the ASCII value, or just assign each letter to a 1-26 value in lexicographic order, assuming you're using English. Standardize, so that each letter takes up the same numeric length (so, if 26 is your biggest number, 1 would be written "01"). Then just append these together with a decimal point in front and use that as your score per index (i.e. "hat" would be ".080120"). This will let you have a properly ordered 1-to-1 mapping between words and these numbers. When you search, convert from letters to numbers, and then you'll be able to use all of Redis' nice sorted set functions like ZRANGEBYSCORE without needing to rewrite them. Redis' functions are written very, very optimally, so you're much better off using them whenever possible instead of writing your own.
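The letter-to-number mapping described above can be sketched as follows. The function name `word_to_score` is hypothetical, and the sketch assumes lowercase English letters only, as the answer does. One caveat worth keeping in mind: Redis stores scores as double-precision floats, which carry roughly 15 to 17 significant decimal digits, so at two digits per letter only about the first seven or eight letters can reliably influence the ordering.

```python
def word_to_score(word):
    """Map a word to a fractional score: a=01 ... z=26, prefixed with '0.'."""
    word = word.lower()
    if not all("a" <= ch <= "z" for ch in word):
        raise ValueError("this sketch only handles the letters a-z")
    # Two digits per letter keeps every letter the same width, so numeric
    # order matches lexicographic order (within float precision).
    digits = "".join(f"{ord(ch) - ord('a') + 1:02d}" for ch in word)
    return float("0." + digits) if digits else 0.0


hat_score = word_to_score("hat")  # digits "080120", as in the answer's example

# Range bounds for a query such as: "ha" <= word < "hb"
low, high = word_to_score("ha"), word_to_score("hb")
in_range = [w for w in ["gum", "hat", "hate", "hub"]
            if low <= word_to_score(w) < high]
```

With scores like these, a bounded query becomes a ZRANGEBYSCORE call between `low` and `high`, at the cost of losing distinctions between long words that share a long common prefix.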
