When to use Cassandra vs. Solr in DSE?

Question

I'm using DSE for Cassandra/Solr integration, so that data is stored in Cassandra and indexed in Solr. It's natural to use Cassandra for CRUD operations and Solr for full-text search, and DSE really does simplify data synchronization between Cassandra and Solr.

When it comes to querying, however, there are actually two ways to go: Cassandra secondary/manually maintained indexes vs. Solr. I want to know when to use which method and what the performance difference is in general, especially under a DSE setup.

Here is one example use case from my project. I have a Cassandra table storing some item entity data. Besides the basic CRUD operations, I also need to retrieve items by equality on some field (say category) and then sort them in some order (in my case, by a like_count field).
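
To make the use case concrete, here is a minimal sketch of such a base table using the Python DataStax driver; the keyspace name (shop), the column names, and the replication settings are assumptions for illustration only.

```python
# A minimal sketch of the base table described above.
# Keyspace, table, and column names are illustrative assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # contact point of a DSE/Cassandra node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Base table: one row per item, keyed by id. Plain CRUD works well here,
# but "items in category X ordered by like_count" cannot be answered
# efficiently from this table alone.
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.items (
        id         uuid PRIMARY KEY,
        category   text,
        name       text,
        like_count int
    )
""")
```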

I can think of three different ways to handle it:

  1. Declare indexed="true" for the category and like_count fields in the Solr schema and query through Solr
  2. Create a denormalized table in Cassandra with primary key (category, like_count, id)
  3. Create a denormalized table in Cassandra with primary key (category, order, id) and use an external component such as Spark/Storm to sort the items by like_count

The first method seems to be the simplest to implement and maintain. I just write some trivial Solr-accessing code and the rest of the heavy lifting is handled by Solr/DSE Search.
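
A rough sketch of what that could look like from CQL via the Python driver, assuming a Solr core has already been created for the table (for example with dsetool create_core shop.items generateResources=true) and reusing the assumed shop.items schema from the sketch above:

```python
# Sketch of method 1: query through DSE Search using the special solr_query
# column from CQL. Assumes a Solr core exists for shop.items and that
# category and like_count are indexed fields in its schema.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

rows = session.execute("""
    SELECT id, name, like_count
    FROM items
    WHERE solr_query = '{"q": "category:books", "sort": "like_count desc"}'
    LIMIT 20
""")
for row in rows:
    print(row.id, row.name, row.like_count)
```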

The second method requires manual denormalization on create and update. I also need to maintain a separate table. There is also a tombstone issue, since like_count may be updated frequently. The good part is that reads may be faster (as long as there aren't excessive tombstones).
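
A sketch of what that denormalized table and the tombstone-producing update could look like, again with assumed table and column names:

```python
# Sketch of method 2: a denormalized "items by category" table, sorted by
# like_count within each category partition. Names are assumptions.
from uuid import uuid4
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")

session.execute("""
    CREATE TABLE IF NOT EXISTS items_by_category (
        category   text,
        like_count int,
        id         uuid,
        name       text,
        PRIMARY KEY ((category), like_count, id)
    ) WITH CLUSTERING ORDER BY (like_count DESC, id ASC)
""")

# Reading "top items in a category" is now a single-partition slice,
# already ordered by like_count DESC on disk.
rows = session.execute(
    "SELECT id, name, like_count FROM items_by_category "
    "WHERE category = %s LIMIT 20", ("books",))

# Because like_count is part of the primary key, bumping it means deleting
# the old row and inserting a new one -- this is where the tombstones come from.
item_id, old_count, new_count = uuid4(), 10, 11
batch = BatchStatement()
batch.add(SimpleStatement(
    "DELETE FROM items_by_category "
    "WHERE category = %s AND like_count = %s AND id = %s"),
    ("books", old_count, item_id))
batch.add(SimpleStatement(
    "INSERT INTO items_by_category (category, like_count, id, name) "
    "VALUES (%s, %s, %s, %s)"),
    ("books", new_count, item_id, "Some title"))
session.execute(batch)
```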

The third method can alleviate the tombstone issue at the cost of one extra component for sorting.
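
Purely as an illustration, a sorting step like that could be done with Spark and the spark-cassandra-connector DataFrame source; whether Spark runs inside DSE Analytics or externally is not specified in the question, so treat the setup below as an assumption:

```python
# Sketch of method 3: keep only the plain items table in Cassandra and let
# Spark do the per-category filtering and sorting. Names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("top-items-by-category").getOrCreate()

items = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="shop", table="items")
         .load())

top_books = (items
             .filter(items.category == "books")
             .orderBy(items.like_count.desc())
             .limit(20))

top_books.show()
```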

Which method do you think is the best option? What is the difference in performance?

Answer

Cassandra secondary indexes have limited use cases:

  1. No more than a couple of columns indexed.
  2. Only one indexed column in a query.
  3. Too much inter-node traffic for high-cardinality data (relatively unique column values).
  4. Too much inter-node traffic for low-cardinality data (a high percentage of the rows match).
  5. Queries need to be known in advance so that the data model can be optimized around them.

Because of these limitations, it is common for apps to create "index tables" which are keyed by whatever column is needed. This requires either that data be duplicated from the main table into each index table, or an extra query to read the index table and then read the actual row from the main table after reading its primary key from the index table. Queries on multiple columns have to be manually indexed in advance, which makes ad hoc queries problematic. And any duplicated data has to be manually updated by the app into each index table.
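
A minimal sketch of that index-table pattern, reusing the assumed shop.items schema from the question; the items_by_name table and its columns are made up for illustration:

```python
# Sketch of the "index table" pattern: a lookup table keyed by the indexed
# column that stores only the primary key of the main table, followed by a
# second query against the main table. Names are assumptions.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

session.execute("""
    CREATE TABLE IF NOT EXISTS items_by_name (
        name text,
        id   uuid,
        PRIMARY KEY ((name), id)
    )
""")

# Query 1: read the index table to find the matching primary keys.
ids = [r.id for r in session.execute(
    "SELECT id FROM items_by_name WHERE name = %s", ("Some title",))]

# Query 2: read the actual rows from the main table.
for item_id in ids:
    row = session.execute(
        "SELECT id, category, like_count FROM items WHERE id = %s",
        (item_id,)).one()
    print(row)
```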

Other than that... they will work fine in cases where a "modest" number of rows will be selected from a modest number of nodes, and queries are well specified in advance and not ad hoc.
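
For comparison, a sketch of that native secondary index case on the assumed shop.items table; note that the index only covers the equality part of the question's query, not the like_count ordering:

```python
# Sketch of a built-in secondary index for the "modest, well-known query"
# case described above: index the category column on the main table and
# query it by equality. Names are assumptions.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

session.execute(
    "CREATE INDEX IF NOT EXISTS items_category_idx ON items (category)")

# Equality lookup on the indexed column; this cannot also sort by
# like_count, which is why the question needs one of the other approaches.
rows = session.execute(
    "SELECT id, name, like_count FROM items WHERE category = %s", ("books",))
for row in rows:
    print(row.id, row.name, row.like_count)
```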

DSE/Solr is better for:

  1. A moderate number of columns indexed.
  2. Complex queries that reference multiple columns/fields - Lucene matches all of the specified fields in a query in parallel. Lucene indexes the data on each node, so the nodes query in parallel.
  3. Ad hoc queries in general, where the exact queries are not known in advance.
  4. Rich text queries such as keyword search, wildcard, fuzzy/like, range, and inequality (see the sketch after this list).
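
As an illustration of the kind of ad hoc, rich-text query that is awkward in plain CQL but straightforward through DSE Search, here is a sketch reusing the assumed shop.items core from above; the wildcard, range, and multi-field filters are just examples:

```python
# Sketch of an ad hoc DSE Search query combining a wildcard match, a range
# filter, and a multi-value filter, sorted by like_count. Field names and
# the existing Solr core are assumptions carried over from earlier sketches.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

rows = session.execute("""
    SELECT id, name, category, like_count
    FROM items
    WHERE solr_query = '{"q": "name:harry*",
                         "fq": "like_count:[100 TO *] AND category:(books OR comics)",
                         "sort": "like_count desc"}'
    LIMIT 20
""")
for row in rows:
    print(row.name, row.category, row.like_count)
```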

There is a performance and capacity cost to using Solr indexing, so a proof-of-concept implementation is recommended to evaluate how much additional RAM, storage, and how many extra nodes are needed. That depends on how many columns you index, the amount of text indexed, and any text filtering complexity (e.g., n-grams need more). The overhead could range from a 25% increase for a relatively small number of indexed columns to 100% if all columns are indexed. Also, you need enough nodes that the per-node Solr index fits in RAM, or mostly in RAM if using SSDs. And vnodes are not currently recommended for Solr data centers.
