When to use Cassandra vs. Solr in DSE?

Question

I'm using DSE for Cassandra/Solr integration, so data is stored in Cassandra and indexed in Solr. It's very natural to use Cassandra for CRUD operations and Solr for full-text search, and DSE really simplifies data synchronization between Cassandra and Solr.

When it comes to querying, however, there are actually two ways to go: Cassandra secondary/manually configured indexes vs. Solr. I want to know when to use which method and what the performance difference is in general, especially under a DSE setup.

Here is one example use case from my project. I have a Cassandra table storing some item entity data. Besides the basic CRUD operations, I also need to retrieve items by equality on some field (say, category) and then sort them by some order (in my case, a like_count field).

I can think of three different ways to handle it (a rough sketch of the first two follows the list):

  1. Declare 'indexed=true' in the Solr schema for both the category and like_count fields, and query in Solr
  2. Create a denormalized table in Cassandra with primary key (category, like_count, id)
  3. Create a denormalized table in Cassandra with primary key (category, order, id) and use an external component, such as Spark/Storm, to sort the items by like_count
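For concreteness, here is a minimal sketch of how the read path for options 1 and 2 might look from application code, using the Python cassandra-driver. The keyspace and table names (shop, items, items_by_category) are made up for illustration; the solr_query pseudo-column is the hook DSE Search exposes for running Solr queries through CQL.

```python
# Illustrative sketch only: keyspace/table names are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")

# Option 1: let DSE Search/Solr filter and sort via the solr_query column.
rows_solr = session.execute("""
    SELECT id, category, like_count FROM items
    WHERE solr_query = '{"q": "category:electronics", "sort": "like_count desc"}'
    LIMIT 20
""")

# Option 2: read a denormalized table whose clustering order already
# matches the query, so Cassandra returns the rows pre-sorted.
rows_cass = session.execute("""
    SELECT id, category, like_count FROM items_by_category
    WHERE category = 'electronics'
    LIMIT 20
""")

for row in rows_cass:
    print(row.id, row.like_count)
```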

The first method seems to be the simplest to implement and maintain. I just write some trivial Solr access code, and the rest of the heavy lifting is handled by Solr/DSE Search.

The second method requires manual denormalization on create and update. I also need to maintain a separate table. There is also a tombstone issue, since like_count may be updated frequently. The good part is that reads can be faster (as long as tombstones don't pile up).
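To make the tombstone concern concrete, here is a rough sketch of option 2's write path (hypothetical names again): because like_count is a clustering column, "updating" it really means deleting the old clustering row and inserting a new one, and each delete leaves a tombstone behind.

```python
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")

session.execute("""
    CREATE TABLE IF NOT EXISTS items_by_category (
        category   text,
        like_count int,
        id         uuid,
        PRIMARY KEY (category, like_count, id)
    ) WITH CLUSTERING ORDER BY (like_count DESC, id ASC)
""")

def bump_like_count(category, item_id, old_count, new_count):
    """Move an item to its new like_count position within the partition."""
    batch = BatchStatement()
    # Deleting the old clustering row is what creates the tombstone.
    batch.add(
        "DELETE FROM items_by_category WHERE category = %s AND like_count = %s AND id = %s",
        (category, old_count, item_id),
    )
    batch.add(
        "INSERT INTO items_by_category (category, like_count, id) VALUES (%s, %s, %s)",
        (category, new_count, item_id),
    )
    session.execute(batch)
```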

The third method can alleviate the tombstone issue at the cost of one extra component for sorting.
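A minimal sketch of option 3, assuming a hypothetical table items_by_category_v2 keyed by (category, order, id): the partition is read as stored and the like_count ordering is applied outside Cassandra. A Spark or Storm job would do the same thing at scale; a plain client-side sort is enough to show the idea.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")

# Read the whole category partition in its stored order.
rows = session.execute(
    "SELECT id, like_count FROM items_by_category_v2 WHERE category = %s",
    ("electronics",),
)

# Sort by like_count in the external component (here: the client itself).
top_items = sorted(rows, key=lambda r: r.like_count, reverse=True)[:20]
```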

Which method do you think is the best option? What is the performance difference?

Answer

Cassandra secondary indexes have limited use cases:

  1. No more than a couple of columns indexed.
  2. Only a single indexed column in a query.
  3. Too much inter-node traffic for high-cardinality data (relatively unique column values).
  4. Too much inter-node traffic for low-cardinality data (a high percentage of rows will match).
  5. Queries need to be known in advance so the data model can be optimized around them.

Because of these limitations, it is common for apps to create "index tables" which are indexed by whatever column is needed. This requires either duplicating data from the main table into each index table, or an extra query: first read the index table to get the primary key, then read the actual row from the main table. Queries on multiple columns have to be manually indexed in advance, which makes ad hoc queries problematic. And any duplicated data has to be manually updated by the app in each index table.
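As a sketch of that index-table pattern (table and column names are hypothetical), the index table stores only the lookup column plus the main table's primary key, so every read becomes two round trips:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")

# Hand-maintained "index table": category -> ids of matching items.
session.execute("""
    CREATE TABLE IF NOT EXISTS items_by_category_idx (
        category text,
        id       uuid,
        PRIMARY KEY (category, id)
    )
""")

# Step 1: read the index table to collect matching primary keys.
ids = [r.id for r in session.execute(
    "SELECT id FROM items_by_category_idx WHERE category = %s",
    ("electronics",),
)]

# Step 2: fetch the actual rows from the main table by primary key.
items = [
    session.execute("SELECT * FROM items WHERE id = %s", (i,)).one()
    for i in ids
]
```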

Other than that, they will work fine in cases where a "modest" number of rows is selected from a modest number of nodes, and where the queries are well specified in advance rather than ad hoc.

DSE/Solr is better for:

  1. A moderate number of columns are indexed.
  2. Complex queries with a number of columns/fields referenced - Lucene matches all the specified fields in a query, and since it indexes the data on each node, the nodes query in parallel.
  3. Ad hoc queries in general, where the precise queries are not known in advance (a sketch follows this list).
  4. Rich text queries such as keyword search, wildcard, fuzzy/like, range, and inequality.
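For example, an ad hoc multi-field query combining a wildcard match, a range filter, and a sort can go through the same solr_query hook shown earlier; the name field here is a hypothetical indexed text column, not one from the question.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")

# Wildcard + boolean + range + sort in a single Solr query via CQL.
rows = session.execute("""
    SELECT id, category, like_count FROM items
    WHERE solr_query = '{"q": "name:*phone* AND category:electronics AND like_count:[100 TO *]", "sort": "like_count desc"}'
    LIMIT 20
""")

for row in rows:
    print(row.id, row.category, row.like_count)
```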

There is a performance and capacity cost to using Solr indexing, so a proof-of-concept implementation is recommended to evaluate how much additional RAM, storage, and how many extra nodes are needed. That depends on how many columns you index, the amount of text indexed, and the complexity of any text filtering (e.g., n-grams need more). The overhead could range from a 25% increase for a relatively small number of indexed columns to 100% if all columns are indexed. Also, you need enough nodes so that the per-node Solr index fits in RAM, or mostly in RAM if using SSDs. And vnodes are not currently recommended for Solr data centers.
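As a back-of-the-envelope illustration of that sizing exercise (every number below is a placeholder, not a DSE recommendation; only a proof of concept against your own data gives real figures):

```python
data_per_node_gb = 500       # hypothetical Cassandra data per node
index_overhead = 0.40        # assumed Solr overhead, somewhere in the 25%-100% range above
ram_for_index_gb = 128       # hypothetical RAM available per node for the Solr index

index_per_node_gb = data_per_node_gb * index_overhead      # 200 GB of index per node
fits_in_ram = index_per_node_gb <= ram_for_index_gb        # False -> add RAM or nodes
extra_node_factor = index_per_node_gb / ram_for_index_gb   # ~1.6x the node count to fit

print(index_per_node_gb, fits_in_ram, round(extra_node_factor, 2))
```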
