Cassandra Full-Text Search

Question

Full-text search in Cassandra

I am fairly new to Cassandra and wish to understand it better. I am attempting to perform a full-text search in Cassandra, but after some research I have found that there may not be a "simple" approach for this. I say "may" because the first page of Google hasn't said much of anything.

So instead I am now trying to understand what the best approach is here. This led me to make my own assumptions based on what I've learned so far about Cassandra, which rests on two principles: a) design your tables based on your queries rather than the data, and b) more data is a good thing, as long as it is being used properly.

With that being said, I've come up with a couple of solutions I'd like to share. If anyone has a better idea, please fill me in before I commit to anything unreasonable/naive.

First Solution: Create a Column Family (CF) with a two-column primary key and an index, like so:

CREATE TABLE "FullTextSearch" (
"PartialText" text,
"TargetIdentifier" uuid,
"CompleteText" text,
"Type" int,
PRIMARY KEY ("PartialText","TargetIdentifier")
);
CREATE INDEX IX_FullTextSearch_Type ON "keyspace"."FullTextSearch" ("Type");

With the above table, I would need to insert rows for the text "Hello World" as follows:

-- One row per partial text; CQL string literals use single quotes.
BEGIN BATCH
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('H',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('He',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Hel',00000000-0000-0000-0000-000000000000,'Hello World',1);
-- .....
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Hello Wor',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Hello Worl',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Hello World',00000000-0000-0000-0000-000000000000,'Hello World',1);
-- .....
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Wor',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Worl',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('World',00000000-0000-0000-0000-000000000000,'Hello World',1);
APPLY BATCH;

Basically, the above will satisfy the following wildcards/partial texts: "%o W%", "Hello%", "Worl%". However, it will not satisfy partial words such as "%ell%" for "Hello", which I can live with for now... (OCD sorta kicks in here)
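The rows in the batch above could be generated rather than written by hand. The following is a minimal Python sketch, assuming the elided rows follow the same pattern as the visible ones: every prefix of the text that begins at the start of a word. The function name `partial_texts` and the choice to skip prefixes ending in whitespace are assumptions made for illustration, not part of the original post. Note that under this assumption a mid-word pattern such as "%o W%" would not be covered, for the same reason "%ell%" is not.

```python
def partial_texts(text):
    """Return all prefixes of `text` that begin at the start of a word.

    Assumption: prefixes ending in whitespace (e.g. "Hello ") are skipped,
    since they add nothing over the prefix without the trailing space.
    """
    # Indexes where a word begins: a non-space character at position 0
    # or directly after a whitespace character.
    starts = [
        i for i, ch in enumerate(text)
        if not ch.isspace() and (i == 0 or text[i - 1].isspace())
    ]
    partials = []
    for start in starts:
        for end in range(start + 1, len(text) + 1):
            if not text[end - 1].isspace():
                partials.append(text[start:end])
    return partials


if __name__ == "__main__":
    for partial in partial_texts("Hello World"):
        print(partial)
```

Each returned string would become one "PartialText" row for the same "TargetIdentifier". The number of rows grows roughly with the square of the text length, which is worth keeping in mind for long names.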

This approach sort of sucks for me because I would now have to delete and re-insert rows any time a save/name change occurs for a "TargetIdentifier".
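To get a feel for how much delete/re-insert work a rename involves, the difference between the old and new sets of partial texts can be computed up front. This is a hedged Python sketch under the same assumption as before (partial texts are word-start prefixes); `partial_texts` and `rename_delta` are hypothetical helper names, not anything provided by Cassandra.

```python
def partial_texts(text):
    """Set of prefixes of `text` that begin at a word start (assumed scheme)."""
    starts = [
        i for i, ch in enumerate(text)
        if not ch.isspace() and (i == 0 or text[i - 1].isspace())
    ]
    return {
        text[s:e]
        for s in starts
        for e in range(s + 1, len(text) + 1)
        if not text[e - 1].isspace()
    }


def rename_delta(old_text, new_text):
    """Return (to_delete, to_insert, to_update) partial texts for a rename.

    to_delete: partial texts that only matched the old name
    to_insert: partial texts that only match the new name
    to_update: partial texts shared by both; their rows still need the
               "CompleteText" column rewritten, because it is stored per row
    """
    old = partial_texts(old_text)
    new = partial_texts(new_text)
    return sorted(old - new), sorted(new - old), sorted(old & new)


if __name__ == "__main__":
    to_delete, to_insert, to_update = rename_delta("Hello World", "Hello Words")
    print("delete:", to_delete)
    print("insert:", to_insert)
    print("update:", to_update)
```

Even a small rename touches every row for that identifier, since the shared partial texts still carry the old "CompleteText" value.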

Second Solution: very similar, only this time making use of wide columns, where the table might look like:

CREATE TABLE "FullTextSearch" (
"TargetIdentifier" uuid,
"Type" int,
"CompleteText" text,
PRIMARY KEY("TargetIdentifier")
);

and then, during a search, something like:

SELECT * FROM "FullTextSearch" WHERE "He" = 1;

so that if the column exists, the respective rows are returned.

Third Solution: similar to the one above, only this time, instead of using wide columns, we use a collection column such as a map for the partial texts, and perform a query like:

SELECT * FROM "FullTextSearch" WHERE "PartialTexts"['He'] = 1;
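For the map variant, the value written to "PartialTexts" would be a map from each partial text to 1. The Python sketch below builds that map as a CQL map literal; the helper names are hypothetical, the word-start-prefix scheme is an assumption carried over from the first solution, and in real code a driver with bound parameters would be preferable to string building. As far as I know, a query on a map entry like the one above would also need a secondary index on the map's entries, and collections are intended for fairly small amounts of data.

```python
def partial_texts(text):
    """Set of prefixes of `text` that begin at a word start (assumed scheme)."""
    starts = [
        i for i, ch in enumerate(text)
        if not ch.isspace() and (i == 0 or text[i - 1].isspace())
    ]
    return {
        text[s:e]
        for s in starts
        for e in range(s + 1, len(text) + 1)
        if not text[e - 1].isspace()
    }


def cql_string(value):
    """Quote a string as a CQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def partial_text_map_literal(text):
    """Render the assumed "PartialTexts" map (partial text -> 1) as CQL."""
    entries = ", ".join(
        f"{cql_string(partial)}: 1" for partial in sorted(partial_texts(text))
    )
    return "{" + entries + "}"


if __name__ == "__main__":
    print(partial_text_map_literal("Hello World"))
```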

Anyway, I am all out of ideas, it is late, and I can only hope for a great response! Please let me know what I should be doing here... am I even on the right path?

Recommended Answer

AFAIK, DataStax Enterprise Search is the (commercial) successor of Solandra.

Cassandra 2.0 supports so-called "custom secondary indexes". Custom secondary indexes are Java code. Your own implementation has to extend the abstract class org.apache.cassandra.db.index.SecondaryIndex (see http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/create_index_r.html).

I'm not sure whether implementations exist for Elasticsearch or Solr.

I would not recommend coding all the weird full-text search logic yourself, such as stemming, multiple/exotic language support, or even geospatial stuff.

But SecondaryIndex would be a good place to start integrating your favorite search engine.
