Cassandra for a schemaless DB, tables on the order of tens of millions of rows, and millions of queries per day


Question

I am building a database, with the following characteristics:

  1. Schemaless database with a variable number of columns for each row.
  2. Tens of millions of records and tens of columns.
  3. Millions of queries per day.
  4. Thousands of writes per day.
  5. Queries will be filtering on several columns (not only the key).

I am considering Cassandra, which is built to scale.

My questions are:

  1. Do I need to scale horizontally in this case?
  2. Does Cassandra support having several keys to point to the same column-family?


EDIT

I would like to make sure that I understood you correctly, so the following example lays out what I took from your answer.

Suppose we have the following column family (it holds some store products and their details):

products // column-family name
{
x = {   "id":"x", // this is the unique id for the row
    "name":"Laptop",
    "screen":"15 inch",
    "OS":"Windows"}
y = {   "id":"y", // this is the unique id for the row
    "name":"Laptop",
    "screen":"17 inch"}
z = {   "id":"z", // this is the unique id for the row
    "name":"Printer",
    "page per minute":"20 pages"}
}

If we then want to add a "name" search parameter, we create another copy of the CF with different row keys, as follows:

products
{
"x:name:Laptop"  = {    "id":"x", 
            "name":"Laptop",
            "screen":"15 inch",
            "OS":"Windows"}
"y:name:Laptop"  = {    "id":"y", 
            "name":"Laptop",
            "screen":"17 inch"}
"z:name:Printer" = {    "id":"z", 
            "name":"Printer",
            "ppm":"20 pages"}
}

And similarly, in order to add the "screen" search parameter:

products
{
"x:screen:15 inch" = {  "id":"x",
            "name":"Laptop",
            "screen":"15 inch",
            "OS":"Windows"}
"y:screen:17 inch" = {  "id":"y", 
            "name":"Laptop",
            "screen":"17 inch"}
}

But if we want to query on 10 search parameters or any combination of them (as is the case in my application), then we would have to create 1023 copies of the column family [(2 to the power 10) - 1]. And since most of the rows will have many of the search parameters, this means that we need about 1000 times the storage to model the data in this way, which is not small, especially if we have 10,000,000 rows in the original CF.
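The figure of 1023 quoted above can be checked with a short sketch that enumerates every non-empty combination of search parameters. The function name `count_parameter_combinations` is made up for this illustration.

```python
from itertools import combinations


def count_parameter_combinations(n_params: int) -> int:
    """Count the non-empty subsets of n_params search parameters."""
    params = range(n_params)
    return sum(
        1
        for size in range(1, n_params + 1)   # subsets of size 1..n
        for _ in combinations(params, size)  # every subset of that size
    )


total = count_parameter_combinations(10)
print(total, 2 ** 10 - 1)  # both values are 1023
```

The enumeration agrees with the closed form 2^n - 1, so the number of copies doubles (plus one) with each additional search parameter.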

Is this the data model you suggested?


Another point: I don't quite see why creating secondary indexes would forfeit the schemaless model.

Solution

Cassandra is not a DB you can query by anything other than the row key, but you can tailor your data model to support those queries.

We do 175,000,000 queries a day on our 6-node Cassandra cluster (easy!), but we only ask for data using row keys and columns, because we have designed our data model to work that way. We do not use indexed queries.

To support richer queries we denormalize our data: the values we want to search by become part of the keys used to retrieve the data.

Example: suppose we save the following object:

obj {
   id : xxx //assuming id is a unique id across the system
   p1 : value1
   p2 : value2
}

If we know we want to search by any of those parameters, then we save a copy of obj under each of the following column names or keys:

"p1:value1:xxx"
"p2:value2:xxx"
"p1:value1:p2:value2:xxx" 
"xxx"

This way we can search for obj with p1 = value1, with p2 = value2, with p1 = value1 AND p2 = value2, or by just its unique id xxx.
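As a rough sketch of how those keys could be generated, the following builds one key per non-empty combination of searchable parameters, plus the bare id. The function name `lookup_keys` and the choice to sort parameter names are assumptions made for this illustration, not part of the answer.

```python
from itertools import combinations


def lookup_keys(obj_id, params):
    """Return every key under which a copy of the object would be saved.

    params maps each searchable parameter name to its value.
    """
    # Sort the names so that writers and readers always build identical keys.
    names = sorted(params)
    keys = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            parts = [f"{name}:{params[name]}" for name in combo]
            keys.append(":".join(parts + [obj_id]))
    keys.append(obj_id)  # lookup by the unique id alone
    return keys


keys = lookup_keys("xxx", {"p1": "value1", "p2": "value2"})
for key in keys:
    print(key)
```

Fixing the order of parameter names matters: a reader searching for p2 and p1 must build the same string as the writer did, or the lookup misses.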

The only other option, if you do not want to do that, is to use secondary indexes and indexed queries, but that would forfeit the "schema-less" requirement of your question.



EDIT - An example.

We want to save "Products" objects, defined as:

class Products{
    string uid;
    string name;
    int screen_size; //in inches
    string os;
    string brand;
}

We serialize it into a string or byte array (I tend to use Jackson JSON or Protobuf; both work very well with Cassandra and are super fast). We put that byte array into a column.

Now the important part: creating the column names and the row keys. Let's say we want to search by screen size and possibly filter by brand. We define buckets for the screen size as ["0_to_15", "16_to_21", "21_up"].
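A minimal sketch of the serialization step, using Python's standard `json` module as a stand-in for Jackson or Protobuf. The `Product` dataclass mirrors the Products class above; the helper names `to_column_value` and `from_column_value` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Product:
    uid: str
    name: str
    screen_size: int  # in inches
    os: str
    brand: str


def to_column_value(product: Product) -> bytes:
    """Serialize a product into the byte array that is stored in one column."""
    return json.dumps(asdict(product), sort_keys=True).encode("utf-8")


def from_column_value(raw: bytes) -> Product:
    """Rebuild the product from the bytes read back from a column."""
    return Product(**json.loads(raw.decode("utf-8")))


original = Product("MI615FMDO548", "SFG-0098", 15, "Android JellyBean", "Samsung")
raw = to_column_value(original)
restored = from_column_value(raw)
```

Because the whole object travels as one opaque value, every denormalized copy can be read back without a second lookup.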

Given the object:

{uid:"MI615FMDO548", name:"SFG-0098", screen_size:15, os:"Android JellyBean", brand:"Samsung"}

one copy gets saved with each of the following:
- key = "brand:Samsung" and column_name = "screen_size:15_uid:MI615FMDO548"
- key = "screen_size:0_to_15" and column_name = "brand:Samsung_uid:MI615FMDO548"

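Why do I add the uid to the column name? To make the column names unique per product. Without it, two products with the same screen size (or the same brand) would share a column name within a row and overwrite each other.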
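The key construction described above could be sketched as follows. The bucket labels come from the answer, but the exact boundaries in `screen_bucket` and the helper name `index_entries` are assumptions made for this illustration.

```python
def screen_bucket(inches: int) -> str:
    """Map a screen size to its bucket label (boundaries are assumed)."""
    if inches <= 15:
        return "0_to_15"
    if inches <= 21:
        return "16_to_21"
    return "21_up"


def index_entries(product: dict) -> list[tuple[str, str]]:
    """Return the (row_key, column_name) pairs under which a copy is saved."""
    uid = product["uid"]
    brand = product["brand"]
    size = product["screen_size"]
    return [
        # Row per brand, columns ordered by screen size.
        (f"brand:{brand}", f"screen_size:{size}_uid:{uid}"),
        # Row per screen-size bucket, columns ordered by brand.
        (f"screen_size:{screen_bucket(size)}", f"brand:{brand}_uid:{uid}"),
        # Row per product for direct lookup by uid.
        (f"uid:{uid}", "product"),
    ]


product = {
    "uid": "MI615FMDO548",
    "name": "SFG-0098",
    "screen_size": 15,
    "os": "Android JellyBean",
    "brand": "Samsung",
}
entries = index_entries(product)
```

Each pair names one place where the same serialized object is written, so a write fans out into three inserts.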


Example, part 2. Now let's say we have added:

"{uid:"MI615FMDO548", name:"SFG-0098", screen_size:15, os:"Android JellyBean", brand:"Samsung"}"
"{uid:"MI615FMD5589", name:"SFG-0097", screen_size:14, os:"Android JellyBean", brand:"Samsung"}"
"{uid:"MI615FMD1111", name:"SFG-0098", screen_size:17, os:"Android JellyBean", brand:"Samsung"}"
"{uid:"MI615FMDO687", name:"SFG-0095", screen_size:13, os:"Android JellyBean", brand:"Samsung"}"


We will end up with the following column family:

Products{
-Row:"brand:Samsung"
=> "screen_size:13_uid:MI615FMDO687":"{uid:"MI615FMDO687", name:"SFG-0095", screen_size:13, os:"Android JellyBean", brand:"Samsung"}"
=> "screen_size:14_uid:MI615FMD5589":"{uid:"MI615FMD5589", name:"SFG-0097", screen_size:14, os:"Android JellyBean", brand:"Samsung"}"
=> "screen_size:15_uid:MI615FMDO548":"{uid:"MI615FMDO548", name:"SFG-0098", screen_size:15, os:"Android JellyBean", brand:"Samsung"}"
=> "screen_size:17_uid:MI615FMD1111":"{uid:"MI615FMD1111", name:"SFG-0098", screen_size:17, os:"Android JellyBean", brand:"Samsung"}"
-Row:"screen_size:0_to_15"
=> "brand:Samsung_uid:MI615FMDO687":"{uid:"MI615FMDO687", name:"SFG-0095", screen_size:13, os:"Android JellyBean", brand:"Samsung"}"
=> "brand:Samsung_uid:MI615FMD5589":"{uid:"MI615FMD5589", name:"SFG-0097", screen_size:14, os:"Android JellyBean", brand:"Samsung"}"
=> "brand:Samsung_uid:MI615FMDO548":"{uid:"MI615FMDO548", name:"SFG-0098", screen_size:15, os:"Android JellyBean", brand:"Samsung"}"
-Row:"screen_size:16_to_21"
=> "brand:Samsung_uid:MI615FMD1111":"{uid:"MI615FMD1111", name:"SFG-0098", screen_size:17, os:"Android JellyBean", brand:"Samsung"}"
-Row:"uid:MI615FMDO687"
=> "product":"{uid:"MI615FMDO687", name:"SFG-0095", screen_size:13, os:"Android JellyBean", brand:"Samsung"}"
-Row:"uid:MI615FMD5589"
=> "product":"{uid:"MI615FMD5589", name:"SFG-0097", screen_size:14, os:"Android JellyBean", brand:"Samsung"}"
-Row:"uid:MI615FMDO548"
=> "product":"{uid:"MI615FMDO548", name:"SFG-0098", screen_size:15, os:"Android JellyBean", brand:"Samsung"}"
-Row:"uid:MI615FMD1111"
=> "product":"{uid:"MI615FMD1111", name:"SFG-0098", screen_size:17, os:"Android JellyBean", brand:"Samsung"}"
}

Now by using range queries across column names you can search by brand and by screen size.
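To illustrate the idea of a range query over column names, here is an in-memory stand-in for the "brand:Samsung" row. It is not a Cassandra client call: the row is modeled as a list of (column_name, value) pairs kept sorted by name, and `column_slice` is a hypothetical helper that returns the half-open range [start, end).

```python
from bisect import bisect_left

# The "brand:Samsung" row from the example, sorted by column name.
# Values are shortened to the uid for brevity.
row = sorted([
    ("screen_size:13_uid:MI615FMDO687", "MI615FMDO687"),
    ("screen_size:14_uid:MI615FMD5589", "MI615FMD5589"),
    ("screen_size:15_uid:MI615FMDO548", "MI615FMDO548"),
    ("screen_size:17_uid:MI615FMD1111", "MI615FMD1111"),
])


def column_slice(columns, start, end):
    """Return the columns whose names satisfy start <= name < end."""
    names = [name for name, _ in columns]
    lo = bisect_left(names, start)
    hi = bisect_left(names, end)
    return columns[lo:hi]


# Samsung products with a 14 or 15 inch screen.
# Names compare as strings, so sizes with differing digit counts
# (for example 9 and 15) would need zero-padding to sort correctly.
matches = column_slice(row, "screen_size:14", "screen_size:16")
for name, uid in matches:
    print(name, uid)
```

Since the columns in a row are stored in sorted order, a slice like this reads one contiguous run of columns rather than scanning the whole row.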



Hope this was useful.
