MySQL Data Model to Cassandra Help?


Problem description


I'm trying to move an RDBMS model over to Cassandra, and I'm having a hard time creating the schema. Here is my data model:

CREATE TABLE Domain (
    ID INT NOT NULL PRIMARY KEY,
    DomainName NVARCHAR(74) NOT NULL,
    HasBadWords BIT,
    ...
);
INSERT INTO Domain (DomainName, HasBadWords) VALUES ('domain1.com', 0);
INSERT INTO Domain (DomainName, HasBadWords) VALUES ('domain2.com', 0);

CREATE TABLE ZoneFile (
    ID INT NOT NULL PRIMARY KEY,
    DomainID INT NOT NULL,
    Available BIT NOT NULL,
    Nameservers NVARCHAR(MAX),
    Timestamp DATETIME NOT NULL
);
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (1, 0, 'ns1', '2010-01-01');
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (2, 0, 'ns1', '2010-01-01');
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (1, 1, 'ns2', '2011-01-01');
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (2, 1, 'ns2', '2011-01-01');

CREATE TABLE Backlinks (
    ID INT NOT NULL PRIMARY KEY,
    DomainID INT NOT NULL,
    Backlinks INT NOT NULL,
    Indexed INT NOT NULL,
    Timestamp DATETIME NOT NULL
);
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (1, 100, 200, '2010-01-01');
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (2, 300, 600, '2010-01-01');
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (1, 500, 1000, '2010-01-01');
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (2, 600, 1200, '2010-01-01');

From this, I've deduced that I can probably have one keyspace: DomainData. In this keyspace, I can have a column family called "Domain", which is like my Domain table in SQL:

"Domain" : { //ColumnFamily
    "domain1.com" : { "HasBadWords" : 0 }, //SuperColumn
    "domain2.com" : { "HasBadWords" : 0 }  //SuperColumn
}

The next tables are where I start getting confused. ZoneFile and Backlinks are essentially supposed to store a history of results from looking up these values for each domain. So, one Domain to many ZoneFile records. For querying purposes, I want to be able to easily get the 'newest' ZoneFile record for a given Domain. I will need to do the same for Backlinks.

I was considering something like this, and doing a range lookup on the key for the domain, and then getting the 'last' record which should be the newest timestamp...

"ZoneFiles" : { //ColumnFamily
    "domain1.com:2010-01-01 12:00:00.000" : { "Available" : 0, "Nameservers" : "ns1" }, //SuperColumn
    "domain1.com:2011-01-01 12:00:00.000" : { "Available" : 1, "Nameservers" : "ns2" }, //SuperColumn
    "domain2.com:2010-01-01 12:00:00.000" : { "Available" : 0, "Nameservers" : "ns1" }, //SuperColumn
    "domain2.com:2011-01-01 12:00:00.000" : { "Available" : 1, "Nameservers" : "ns2" }  //SuperColumn
}

I'm not convinced this is the right answer; combining a string domain and a string datetime in one key feels wrong. Could someone point me in the right direction?

EDIT:

Assuming I use:

"ZoneFiles" : {
  "domain1.com" : {
    timestamp1 : "{\"available\":1,\"nameservers\":\"ns1\"}",
    timestamp2 : "{\"available\":1,\"nameservers\":\"ns1\"}",
  }
}

How would I query a list of domain rows where the newest timestamp is older than a given date?

Solution

If I understand your question correctly, the only query you want to do on this model is "please get me the latest zonefile or backlinks for a given domain"?

If that's the case, I would store the latest values for these in the "Domain" column family, under the domain's row key, in separate columns. I would also store when this latest value was updated (the timestamp). Every time you get new values for the info in zonefile and backlinks, I would just overwrite the value in the "Domain" column family and update the timestamp.
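As a rough illustration of this "overwrite the latest value" idea, here is a minimal Python sketch that uses a plain dictionary as a stand-in for the "Domain" column family. The column names (`zonefile_available`, `zonefile_nameservers`, `zonefile_updated`) and the helper functions are hypothetical, chosen only for this example; a real client would issue the equivalent insert and read calls against Cassandra.

```python
from datetime import datetime

# In-memory stand-in for the "Domain" column family:
# row key (domain name) -> {column name -> value}.
domain_cf = {}

def record_zonefile(domain, available, nameservers, checked_at):
    """Overwrite the 'latest zonefile' columns in the domain's row."""
    row = domain_cf.setdefault(domain, {})
    row["zonefile_available"] = available        # hypothetical column name
    row["zonefile_nameservers"] = nameservers    # hypothetical column name
    row["zonefile_updated"] = checked_at         # when the latest value was written

def latest_zonefile(domain):
    """Read the latest zonefile values with a single row lookup."""
    row = domain_cf[domain]
    return {
        "available": row["zonefile_available"],
        "nameservers": row["zonefile_nameservers"],
        "updated": row["zonefile_updated"],
    }

# Two lookups for the same domain: the second simply replaces the first.
record_zonefile("domain1.com", 0, "ns1", datetime(2010, 1, 1))
record_zonefile("domain1.com", 1, "ns2", datetime(2011, 1, 1))
print(latest_zonefile("domain1.com"))
```

The point of the design is that "give me the newest record" becomes a lookup by row key, with no range scan and no sorting at read time.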

I assume you are also keeping this historical data so you can query it, and I assume the kind of query will be "show me all the updates for a given domain between two times" (is this correct?). If so, I wouldn't manually construct a composite row key like that, since it will require you to use the Order Preserving Partitioner to get the correct results from get_range_slices. And as you probably know, load balancing with the OPP can be a difficult task.
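To see why composite row keys and the Random Partitioner do not mix well, the following small Python sketch compares lexical key order with the order you get when sorting by an MD5-derived token. It is only an approximation for illustration: the `token` helper is hypothetical and mimics the idea of hashing the row key, not Cassandra's exact token computation.

```python
import hashlib

# Composite row keys as proposed in the question: "<domain>:<timestamp>".
keys = [
    "domain1.com:2010-01-01 12:00:00.000",
    "domain1.com:2011-01-01 12:00:00.000",
    "domain2.com:2010-01-01 12:00:00.000",
    "domain2.com:2011-01-01 12:00:00.000",
]

def token(row_key):
    """Hypothetical helper: an MD5-based token in the spirit of the Random Partitioner."""
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)

# Order an order-preserving partitioner would give a range scan over row keys.
by_key = sorted(keys)

# Order a hash-based partitioner would give: rows are placed by token, not by key.
by_token = sorted(keys, key=token)

print("key order:  ", by_key)
print("token order:", by_token)
```

Because rows are placed by token, a range scan over row keys comes back in token order, so a key prefix such as `domain1.com:` no longer maps to a contiguous range of rows.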

Instead, I would have the row key be the domain id, and the column key be the timestamp of the update. Then you can either pack your updates into a single value (e.g. using JSON), use super columns, or use the new composite keys in 0.8. If done like this, you can use a get_slice to satisfy your query, and it will behave correctly with the Random Partitioner, making load balancing much easier.
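A minimal Python sketch of that layout, assuming the JSON-packed variant: one row per domain, one column per update, with the column name being the timestamp. The dictionary and the `insert` and `get_slice` functions below are in-memory stand-ins written for this example, not the real client API. Cassandra keeps the columns of a row sorted by name, whereas this sketch sorts at read time.

```python
import json
from datetime import datetime

# Stand-in for the "ZoneFiles" column family:
# row key (domain) -> {column name (timestamp) -> JSON-packed update}.
zonefiles_cf = {}

def insert(domain, ts, update):
    """Add one column to the domain's row: name = timestamp, value = JSON."""
    zonefiles_cf.setdefault(domain, {})[ts] = json.dumps(update)

def get_slice(domain, start, finish):
    """Return the columns of one row whose names fall in [start, finish], oldest first."""
    columns = zonefiles_cf.get(domain, {})
    return [
        (ts, json.loads(columns[ts]))
        for ts in sorted(columns)
        if start <= ts <= finish
    ]

insert("domain1.com", datetime(2010, 1, 1), {"available": 0, "nameservers": "ns1"})
insert("domain1.com", datetime(2011, 1, 1), {"available": 1, "nameservers": "ns2"})
insert("domain2.com", datetime(2010, 1, 1), {"available": 0, "nameservers": "ns1"})

# "Show me all the updates for domain1.com between two times."
history = get_slice("domain1.com", datetime(2010, 6, 1), datetime(2011, 6, 1))
print(history)
```

Since the query only touches a single row, it does not depend on how row keys are ordered across the cluster, which is why it works with the Random Partitioner.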

Tom Wilkie | Acunu | www.acunu.com | @tom_wilkie

Reply to comment: "how would I query a list of domains whose most recent zonefile timestamp column is older than a given timestamp?"

You could do that by inserting into another column family:

row key: day (or hour, or some other reasonable 'bucketing') 
column key: timestamp of update 
value: domain

...every time you update the zonefile. Then, to get the most recently updated domains since t, do:

result = []
for i in day(t) ... day(now):
    result.extend(get_slice(i, range(t, '')))

This would require you to remove repeat entries from the result, so it works best when t is fairly recent. You also have to consider load balancing for the writes: this scheme focuses all the write load on a single server, since at any one time you are inserting into only one row.
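The bucketing scheme and the de-duplication step can be sketched in runnable Python as follows. This is an in-memory approximation under stated assumptions: `updates_by_day`, `record_update`, `get_slice`, and `domains_updated_since` are hypothetical names for this example, and the bucket size is fixed at one day.

```python
from datetime import datetime, timedelta

# Stand-in for the index column family:
# row key (day bucket) -> {column name (update timestamp) -> domain}.
updates_by_day = {}

def record_update(domain, ts):
    """Call this every time a zonefile is updated.

    Note: two updates with the exact same timestamp would overwrite each
    other here; a real schema would need a unique column name.
    """
    updates_by_day.setdefault(ts.date(), {})[ts] = domain

def get_slice(day, start):
    """Domains in one bucket row with update timestamp >= start, oldest first."""
    columns = updates_by_day.get(day, {})
    return [columns[ts] for ts in sorted(columns) if ts >= start]

def domains_updated_since(t, now):
    """Walk every day bucket from day(t) to day(now) and merge the slices."""
    result = []
    day = t.date()
    while day <= now.date():
        result.extend(get_slice(day, t))
        day += timedelta(days=1)
    # Remove repeat entries while keeping first-seen order.
    return list(dict.fromkeys(result))

record_update("domain3.com", datetime(2010, 12, 30, 9, 0))
record_update("domain1.com", datetime(2011, 1, 1, 9, 0))
record_update("domain2.com", datetime(2011, 1, 2, 10, 0))
record_update("domain1.com", datetime(2011, 1, 3, 8, 0))

recent = domains_updated_since(datetime(2011, 1, 1), datetime(2011, 1, 3, 12, 0))
print(recent)
```

The number of rows read grows with the number of buckets between t and now, which is another reason this approach suits recent values of t.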

If these trade-offs aren't appropriate, then you could look at the Hadoop integration and use that to perform this query. Or you could make other trade-offs: use the OPP, or do a read before each write to remove the duplicates, which would be very slow.
