MySQL and NoSQL: Help me to choose the right one


Question


There is a big database table with 1,000,000,000 rows, called threads (these threads actually exist; I'm not making things harder just because I enjoy it). The table has only a few columns, to keep things fast: (int id, string hash, int replycount, int dateline (timestamp), int forumid, string title)

Query:

select * from thread where forumid = 100 and replycount > 1 order by dateline desc limit 10000, 100

Since there are 1G records, it's quite a slow query. So I thought, let's split these 1G records into as many tables as I have forums (categories)! That is almost perfect. With many tables I have fewer records to search through, and it really is faster. The query now becomes:

select * from thread_{forum_id} where replycount > 1 order by dateline desc limit 10000, 100

This is much faster for 99% of the forums (categories), since most of them have only a few topics (100k-1M). However, because some have about 10M records, some queries are still too slow (0.1 to 0.2 seconds, too much for my app! I'm already using indexes!).

I don't know how to improve this using MySQL. Is there a way?

For this project I will use 10 servers (12GB RAM, 4x7200rpm hard disks in software RAID 10, quad core).

The idea was to simply split the databases among the servers, but with the problem explained above that is still not enough.

If I install Cassandra on these 10 servers (supposing I find the time to make it work as it is supposed to), should I expect a performance boost?

What should I do? Keep working with MySQL with a distributed database on multiple machines, or build a Cassandra cluster?

I was asked to post the indexes; here they are:

mysql> show index in thread;
PRIMARY id
forumid
dateline
replycount

Select explain:

mysql> explain SELECT * FROM thread WHERE forumid = 655 AND visible = 1 AND open <> 10 ORDER BY dateline ASC LIMIT 268000, 250;
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
| id | select_type | table  | type | possible_keys | key     | key_len | ref         | rows   | Extra                       |
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
|  1 | SIMPLE      | thread | ref  | forumid       | forumid | 4       | const,const | 221575 | Using where; Using filesort | 
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+

Answer

You should read the following and learn a little bit about the advantages of a well-designed InnoDB table and how best to use clustered indexes - only available with InnoDB!

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/

Then design your system along the lines of the following simplified example:

Example schema (simplified)

The important features are that the tables use the InnoDB engine and that the primary key for the threads table is no longer a single auto-incrementing key but a composite clustered key based on a combination of forum_id and thread_id, e.g.

threads - primary key (forum_id, thread_id)

forum_id    thread_id
========    =========
1                   1
1                   2
1                   3
1                 ...
1             2058300  
2                   1
2                   2
2                   3
2                  ...
2              2352141
...

Each forum row includes a counter called next_thread_id (unsigned int) which is maintained by a trigger and increments every time a thread is added to a given forum. This also means we can store 4 billion threads per forum, rather than the 4 billion threads in total we would be limited to with a single auto_increment primary key on thread_id.

forum_id    title   next_thread_id
========    =====   ==============
1          forum 1        2058300
2          forum 2        2352141
3          forum 3        2482805
4          forum 4        3740957
...
64        forum 64       3243097
65        forum 65      15000000 -- ooh a big one
66        forum 66       5038900
67        forum 67       4449764
...
247      forum 247            0 -- still loading data for half the forums !
248      forum 248            0
249      forum 249            0
250      forum 250            0

The disadvantage of using a composite key is that you can no longer just select a thread by a single key value as follows:

select * from threads where thread_id = y;

you have to do:

select * from threads where forum_id = x and thread_id = y;

However, your application code should be aware of which forum a user is browsing so it's not exactly difficult to implement - store the currently viewed forum_id in a session variable or hidden form field etc...
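To make the trade-off concrete, here is a small Python sketch. It uses a SQLite WITHOUT ROWID table as a stand-in for an InnoDB clustered primary key (that substitution is an assumption made only so the example runs anywhere; the two engines are not identical), and asks the query planner how it would run each lookup. The `plan` helper is a made-up name for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL/InnoDB. A WITHOUT ROWID table is
# stored in primary key order, which is similar in spirit to a clustered index.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table threads (
        forum_id    integer not null,
        thread_id   integer not null,
        reply_count integer not null default 0,
        primary key (forum_id, thread_id, reply_count)
    ) without rowid
    """
)
conn.executemany(
    "insert into threads (forum_id, thread_id) values (?, ?)",
    [(f, t) for f in (1, 2, 3) for t in range(1, 1001)],
)


def plan(sql, params=()):
    """Hypothetical helper: join the 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("explain query plan " + sql, params).fetchall()
    return "; ".join(row[-1] for row in rows)


# Lookup with the leading key column: the planner can seek into the key.
both_keys = plan(
    "select * from threads where forum_id = ? and thread_id = ?", (2, 500)
)
# Lookup without the leading key column: the planner has to scan.
thread_only = plan("select * from threads where thread_id = ?", (500,))

print(both_keys)
print(thread_only)
```

The exact wording of the plan text varies between SQLite versions, and MySQL's EXPLAIN output looks different again, but the pattern should carry over: supplying forum_id lets the engine seek straight to the row, while thread_id alone does not match the leading column of the key.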

Here's the simplified schema:

drop table if exists forums;
create table forums
(
forum_id smallint unsigned not null auto_increment primary key,
title varchar(255) unique not null,
next_thread_id int unsigned not null default 0 -- count of threads in each forum
)engine=innodb;


drop table if exists threads;
create table threads
(
forum_id smallint unsigned not null,
thread_id int unsigned not null default 0,
reply_count int unsigned not null default 0,
hash char(32) not null,
created_date datetime not null,
primary key (forum_id, thread_id, reply_count) -- composite clustered index
)engine=innodb;

delimiter #

create trigger threads_before_ins_trig before insert on threads
for each row
begin
declare v_id int unsigned default 0;

  select next_thread_id + 1 into v_id from forums where forum_id = new.forum_id;
  set new.thread_id = v_id;
  update forums set next_thread_id = v_id where forum_id = new.forum_id;
end#

delimiter ;

You may have noticed I've included reply_count as part of the primary key, which is a bit strange as the (forum_id, thread_id) composite is unique in itself. This is just an index optimisation which saves some I/O when queries that use reply_count are executed. Please refer to the 2 links above for further info on this.
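The per-forum counter maintained by the trigger in the schema above can be sketched in Python as follows. This is not the trigger itself: SQLite (used here as a stand-in, an assumption for runnability) cannot assign to new.thread_id in a trigger, so the same bookkeeping is done in application code inside one transaction. Note one deliberate difference: this sketch increments the counter first and then reads it back, whereas the trigger reads first and updates afterwards. The `add_thread` function name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table forums (
        forum_id       integer primary key,
        title          text unique not null,
        next_thread_id integer not null default 0  -- count of threads per forum
    );
    create table threads (
        forum_id    integer not null,
        thread_id   integer not null,
        reply_count integer not null default 0,
        hash        text not null,
        primary key (forum_id, thread_id, reply_count)
    ) without rowid;
    insert into forums (forum_id, title) values (1, 'forum 1'), (2, 'forum 2');
    """
)


def add_thread(forum_id, thread_hash):
    """Hypothetical helper: allocate the next per-forum thread_id and insert."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "update forums set next_thread_id = next_thread_id + 1 "
            "where forum_id = ?",
            (forum_id,),
        )
        (thread_id,) = conn.execute(
            "select next_thread_id from forums where forum_id = ?", (forum_id,)
        ).fetchone()
        conn.execute(
            "insert into threads (forum_id, thread_id, hash) values (?, ?, ?)",
            (forum_id, thread_id, thread_hash),
        )
    return thread_id


ids = [add_thread(1, "a"), add_thread(1, "b"), add_thread(2, "c")]
(counter_total,) = conn.execute(
    "select sum(next_thread_id) from forums"
).fetchone()
(row_total,) = conn.execute("select count(*) from threads").fetchone()
print(ids, counter_total, row_total)
```

Each forum numbers its threads independently, and as long as threads are only ever added through this path (and never deleted), the sum of the counters matches the row count without touching the threads table.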

Example queries

I'm still loading data into my example tables and so far I have loaded approx. 500 million rows (half as many as your system). When the load process is complete I should expect to have approx:

250 forums * 5 million threads = 1,250,000,000 (1.25 billion rows)

I've deliberately made some of the forums contain more than 5 million threads; for example, forum 65 has 15 million threads:

forum_id    title   next_thread_id
========    =====   ==============
65        forum 65      15000000 -- ooh a big one

Query runtimes

select sum(next_thread_id) from forums;

sum(next_thread_id)
===================
539,155,433 (500 million threads so far and still growing...)

Under InnoDB, summing the next_thread_id values to give a total thread count is much faster than the usual:

select count(*) from threads;

How many threads does forum 65 have?

select next_thread_id from forums where forum_id = 65

next_thread_id
==============
15,000,000 (15 million)

Again, this is faster than the usual:

select count(*) from threads where forum_id = 65

Ok now we know we have about 500 million threads so far and forum 65 has 15 million threads - let's see how the schema performs :)

select forum_id, thread_id from threads where forum_id = 65 and reply_count > 64 order by thread_id desc limit 32;

runtime = 0.022 secs

select forum_id, thread_id from threads where forum_id = 65 and reply_count > 1 order by thread_id desc limit 10000, 100;

runtime = 0.027 secs

Looks pretty performant to me - that's a single table with 500+ million rows (and growing), with a query over a forum of 15 million rows returning in about 0.02 seconds (while under load!)
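One plausible reason these queries avoid the "Using filesort" seen in the question's EXPLAIN is that ordering by thread_id within a single forum_id follows the order of the clustered key. The sketch below illustrates that idea in Python, again assuming a SQLite WITHOUT ROWID table as a stand-in for InnoDB; the data, the `plan` helper, and the comparison query ordered by created_date are all made up for illustration and say nothing about real timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table threads (
        forum_id     integer not null,
        thread_id    integer not null,
        reply_count  integer not null default 0,
        created_date text not null,
        primary key (forum_id, thread_id, reply_count)
    ) without rowid
    """
)
# Synthetic rows for one forum; reply_count cycles through 0..4.
conn.executemany(
    "insert into threads values (?, ?, ?, ?)",
    [(65, t, t % 5, f"2010-01-{1 + t % 28:02d}") for t in range(1, 20001)],
)


def plan(sql):
    """Hypothetical helper: join the 'detail' column of EXPLAIN QUERY PLAN."""
    return "; ".join(
        row[-1] for row in conn.execute("explain query plan " + sql)
    )


base = (
    "select forum_id, thread_id from threads "
    "where forum_id = 65 and reply_count > 1 "
)
by_key = base + "order by thread_id desc limit 100 offset 10000"
by_date = base + "order by created_date desc limit 100 offset 10000"

key_plan = plan(by_key)    # walks the primary key backwards, no sort step
date_plan = plan(by_date)  # needs a separate sort step for the ORDER BY
page = conn.execute(by_key).fetchall()

print(key_plan)
print(date_plan)
print(len(page))
```

The engine still has to step over the skipped offset rows, so a very large offset is not free, but it can do so in key order without first collecting and sorting every matching row.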

Further optimisations

These would include:

  • partitioning by range

  • sharding

  • throwing money and hardware at it

etc...
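As a rough illustration of the first two options, the Python sketch below routes a forum_id either to a range partition or to one of the 10 servers mentioned in the question. The host names, the range boundaries, and both function names are hypothetical; a real deployment would pick them from its own data distribution.

```python
from bisect import bisect_right

# Hypothetical host names for the 10 servers mentioned in the question.
SHARDS = [f"db{n:02d}.example.internal" for n in range(10)]

# Hypothetical exclusive upper bounds of forum_id ranges, in the spirit of
# "values less than (64)", "values less than (128)", and so on.
RANGE_BOUNDS = [64, 128, 192, 256]


def shard_for(forum_id: int) -> str:
    """Sharding sketch: pick a server by forum_id modulo the server count."""
    return SHARDS[forum_id % len(SHARDS)]


def partition_for(forum_id: int) -> int:
    """Range partitioning sketch: index of the first bound above forum_id."""
    index = bisect_right(RANGE_BOUNDS, forum_id)
    if index == len(RANGE_BOUNDS):
        raise ValueError(f"forum_id {forum_id} is above the last range bound")
    return index


print(shard_for(65))
print(partition_for(65))
```

Because every query in this design already supplies forum_id, either scheme can route a request using that value alone. Modulo routing is simple but moves most forums when the server count changes, so a lookup table from forum_id to server may be preferable if very large forums need to be placed by hand.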

Hope you find this answer helpful :)
