Design issues with large amounts of data


Problem description


I'm tinkering around with a data collection system, and have come up
with a very hackish way to store my data. For reference, I'm
anticipating collecting at least 100 million different dataId
whatevers per year, possibly much more.

---366 data tables (one for each day of the year), each row being
assigned a unique DataId (unique across all 366 tables too)
---100 data_map tables, table 0 having all DataIds ending in 00, table
99 having all DataIds ending in 99, and so on.

This is mostly because a friend of mine who works with MySQL said it
is very slow to index large tables, even if you work with mostly
integers.
However, I've read MySQL can handle millions of rows with no problem,
so it seems my basic design is overly complicated and will lead to
tons of slowdowns thanks to all the joins.
Another friend of mine suggested using file partitioning (though he
uses MSSQL), so is that another option?

Any advice?
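For context, the 366-table-plus-100-table layout above could collapse into a single table. A minimal sketch of what that might look like in MySQL follows; the table and column names (collected_data, payload, and so on) are hypothetical, not from the original post:

    -- One table replaces the 366 daily tables and the 100 data_map
    -- tables; the date column carries the per-day information.
    CREATE TABLE collected_data (
        data_id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        collected_at DATETIME        NOT NULL,
        payload      VARBINARY(255)  NOT NULL,  -- stand-in for the collected data
        PRIMARY KEY (data_id),
        KEY idx_collected_at (collected_at)     -- supports date-range queries
    ) ENGINE=InnoDB;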

Recommended answers


On Thu, 3 Jul 2008 18:55:08 -0700 (PDT), nflacco
<ma*********@gmail.com> wrote:

A table per day is a REALLY BAD IDEA.
Remember that indexed lookups are very efficient: 1 record out of 4
billion can be found using 32 comparisons.
Sure, indexing that table (if there was no index before) might take a
while, but it doesn't have to be done more than once.

Why don't you run some performance tests on realistic hardware with a
realistic data set?

-Tom.
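To unpack Tom's arithmetic: each comparison in a binary search halves the candidate set, and 2^32 is about 4.3 billion, so 32 comparisons suffice to locate one key among 4 billion; a B-tree index is shallower still in practice, since each node holds many keys. Against the hypothetical collected_data table sketched above, a keyed lookup would be a handful of index-node reads rather than a scan:

    -- EXPLAIN should show MySQL resolving this via the primary key
    -- (type: const), not a full table scan.
    EXPLAIN SELECT payload
    FROM collected_data
    WHERE data_id = 123456789;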




> Another friend of mine suggested using file partitioning (though he
> uses MSSQL), so is that another option?




Partitioning is good for managing very large tables because you can rebuild
individual partition indexes without touching the entire table. This
reduces rebuild time and intermediate space requirements. Be aware that the
partitioning feature is available only in Enterprise and Developer editions.

With a good indexing strategy, response time should ideally be proportional
to the amount of data retrieved (barring cached data) regardless of whether
or not partitioning is used. Partitioning by date can facilitate certain
processes, like incremental data loads and purge/archival as well as certain
types of queries. However, with or without partitioning, indexing is the
key from a performance perspective.

--
Hope this helps.

Dan Guzman
SQL Server MVP
http://weblogs.sqlteam.com/dang/

"nflacco" <ma*********@gmail.comwrote in message
news:9e**********************************@d19g2000 prm.googlegroups.com...
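As a rough sketch of the date partitioning Dan describes (SQL Server 2005 Enterprise/Developer T-SQL; all object names and boundary dates here are hypothetical):

    -- Boundary dates split the data into monthly ranges (examples only).
    CREATE PARTITION FUNCTION pf_monthly (datetime)
    AS RANGE RIGHT FOR VALUES ('20080101', '20080201', '20080301');

    -- For simplicity, map every partition to the PRIMARY filegroup.
    CREATE PARTITION SCHEME ps_monthly
    AS PARTITION pf_monthly ALL TO ([PRIMARY]);

    -- The table is created on the scheme, partitioned by the date column.
    CREATE TABLE dbo.CollectedData (
        DataId      bigint         NOT NULL,
        CollectedAt datetime       NOT NULL,
        Payload     varbinary(255) NOT NULL
    ) ON ps_monthly (CollectedAt);

With a layout like this, the purge/archival Dan mentions can become a fast metadata-only partition switch instead of a mass DELETE.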




On Jul 3, 8:14 pm, "Dan Guzman" <guzma...@nospam-online.sbcglobal.net>
wrote:

> <snip>
> However, with or without partitioning, indexing is the
> key from a performance perspective.



The re-indexing is what worries me. I'll be constantly adding new data
to the main table (formerly the 366 day tables) if we follow the
don't-use-too-many-tables scheme, as well as to the processed-data
tables.
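For what it's worth, ordinary inserts don't rebuild the index wholesale; the B-tree absorbs each new row as it arrives, and the one-time cost Tom mentions applies to creating an index on an already-populated table. If the table is partitioned by date as Dan suggests, heavier maintenance can also be confined to the partition receiving new data, along these lines (T-SQL; the index name and partition number are hypothetical):

    -- Rebuild only the active partition's index, leaving older,
    -- stable partitions untouched.
    ALTER INDEX ix_CollectedData_CollectedAt ON dbo.CollectedData
    REBUILD PARTITION = 3;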

