Database that can handle >500 million rows


Problem Description



I am looking for a database that can handle more than 500 million rows (create an index on a column in a reasonable time and return results for SELECT queries in under 3 seconds). Would PostgreSQL or MSSQL on a low-end machine (Core 2 CPU 6600, 4 GB RAM, 64-bit system, Windows Vista) handle such a large number of rows?

Update: In asking this question, I am looking for information on which database I should use on a low-end machine to return results for SELECT queries with one or two fields specified in the WHERE clause. No joins. I need to create indexes, and it cannot take ages as it does on MySQL, to achieve sufficient performance for my SELECT queries. This machine is a test PC used to perform an experiment.

The table schema:

    create table mapper (
        `key`   VARCHAR(1000),
        attr1   VARCHAR(100),
        attr2   INT,
        attr3   INT,   -- the post listed attr1 twice; renamed here so the DDL is valid
        value   VARCHAR(2000),
        PRIMARY KEY (`key`),
        INDEX (attr1),
        INDEX (attr2)
    );
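
For illustration, the kind of queries described in the update would look roughly like this (the literal values are hypothetical):

    -- one field specified in the WHERE clause, served by INDEX (attr1)
    SELECT `key`, value
    FROM mapper
    WHERE attr1 = 'some-value';

    -- two fields specified in the WHERE clause
    SELECT `key`, value
    FROM mapper
    WHERE attr1 = 'some-value'
      AND attr2 = 42;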

Solution

MSSQL can handle that many rows just fine. Query time depends on a lot more factors than simple row count.

For example, it's going to depend on:

  1. how many joins those queries do
  2. how well your indexes are set up (see the composite-index sketch after this list)
  3. how much RAM is in the machine
  4. speed and number of processors
  5. type and spindle speed of the hard drives
  6. size of the row / amount of data returned in the query
  7. network interface speed / latency
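
As a minimal sketch of point 2, assuming the mapper schema from the question: a composite index matching the two-field WHERE clause lets a single index seek satisfy the query, where two separate single-column indexes might not.

    -- Hypothetical composite index for queries that filter on attr1 AND attr2
    CREATE INDEX ix_mapper_attr1_attr2 ON mapper (attr1, attr2);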

It's very easy to have a small (less than 10,000 rows) table that takes a couple of minutes to execute a query against. For example: lots of joins, functions in the WHERE clause, and zero indexes, all on an Atom processor with 512 MB of total RAM. ;)
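
To make the "functions in the WHERE clause" point concrete, here is a hedged example against the question's mapper table: wrapping the indexed column in a function call makes the predicate non-sargable, so the index is ignored and the whole table is scanned.

    -- UPPER() on the column defeats INDEX (attr1): the engine falls back to a full scan
    SELECT `key`, value
    FROM mapper
    WHERE UPPER(attr1) = 'SOME-VALUE';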

It takes a bit more work to make sure all of your indexes and foreign-key relationships are good, that your queries are optimized to eliminate needless function calls, and that you only return the data you actually need. Also, you'll need fast hardware.
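
One way to check that a query really uses its index is to ask the engine for its plan; a minimal sketch in MySQL syntax (SQL Server offers execution plans and SET STATISTICS IO instead):

    -- Confirm that INDEX (attr1) is used rather than a full table scan
    EXPLAIN SELECT `key`, value FROM mapper WHERE attr1 = 'some-value';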

It all boils down to how much money you want to spend, the quality of the dev team, and the size of the data rows you are dealing with.

UPDATE: Updated due to changes in the question.

The amount of information here is still not enough to give a real-world answer. You are simply going to have to test it and adjust your database design and hardware as necessary.

For example, I could very easily have 1 billion rows in a table on a machine with those specs, run a "select top(1) id from tableA (nolock)" query, and get an answer in milliseconds. By the same token, you could execute a "select * from tablea" query and it would take a while: although the query itself executes quickly, transferring all of that data across the wire takes time.
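
Spelled out in T-SQL (tableA and id are the answer's placeholders, not columns from the question's schema):

    -- Fast: reads a single row; NOLOCK trades consistency for fewer locks
    SELECT TOP (1) id FROM tableA WITH (NOLOCK);

    -- The query itself is cheap, but streaming every row to the client dominates the time
    SELECT * FROM tableA;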

Point is, you have to test. That means setting up the server, creating some of your tables, and populating them. Then you have to go through performance tuning to get your queries and indexes right. As part of that tuning you're going to uncover not only how the queries need to be restructured but also exactly which parts of the machine might need to be replaced (i.e. disk, more RAM, CPU, etc.) based on the lock and wait types.
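
On SQL Server, for instance, those lock and wait types can be read from the wait-statistics DMV; a minimal sketch:

    -- The top wait types show where the server actually spends its time
    SELECT TOP (10) wait_type, wait_time_ms, waiting_tasks_count
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;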

I'd highly recommend you hire (or contract) one or two DBAs to do this for you.
