Database architecture for millions of new rows per day


Question

I need to implement a custom-developed web analytics service for a large number of websites. The key entities here are:


  • Websites

  • Visitors

Each unique visitor will have a single row in the database with information like landing page, time of day, OS, browser, referrer, IP, etc.

I will need to do aggregated queries on this database such as 'COUNT all visitors who have Windows as OS and came from Bing.com'

I have hundreds of websites to track, and the number of visitors for those websites ranges from a few hundred a day to a few million a day. In total, I expect this database to grow by about a million rows per day.

My questions are:

1) Is MySQL a good database for this purpose?

2) What would be a good architecture? I am thinking of creating a new table for each website. Or perhaps start with a single table and then spawn a new table (daily) if the number of rows in an existing table exceeds 1 million (is my assumption correct?). My only worry is that if a table grows too big, the SQL queries can get dramatically slower. So, what is the maximum number of rows I should store per table? Moreover, is there a limit on the number of tables that MySQL can handle?

3) Is it advisable to do aggregate queries over millions of rows? I'm ready to wait for a couple of seconds to get results for such queries. Is it a good practice or is there any other way to do aggregate queries?

In short, I am designing a large-scale, data-warehouse-type setup that will be write-heavy.

Recommended answer

If you're talking about larger volumes of data, then look at MySQL partitioning. For these tables, partitioning by date/time would certainly help performance. There's a decent article about partitioning here.
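As a rough illustration of date-based partitioning, the sketch below uses Python to generate MySQL DDL for a table range-partitioned by month. The table name `visits`, its columns, and the monthly granularity are assumptions made for this example and do not come from the answer; the generated statement would still need to be run against a real MySQL server.

```python
from datetime import date


def next_month(d: date) -> date:
    """Return the first day of the month after d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def monthly_partition_ddl(table: str, start: date, months: int) -> str:
    """Build MySQL DDL for a table range-partitioned by month on visited_at."""
    parts = []
    current = date(start.year, start.month, 1)
    for _ in range(months):
        upper = next_month(current)
        # Each partition holds rows with visited_at strictly before `upper`.
        parts.append(
            f"  PARTITION p{current:%Y%m} "
            f"VALUES LESS THAN (TO_DAYS('{upper.isoformat()}'))"
        )
        current = upper
    # Catch-all partition so rows dated after the last month are still accepted.
    parts.append("  PARTITION pmax VALUES LESS THAN MAXVALUE")
    return (
        f"CREATE TABLE {table} (\n"
        "  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,\n"
        "  site_id INT UNSIGNED NOT NULL,\n"
        "  visited_at DATETIME NOT NULL,\n"
        "  os VARCHAR(32),\n"
        "  referrer_host VARCHAR(255),\n"
        "  -- MySQL requires the partitioning column in every unique key.\n"
        "  PRIMARY KEY (id, visited_at)\n"
        ")\n"
        "PARTITION BY RANGE (TO_DAYS(visited_at)) (\n"
        + ",\n".join(parts)
        + "\n);"
    )


# Hypothetical usage: three monthly partitions starting November 2024.
print(monthly_partition_ddl("visits", date(2024, 11, 1), 3))
```

With a layout like this, queries that filter on `visited_at` can be limited to the matching partitions, and old months can be dropped a partition at a time rather than deleted row by row.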

Look at creating two separate databases: one for all raw data, tuned for writes with minimal indexing; and a second for reporting, holding the aggregated values. Keep the reporting database up to date either with a batch process that reads from the raw-data database, or with replication.
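A minimal sketch of the batch-process variant follows. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server; the attached `reporting` database, the table names `visits` and `daily_counts`, and the helper functions are all hypothetical names chosen for this example.

```python
import sqlite3

# The main connection stands in for the raw-data database;
# the attached in-memory database stands in for the reporting database.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS reporting")

# Raw table: no indexes, so inserts stay cheap.
conn.execute(
    "CREATE TABLE visits (site_id INTEGER, visited_at TEXT, os TEXT, referrer_host TEXT)"
)
# Reporting table: one row per (day, site, os, referrer) with a precomputed count.
conn.execute(
    """CREATE TABLE reporting.daily_counts (
           day TEXT, site_id INTEGER, os TEXT, referrer_host TEXT, visitors INTEGER,
           PRIMARY KEY (day, site_id, os, referrer_host))"""
)


def rollup_day(conn, day):
    """Batch step: rebuild one day's aggregates from the raw rows."""
    with conn:  # one transaction, so readers never see a half-built day
        conn.execute("DELETE FROM reporting.daily_counts WHERE day = ?", (day,))
        conn.execute(
            """INSERT INTO reporting.daily_counts
               SELECT date(visited_at), site_id, os, referrer_host, COUNT(*)
               FROM visits
               WHERE date(visited_at) = ?
               GROUP BY date(visited_at), site_id, os, referrer_host""",
            (day,),
        )


def count_visitors(conn, os_name, referrer_host):
    """Report query: reads only the small aggregate table, never the raw rows."""
    row = conn.execute(
        """SELECT COALESCE(SUM(visitors), 0) FROM reporting.daily_counts
           WHERE os = ? AND referrer_host = ?""",
        (os_name, referrer_host),
    ).fetchone()
    return row[0]


# Made-up sample rows for illustration only.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?)",
    [
        (1, "2024-03-01 08:15:00", "Windows", "bing.com"),
        (1, "2024-03-01 09:30:00", "Windows", "bing.com"),
        (1, "2024-03-01 10:00:00", "Linux", "google.com"),
        (2, "2024-03-01 11:45:00", "Windows", "bing.com"),
    ],
)
rollup_day(conn, "2024-03-01")
print(count_visitors(conn, "Windows", "bing.com"))
```

Deleting and rebuilding a whole day makes the batch step safe to re-run after a failure, at the cost of rescanning that day's raw rows.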

EDIT

If you want to be really clever with your aggregation reports, create a set of aggregation tables ("today", "week to date", "month to date", "by year"). Aggregate from raw data to "today" either daily or in "real time"; aggregate from "by day" to "week to date" on a nightly basis; from "week to date" to "month to date" on a weekly basis, etc. When executing queries, combine (UNION) the appropriate tables for the date ranges you're interested in.
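The tiered tables can be sketched with just two levels. The example below again uses `sqlite3` as a stand-in for MySQL; the table names `agg_today` and `agg_week_to_date`, their columns, and the sample counts are assumptions for illustration. It also assumes that "week to date" covers completed days only, so the two tables never hold the same visit twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("agg_today", "agg_week_to_date"):
    conn.execute(
        f"CREATE TABLE {table} "
        "(site_id INTEGER, os TEXT, referrer_host TEXT, visitors INTEGER)"
    )

# Made-up counts: completed days of this week, and today so far.
conn.executemany(
    "INSERT INTO agg_week_to_date VALUES (?, ?, ?, ?)",
    [(1, "Windows", "bing.com", 120), (1, "Linux", "google.com", 40)],
)
conn.executemany(
    "INSERT INTO agg_today VALUES (?, ?, ?, ?)",
    [(1, "Windows", "bing.com", 15), (1, "macOS", "bing.com", 7)],
)


def visitors_this_week(conn, site_id, os_name, referrer_host):
    """Combine the completed-days table with today's table and sum the counts."""
    key = (site_id, os_name, referrer_host)
    row = conn.execute(
        """SELECT COALESCE(SUM(visitors), 0) FROM (
               SELECT visitors FROM agg_week_to_date
               WHERE site_id = ? AND os = ? AND referrer_host = ?
               UNION ALL
               SELECT visitors FROM agg_today
               WHERE site_id = ? AND os = ? AND referrer_host = ?
           )""",
        key * 2,
    ).fetchone()
    return row[0]


def fold_today_into_week(conn):
    """Nightly step: move today's counts into the week-to-date table."""
    with conn:
        conn.execute("INSERT INTO agg_week_to_date SELECT * FROM agg_today")
        conn.execute("DELETE FROM agg_today")


print(visitors_this_week(conn, 1, "Windows", "bing.com"))
```

The sketch uses `UNION ALL` rather than plain `UNION` because `UNION` removes duplicate rows, which would silently drop a count whenever two tiers happened to hold the same value.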

EDIT#2

Rather than one table per client, we work with one database schema per client. Depending on the size of the client, we might have several schemas in a single database instance, or a dedicated database instance per client. We use separate schemas for raw data collection, and for aggregation/reporting for each client. We run multiple database servers, restricting each server to a single database instance. For resilience, databases are replicated across multiple servers and load balanced for improved performance.
